Duplicate records are multiple rows in a database table whose fields are entirely or partially identical. In SQL, they can be identified with the GROUP BY and HAVING clauses and then deleted in several ways:
1. Duplicates fall into two types: complete duplication and partial duplication.
2. Combine SELECT with GROUP BY and HAVING COUNT(*) > 1 to identify duplicates.
3. Use a CTE with the ROW_NUMBER() function to keep one record per group and delete the rest.
4. Completely duplicated rows can be removed by deduplicating into a temporary table and reinserting.
5. Duplicates whose IDs are already known can be deleted directly.
6. Before deleting, back up the data, weigh the performance impact, and decide from the business logic whether the apparent duplicates are legitimate.
Duplicate records are a common and error-prone problem when working with databases. They can skew statistical results, bias data analysis, and even break business logic. Identifying and deleting duplicate records in SQL is therefore a basic but important task.

What is a duplicate record?
A "duplicate record" usually means that a table contains multiple rows whose fields are exactly or partially the same. For example, in a user information table, two records with the same name, email, and phone number are likely duplicate data. Whether rows count as duplicates depends on which fields you define as the unique identifier.

Common types of duplication are:
- Complete duplication: All field values are the same
- Partial duplication: The key fields (such as name, email) are the same, other fields are different
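The two types can be told apart by which columns are grouped. The sketch below is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database, and the `contacts` table and its rows are hypothetical sample data, not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with no id column, so two rows can be identical in every field.
conn.execute("CREATE TABLE contacts (name TEXT, email TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Alice", "alice@example.com", "123456789"),
        ("Alice", "alice@example.com", "123456789"),  # complete duplicate
        ("Bob", "bob@example.com", "111222333"),
        ("Bob", "bob@example.com", "444555666"),      # partial duplicate (phone differs)
        ("Carol", "carol@example.com", "777888999"),
    ],
)

# Complete duplication: every column is part of the grouping key.
complete = conn.execute(
    "SELECT name, email, phone, COUNT(*) FROM contacts "
    "GROUP BY name, email, phone HAVING COUNT(*) > 1"
).fetchall()

# Partial duplication: key columns match, but a non-key column varies within the group.
partial = conn.execute(
    "SELECT name, email, COUNT(*) FROM contacts "
    "GROUP BY name, email HAVING COUNT(DISTINCT phone) > 1"
).fetchall()
```

Here `complete` holds the Alice group and `partial` holds the Bob group, which shows why the choice of key fields decides what counts as a duplicate.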
How to identify duplicate records?
The key to identifying duplicate records is to use GROUP BY with the HAVING clause to find combinations of values that occur more than once.

Suppose there is a table named users with the following structure:
id | name | email | phone
---|------|-------|------
1 | Alice | alice@example.com | 123456789
2 | Alice | alice@example.com | 987654321
3 | Bob | bob@example.com | 111222333
You can find duplicate name and email combinations through the following query:
SELECT name, email, COUNT(*)
FROM users
GROUP BY name, email
HAVING COUNT(*) > 1;
This statement returns each duplicated combination of name and email together with the number of times it appears.
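To try the query without a database server, it can be run against an in-memory SQLite database. This is a minimal sketch that assumes Python's built-in sqlite3 module; the helper name `find_duplicate_keys` is made up for illustration, and the rows are the three sample rows from the users table.

```python
import sqlite3

def find_duplicate_keys(conn):
    """Return (name, email, count) for every combination that occurs more than once."""
    return conn.execute(
        "SELECT name, email, COUNT(*) AS cnt FROM users "
        "GROUP BY name, email HAVING COUNT(*) > 1 "
        "ORDER BY name, email"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "alice@example.com", "123456789"),
        (2, "Alice", "alice@example.com", "987654321"),
        (3, "Bob", "bob@example.com", "111222333"),
    ],
)

duplicate_keys = find_duplicate_keys(conn)
```

Only the Alice combination is returned, with a count of 2; Bob appears once and is filtered out by the HAVING clause.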
If you want to see the full rows behind these duplicates, you can write:
SELECT u.*
FROM users u
JOIN (
    SELECT name, email
    FROM users
    GROUP BY name, email
    HAVING COUNT(*) > 1
) dup ON u.name = dup.name AND u.email = dup.email
ORDER BY u.name, u.email;
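The join can be exercised the same way. The sketch below assumes Python's sqlite3 module and an in-memory copy of the sample users table; the function name `list_duplicate_rows` is hypothetical, and `u.id` is added to the ORDER BY only to make the row order deterministic.

```python
import sqlite3

def list_duplicate_rows(conn):
    """Return every full row whose (name, email) pair occurs more than once."""
    return conn.execute(
        "SELECT u.* FROM users u "
        "JOIN (SELECT name, email FROM users "
        "      GROUP BY name, email HAVING COUNT(*) > 1) dup "
        "  ON u.name = dup.name AND u.email = dup.email "
        "ORDER BY u.name, u.email, u.id"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "alice@example.com", "123456789"),
        (2, "Alice", "alice@example.com", "987654321"),
        (3, "Bob", "bob@example.com", "111222333"),
    ],
)

duplicate_rows = list_duplicate_rows(conn)
```

Both Alice rows come back with their differing phone numbers, which is the information needed to decide which one to keep.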
How to delete duplicate records?
The way you delete duplicate records depends on your needs. Here are a few common practices:
Method 1: Keep one and delete the remaining duplicates
If you want to keep only one record from each group of duplicates, you can use the ROW_NUMBER() function to number the rows in each group and then delete the redundant ones (this requires a database that supports window functions, such as MySQL 8, PostgreSQL, or SQL Server):
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn
    FROM users
)
DELETE FROM users
WHERE id IN (
    SELECT id FROM cte WHERE rn > 1
);
This statement numbers the rows within each duplicate group. Rows numbered greater than 1 are deleted, so the first row of each group, as determined by ORDER BY id, is retained.
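A cautious way to apply this method is to preview the rows that would be removed before running the DELETE. The following sketch assumes Python's sqlite3 module linked against SQLite 3.25 or later (the first version with window functions); the function name `delete_duplicates_keep_first` is hypothetical, and other database engines may need small syntax changes.

```python
import sqlite3

def delete_duplicates_keep_first(conn):
    """Delete all but the lowest-id row in each (name, email) group.

    Returns the ids that were selected for deletion.
    """
    numbered = (
        "SELECT id, ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn "
        "FROM users"
    )
    # Preview: the ids that the DELETE below will remove.
    doomed = [
        row[0]
        for row in conn.execute(f"SELECT id FROM ({numbered}) WHERE rn > 1 ORDER BY id")
    ]
    conn.execute(
        f"WITH cte AS ({numbered}) "
        "DELETE FROM users WHERE id IN (SELECT id FROM cte WHERE rn > 1)"
    )
    conn.commit()
    return doomed

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "alice@example.com", "123456789"),
        (2, "Alice", "alice@example.com", "987654321"),
        (3, "Bob", "bob@example.com", "111222333"),
    ],
)

deleted_ids = delete_duplicates_keep_first(conn)
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
```

With the sample data, row 2 is deleted and rows 1 and 3 remain. Changing the ORDER BY inside the window (for example to `id DESC`) would keep the newest row instead of the oldest.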
Method 2: Delete completely duplicate records
If entire rows are identical, you can deduplicate into a temporary table and then reinsert the result into the original table:
-- Create a temporary table holding the deduplicated data
CREATE TEMPORARY TABLE temp_users AS
SELECT DISTINCT * FROM users;

-- Clear the original table
DELETE FROM users;

-- Insert the deduplicated data back
INSERT INTO users SELECT * FROM temp_users;

-- Drop the temporary table
DROP TABLE temp_users;
Note: This method is suitable for small tables and is less efficient for large tables.
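Because the original table is emptied in the middle of this procedure, it is safer to run the steps as a single transaction. The sketch below is one way to do that, assuming Python's sqlite3 module; the `contacts` table (which has no id column, so rows can be fully identical) and its rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, email TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Alice", "alice@example.com", "123456789"),
        ("Alice", "alice@example.com", "123456789"),  # identical in every column
        ("Bob", "bob@example.com", "111222333"),
    ],
)
conn.commit()

# Run all four steps inside one transaction so the table is never left empty
# if a later step fails.
conn.executescript(
    """
    BEGIN;
    CREATE TEMPORARY TABLE temp_contacts AS SELECT DISTINCT * FROM contacts;
    DELETE FROM contacts;
    INSERT INTO contacts SELECT * FROM temp_contacts;
    DROP TABLE temp_contacts;
    COMMIT;
    """
)

deduplicated = conn.execute(
    "SELECT name, email, phone FROM contacts ORDER BY name"
).fetchall()
```

After the script runs, one Alice row and one Bob row remain. Note that SELECT DISTINCT * only collapses rows that match in every column, so it has no effect on a table where each row has its own id.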
Method 3: Manually delete the record with the specified ID
If you already know which records are duplicates, you can delete the records with specific IDs directly using the DELETE statement:
DELETE FROM users WHERE id IN (2, 4, 6);
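When the ID list comes from a program rather than being typed by hand, bound parameters avoid building the list by string concatenation of values. This is a small sketch assuming Python's sqlite3 module; `delete_by_ids` is a hypothetical helper, and the returned row count makes it easy to notice when some IDs did not exist.

```python
import sqlite3

def delete_by_ids(conn, ids):
    """Delete the rows with the given ids and return how many rows were removed."""
    ids = list(ids)
    if not ids:
        return 0
    placeholders = ", ".join("?" for _ in ids)  # one "?" per id
    cur = conn.execute(f"DELETE FROM users WHERE id IN ({placeholders})", ids)
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "alice@example.com", "123456789"),
        (2, "Alice", "alice@example.com", "987654321"),
        (3, "Bob", "bob@example.com", "111222333"),
    ],
)

deleted_count = delete_by_ids(conn, [2, 4, 6])
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
```

In the three-row sample table only id 2 exists, so one row is deleted even though three IDs were requested.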
Some things to note
- Back up data: Always back up the data before deleting, so that a mistaken deletion can be undone.
- Indexing and performance: Detecting or deleting duplicates on large tables can consume significant resources, so run these operations during off-peak periods.
- Consider business logic: Some "duplicates" may be legitimate, for example one person with two mobile phone numbers, so judge each case against the business rules.
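The backup advice can be combined with a transaction that is committed only if the deletion removes the expected number of rows. The sketch below assumes Python's sqlite3 module; the `users_backup` table name and the `expected_deleted` value are illustrative assumptions, and the DELETE uses a MIN(id) subquery, an alternative to the ROW_NUMBER() method that does not need window functions.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are controlled explicitly with BEGIN/COMMIT/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "alice@example.com", "123456789"),
        (2, "Alice", "alice@example.com", "987654321"),
        (3, "Bob", "bob@example.com", "111222333"),
    ],
)

# Step 1: keep a copy of the table before touching it (hypothetical backup name).
conn.execute("CREATE TABLE users_backup AS SELECT * FROM users")

# Step 2: delete inside a transaction and check the row count before committing.
expected_deleted = 1  # assumed to come from a prior GROUP BY / HAVING check
conn.execute("BEGIN")
cur = conn.execute(
    "DELETE FROM users WHERE id NOT IN "
    "(SELECT MIN(id) FROM users GROUP BY name, email)"
)
if cur.rowcount == expected_deleted:
    conn.execute("COMMIT")
    committed = True
else:
    conn.execute("ROLLBACK")  # unexpected count: undo and investigate
    committed = False

remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
backed_up = conn.execute("SELECT COUNT(*) FROM users_backup").fetchone()[0]
```

A table copy like this is only a convenience for small tables; for production data a proper database backup is the safer choice.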
That is basically it. Identifying and deleting duplicate records is not complicated, but the details are easy to overlook, so proceed with caution, especially in a production environment.