Ask the Expert

Removing duplicate records

My database has several duplicate records, and I want to remove those duplicate records, despite the fact that one of those will remain in the database.

    Requires Free Membership to View

The fact that you have duplicate rows indicates you have no primary or unique key which is also an indication of bad database design. The easiest way to solve this problem is to add a unique key to your table, either temporarily or permanently, so that individual rows can be identified.

So if for example you have a table with the following structure

 CREATE TABLE tblDuplicates ( col1 INT, col2 CHAR(2), col3 MONEY )
And you INSERT the following duplicate rows within your database:
 INSERT tblDuplicates VALUES(1,'AB',123) INSERT tblDuplicates VALUES(1,'AB',123) INSERT tblDuplicates VALUES(1,'AB',123) INSERT tblDuplicates VALUES(2,'CD',456) INSERT tblDuplicates VALUES(2,'CD',456) INSERT tblDuplicates VALUES(3,'EF',789) INSERT tblDuplicates VALUES(4,'GH',912)
When we SELECT from this table we can see more clearly that we have a problem.
 SELECT * FROM tblDuplicates col1 col2 col3 ----------- ---- --------------------- 1 AB 123.0000 1 AB 123.0000 1 AB 123.0000 2 CD 456.0000 2 CD 456.0000 3 EF 789.0000 4 GH 912.0000
However now if we add a column to our table that contains a unique value for every row we will be able to identify each row individually. The easiest way to do this is with an IDENTITY column.
 ALTER TABLE tblDuplicates ADD colUnique INT IDENTITY(1,1)
Now when we SELECT from our table we can see that the new column in each row, including the duplicate rows, has a unique value.
 SELECT * FROM tblDuplicates col1 col2 col3 colUnique ----------- ---- --------------------- ----------- 1 AB 123.0000 1 1 AB 123.0000 2 1 AB 123.0000 3 2 CD 456.0000 4 2 CD 456.0000 5 3 EF 789.0000 6 4 GH 912.0000 7
Now we can identify each row individually we can do something about removing the duplicated rows. The following DELETE statement removes all rows that have columns that match exactly with another row, except the row that has the minimum colUnqiue value (thereby always leaving one row remaining after the delete for each set of duplicated rows).
 DELETE tblDuplicates FROM tblDuplicates td1 WHERE colUnique<>( SELECT MIN(colUnique) FROM tblDuplicates td2 WHERE td1.Col1 = td2.Col1 AND td1.Col2 = td2.Col2 AND td1.Col3 = td2.Col3 )
We can now remove our temporary unique column:
 ALTER TABLE tblDuplicates DROP COLUMN colUnique
And again we SELECT from our table to retrieve all new set of non-duplicated rows.
 SELECT * FROM tblDuplicates col1 col2 col3 ----------- ---- --------------------- 1 AB 123.0000 2 CD 456.0000 3 EF 789.0000 4 GH 912.0000
However as our table doesn't enforce uniqueness this situation is likely to happen again. You best bet is to rework you database design to include a permanent primary key on the table so you are always able to individually identify rows. And always, always make sure you have a good backup before running this type of operation on your tables as it is easy to make a mistake.

This was first published in June 2002

There are Comments. Add yours.

 
TIP: Want to include a code block in your comment? Use <pre> or <code> tags around the desired text. Ex: <code>insert code</code>

REGISTER or login:

Forgot Password?
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy
Sort by: OldestNewest

Forgot Password?

No problem! Submit your e-mail address below. We'll send you an email containing your password.

Your password has been sent to: