The sp_FindDuplicateRows routine finds cases where multiple rows match across a specified set of criteria (columns) in a selected table.
The sp_FindDuplicateRows stored procedure is useful for finding duplicate data. In order for the routine to identify rows that are actually duplicates from a business perspective, it's important to specify an appropriate column list (the @DBThere parameter). A column or a list of columns that represents an alternate key is often a good choice. For example, a name column would not be used as the primary key in this architecture and it may not be practical to place a unique constraint on such a column. However, it may be typical to look for duplicates in such a column. Any column or any list of columns that usually (but not always) holds unique data might be of interest to check.
The proponents of natural data keys might say that the sp_FindDuplicateRows routine is necessitated by the use of surrogate keys. They seem to believe that natural data keys would ensure no duplicate rows could exist. That's nonsense. Any column (or set of columns) that would be the primary key in a natural data key system could easily be the target of a unique constraint in a surrogate key system. The systems would be equally susceptible to duplicate rows. A lot of data (possibly the vast majority of data) is entered into databases by humans, and humans make mistakes. An employee table could have two rows with the same name and address where the Social Security number differs by one digit. It does not matter whether the Social Security number is the primary key or the target of a unique constraint (with an IDENTITY value as the primary key). A book table could have two rows with the same title and author where the ISBN differs by one digit. It does not matter whether the ISBN is the primary key or the target of a unique constraint (with an IDENTITY value as the primary key). The use of check digits can reduce the number of such errors, but it does not eliminate them. Both natural data key systems and surrogate key systems can contain duplicate rows.
CREATE PROCEDURE dbo.sp_FindDuplicateRows @DBTable varchar(100), @DBWhere varchar(2000) = NULL, @DBThere varchar(2000) = NULL, @DBUltra bit = 0
The first parameter (@DBTable) specifies the table to be analyzed. The table must follow the rules of the surrogate key architecture suggested in my earlier article.
The second parameter (@DBWhere) is optional and it provides a way to limit the rows to be analyzed. The parameter value is used as a WHERE clause.
The third parameter (@DBThere) is optional and it specifies the criteria (columns) to be considered for matching. The parameter value, typically a list of column names separated by pipes (vertical bars), is used to form a GROUP BY clause and an ORDER BY clause. Expressions involving columns can be used in place of column names in the list. The default is to use all columns with "Name" in their name or all columns except the first column.
The fourth parameter (@DBUltra) is optional and it determines whether summary information is returned or matching rows are returned. A value of zero (0) means the match criteria and match counts are returned and a value of one (1) means the matching rows themselves are returned along with reference information.
This example finds duplicate rows in the Product table in my example database. It returns a summary of the findings. The column used in this example does not represent an actual duplication of data from a business perspective, but it demonstrates how the stored procedure works.
EXECUTE sp_FindDuplicateRows 'Product',NULL,'VendorID',0
This example finds duplicate rows in the Product table in my example database. It returns a summary of the findings. The combination of columns used in this example returns an empty result set because there is no duplication of data in those columns.
EXECUTE sp_FindDuplicateRows 'Product',NULL,'VendorID|Description',0
This example finds duplicate rows in the Product table in my example database. It returns a result set including the rows found to be duplicates. The combination of a column and an expression used in this example does not represent an actual duplication of data from a business perspective, but it demonstrates how the stored procedure works.
EXECUTE sp_FindDuplicateRows 'Product',NULL,'VendorID|LEFT(Description,7)',1
Utilities to maintain a surrogate key architecture
Part 1: What are surrogate key values
Part 2: T-SQL code to create stored procedures
Part 3: Utility routine: sp_CheckRowCounts
Part 4: Utility routine: sp_CheckKeyValues
Part 5: Utility routine: sp_ResetNextKeyValue
Part 6: Utility routine: sp_OrganizeKeyValues
Part 7: Utility routine: sp_FindDuplicateRows
Part 8: Utility routine: sp_ChangeParentValue
|ABOUT THE AUTHOR:|
| Brian Walker
Brian Walker is a senior database architect in an IS department that uses SQL Server 2000 and the .NET Framework. He has more than 25 years of experience in the IT industry with the last several years focused on databases and SQL Server. Walker is a software developer, database developer, database administrator and database consultant. He develops utility software as a hobby, including a large collection of SQL Server utilities.
Copyright 2006 TechTarget