Why use surrogate keys

Contributor Brian Walker explains the benefits of using a surrogate key database architecture, such as stability, performance, simplicity and consistency.

The alternative to surrogate keys is natural data keys. In order to understand the many substantial benefits of surrogate keys it's necessary to discuss some background on the issues involved.Some say that a natural data key system has the advantage of inherent enforcement of referential integrity. They say it's simply impossible to insert a child row that's a mismatch with the parent row. Is it really? Some say that a surrogate key system has the downfall of providing no referential integrity. They say it's alone in being susceptible to inserting a child row that's a mismatch with the parent row. Is it really?

Envision an application with a grid of parent rows. Click on a parent row to enter a new child row. The entry form displays the appropriate parent row data and accepts input of child row data. Click "Save" and all is well, right? Yikes! The developer has a bug in his or her code. The application prepares to save the child row and accidentally grabs data from the parent row immediately above the correct row in the grid. The new child row contains a foreign key connecting it to the wrong parent row! Does it make any difference at all whether the data grabbed incorrectly is a 4-column natural data key or a 1-column surrogate key?

The proponents of natural data keys say we should use industry standard codes as primary keys. That's supposed to minimize the need to change key data and possibly minimize the width of foreign keys. I think it would be highly unlikely to find industry standard codes for anything more than a few high level tables in most databases. It's quite probable that most of the lower level tables would have composite primary keys. The lowest layers could easily have composite primary keys consisting of many columns.

I work with a mission-critical corporate database every day. The database is only modest size, but yet it's up to 14 levels deep. There are few applicable industry standard codes. Most of the lower level tables would require very wide natural data primary keys consisting of several columns. The table that contains the largest number of rows would require a composite primary key consisting of six columns.

These issues would be worrisome if the corporate database used natural data keys. It's not. It's using single column surrogate keys. Every table has an integer column with the IDENTITY property and that column is the primary key. Every foreign key is the single column primary key from the parent table. Every join between a parent table and a child table can be coded with one predictable column from each table. This architecture provides substantial benefits. The primary key values never change, so foreign key values are stable. All of the primary key and foreign key indexes are compact, which is very good for join performance (there are some additional indexes on selected natural data columns). The SQL code expression of a relationship between any two tables is simple and very consistent.

I believe that surrogate keys have stability and performance advantages over natural data keys in most current relational database systems. However, it's the consistency of universally applied surrogate keys that makes the architecture so appealing to me. A given query in a natural data key architecture might join together eight tables with the join structure being radically different between each pair because of how the relationships (foreign key references) are defined. The same query in a surrogate key architecture might join together 10 tables, but the join structure would be short and predictable between each pair.

The consistency of a surrogate key architecture makes life easier for developers. I believe they can write data retrieval code faster and with fewer bugs. With surrogate keys there is only one column from each table involved in most joins, and that column is obvious. With natural data composite keys there are several columns from each table involved in most joins, and those columns are not obvious. Surely that must make it more tedious to write the join structure and more difficult to get it correct; forgetting to include one column in the join could have disastrous consequences. Many DBAs become indignant with the thought of making architecture decisions with developers in mind. I see nothing wrong with decisions that do not compromise the database and allow developers to be more productive.

The consistency of a surrogate key architecture makes it practical to have generic routines that perform very powerful data manipulation for maintenance purposes. Among many other things, such routines allow DBAs to perform bulk operations (copy/delete/merge) on sets of relational data with a simple stored procedure call. The popular database utility applications do not even attempt such operations because many of the existing databases do not possess sufficient consistency in their architecture. A consistent architecture also makes it easier to create a variety of code generation and administrative tools.

Envision yourself as a DBA for a company. Your company has other companies as customers. Each customer has several contacts. Each contact has several addresses and phone numbers. Each phone number has a call log. Each customer places many orders. Each order includes products. Some products have a service log. Each log entry can involve several technicians. Each technician … Well, you get the idea. Imagine that there are 10 levels of descendent tables below customers in your database. Now suppose one of the customer companies goes out of business and you want to copy their data into an archive database and remove it from the production database. How would you perform the copy? I have asked several database professionals such a question and I have received a variety of answers. Some suggested using DTS. Some write SQL code. Others use an INSERT statement generator. All of these are painful options. I'm aware of only one commercial product that claims to be able to perform such a copy, and that product is extremely expensive. Here's what I would do with our corporate database:

EXECUTE CopyBranch 'Archive','Production','Customer','CustomerID',@CustomerID

The CopyBranch stored procedure is a homegrown generic routine that can copy any branch of our corporate database. In this example it starts with a single row in the Customer table and copies all associated rows from all descendent tables. It does not matter how many tables are involved or how deeply the structure is layered. This routine is a very powerful tool for us. We can easily copy subsets of related data into an archive database, a development database, a QA database or a demo database. This capability was made possible by having a very consistent key architecture.

Now that I have provided some context I will describe a suggested surrogate key architecture. The benefits are equally applicable to relational databases in all environments.

A surrogate key architecture for powerful database operations

More advice: Framework to support a surrogate key architecture

Brian Walker is a senior database architect in an IS department that uses SQL Server 2000 and the .NET Framework. He has more than 25 years of experience in the IT industry with the last several years focused on databases and SQL Server. Walker is a software developer, database developer, database administrator and database consultant. He develops utility software as a hobby, including a large collection of SQL Server utilities.

Dig Deeper on SQL Server Database Modeling and Design