Clustered indexes in SQL Server are a critical consideration in the overall architecture of the database. They...
are often overlooked, misunderstood or, if the database is small, considered unimportant.
This article points out the importance of clustered indexes for overall system performance and maintenance as your database grows. I will briefly cover how SQL Server clustered indexes are stored on disk, why they should always increase over time and why it is best that clustered indexes be static. I'll also touch on many-to-many tables, why they are used and how clustered indexes make these tables more efficient.
Finally, it is absolutely critical that we touch on the new SQL Server 2005 partitioned table concept and examine how partitioned tables affect clustered indexes. This will help you make the right decisions from the very start.
Clustered indexes are created by default to match the primary key, which is defined on tables in SQL Server. However, you can create a clustered index on any column and then define a primary key on a separate column or columns. At this point, the primary key would be created as a unique non-clustered index. Typically, a clustered index will match the primary key, but not necessarily, so be careful. Given the variety of situations that can arise, I'll be discussing the clustered indexes themselves, and for now ignore whether you choose to make them primary keys.
Clustered indexes actually hold the row data for SQL Server, so wherever your clustered indexes are stored is also where your data is stored. The clustered indexes are organized into ranges of data. For example, values 1 to 10 may be stored in one range and 90 to 110 in another range. Since clustered indexes are stored as ranges, if you need to do a search on a range for an audit log, it would be more efficient for the clustered index to be based on the date column that would be used to return the date ranges. Non-clustered indexes work better for specific value searches, e.g. "date = DateValue," rather than range searches, e.g. "date between date1 and date2."
Ever-increasing values for clustered indexes
Clustered indexes should be based on columns whose values constantly increase over time. In my prior example on using the date column from an audit log, the date values for an audit log would be constantly increasing and older dates would not be inserted into the table. This would be an "ever-increasing" column. Another good example of an ever-increasing value is an identity column, since, by design, it constantly increases.
Why am I spending so much time discussing ever-increasing values for clustered indexes? The most important attributes of clustered indexes is that they are ever-increasing and static in nature. The reason ever-increasing is so important has to do with the range architecture I outlined earlier. If the values are not ever-increasing, then SQL Server has to allocate space within existing ranges for those records rather than placing them in new ranges at the end of the index.
If the values are not ever-increasing, then once the ranges fill up and a value comes in that fits within a filled up index range, SQL Server will make room in an index by doing a page split. Internally, SQL Server takes the filled up page and splits it into two separate pages that have substantially more room at that point but take significantly more resources to process. You can prepare for this eventuality by setting a fill factor of 70% or so, which gives you 30% free space for incoming values.
The problem with this approach is that you continually have to "reindex" the clustered index so it maintains a free space percentage of 30%. Reindexing the clustered index will also cause heavy I/O load since it has to move the actual data itself and any non-clustered indexes have to be rebuilt, adding greatly to maintenance time.
If the clustered index is ever-increasing, you will not have to rebuild the clustered index; you can set a 100% fill factor on the clustered index, and at that point you will only need to reindex the less-intensive, non-clustered indexes as time progresses, resulting in more up time.
Ever-increasing values will only add entries to the end of the index and build new ranges when necessary. Logical fragmentation will not exist since the new values are continually added to the end of the index and the fill factor will be 100%. The higher the fill factor, the more rows are stored on each page. Higher fill factors require less I/O, RAM and CPU for queries. The smaller the data types you pick for the clustered index, the faster the joins/queries will be. Also, since each non-clustered index requires it to contain the clustered index key, then the smaller the clustered index key and the smaller the non-clustered indexes will be.
The best data types for clustered indexes are generally pretty narrow. Referring to data type size, it's typically a smallint, int, bigint or datetime. When datetime values are used as the clustering index, they are the only column and are normally ever-increasing date values that are often queried as range data. Generally, you should avoid compound (multiple columns) clustered indexes except in the following situations: many-to-many tables and SQL Server 2005 partitioned tables that have the partitioning column included as part of the clustered index to allow for index alignment.
Many-to-many tables and clustered indexes
Many-to-many tables are used for their extremely fast join capabilities and their ability to allow for quick re-association of records, from one owning record to another. Consider the following structure:
|CustomerID (bigint identity)||Name||Fieldn+|
|OrderID (bigint identity)||Date||Fieldn+|
The clustered indexes in this structure would be CustomerID, OrderID. The compound would be CustomerID/OrderID. Here are the benefits with this structure:
- The joins are all based on clustered indexes (much faster than joins to non-clustered indexes).
- Moving an order to another customer only involves an update to the CustomerOrder table, which is very narrow, with only one clustered index. Therefore, it reduces the blocking that would occur if you had to update a wider table such as Orders.
- Use of a many-to-many table eliminates the need for some non-clustered indexes on the wider tables such as Customer/Orders. Hence, it reduces the maintenance time on the large tables.
One negative result of this approach is the fragmentation that occurs on the CustomerOrder table. However, that should not be a big issue, since the table is relatively narrow, has only two columns with narrow data types and only one clustered index. The elimination of the non-clustered indexes, which would be needed on the Orders table if it contained CustomerID, more than makes up for this cost.
Clustered indexes and partitioned tables in SQL Server 2005
Partitioned tables in SQL Server 2005 are tables that appear to be a single table on the surface, but behind the scenes ¬-- at the storage subsystem level -- they are actually multiple partitions that can be spread across many filegroups. The table partitions are spread across various filegroups based on the values in a single column. Partitioning tables in this manner causes several side effects. I will just cover the basics here, to give you some understanding of the factors involved. I recommend that you study partitioned tables before attempting to implement them.
You can create a clustered index in this environment based on only one column. But, if that one column is not the column the table is partitioned on, then the clustered index is said to be non-aligned. If a clustered index is non-aligned, then any snapping in/out (or merging) of partitions will require you to drop the clustered index along with the non-clustered indexes and rebuild them from scratch. This is necessary because SQL Server cannot tell what portions of the clustered/non-clustered indexes belong to which table partitions. Needless to say, this will certainly cause system downtime.
More on performance tuning and SQL Server indexes
The clustered index on a partitioned table should always contain the regular clustering column, which is ever-increasing and static, as well as the column that is used for partitioning the table. If the clustered index includes the column used for partitioning the table, then SQL Server knows what portion of the clustered/non-clustered indexes belong to which partition. Once a clustered index contains the column that the table is partitioned on, then the clustered index is "aligned." Partitions can then be snapped in/out (and merged) without rebuilding the clustered/non-clustered indexes, causing no downtime for the system. Inserts/updates/deletes will also work faster, because those operations only have to consider the indexes that reside on their particular partition.
SQL Server clustered indexes are an important part of database architecture and I hope you've learned enough from this article to know why you need to carefully plan for them from the very start. It is vital for the future health of your database that clustered indexes be narrow, static and ever-increasing. Clustered indexes can help you achieve faster join times and faster IUD operations and minimize blocking as the system becomes busy.
Finally, we covered how partitioned tables in SQL Server 2005 affect your choices for the clustered index, what it means to "align" the clustered index with the partitions, and why clustered indexes have to be aligned in order for the partitioned table concept to work as intended. Keep watching for tips on non-clustered indexes (part two) coming in February and optimal index maintenance (part three) in March.
SQL Server clustered and non-clustered index design
- Part 1: SQL Server clustered index design for performance
- Part 2: Designing SQL Server non-clustered indexes
- Part 3: How to maintain SQL Server indexes
ABOUT THE AUTHOR
Matthew Schroeder is a senior software engineer who works on SQL Server database systems ranging in size from 2 GB to 3+ TB, with between 2k and 40+k trans/sec. He specializes in OLTP/OLAP DBMS systems as well as highly scalable processing systems written in .NET. Matthew is a Microsoft certified MCITP, Database Developer, has a master's degree in computer science and more than 12 years of experience in SQL Server/Oracle. He can be reached at firstname.lastname@example.org.