Many SQL Server developers do not give data types a lot of thought when they design database schemas. Initially it doesn't seem to matter much and the performance looks pretty good, but, over time, the system slows down. Let's take a look at the various SQL Server data types, the effects they have on performance, and what you should consider when choosing a data type.
Optimizing data types means choosing them carefully, as they have a huge impact on I/O, CPU and RAM consumption. A larger data type takes up more room in the cache (RAM) resulting in a smaller hit ratio on the cache, which causes more fetches from disk (more I/O). This results in a heavier CPU hit, longer update times, increased lock times and increased contention on the table, which will show up as blocks. Indexes defined on larger data types take longer to scan/seek because more pages have to be read, which will impact RAM, CPU and I/O. Larger data types also increase maintenance times on the various indexes, yielding longer "maintenance windows."
Primary key considerations
Clustered indexes default to the primary key fields in SQL Server. That said, several factors are important with data types. For this discussion, we will assume that your primary key is the same as the clustered index, since that is the preferred approach.
Any data type you pick for the primary key (usually also the clustered index) is included within every non-clustered index as a reference back to
the original row. For example, if you pick the data type datetime as the primary key rather than an int type, any non-clustered indexes will be an additional 4 bytes larger per row because the non-clustered index has to hold the primary key value. In tables with more than two billion rows, this can quickly add more than 2.4 GB per non-clustered index. You'll see the impact on CPU, I/O and RAM quickly accumulate in this scenario.
Smaller data types make it easier -- i.e., faster and less resource intensive -- for SQL Server to ensure that any inserted or updated values are unique against the existing table values.
Individual data type considerations
We will just cover the commonly misused data types here.
- Datetime – Usually when storing a datetime value, everyone defaults to the datetime data type without considering smalldatetime. The datetime type is stored as two ints accurate to 3.33 milliseconds with a range of Jan 1, 1753-Dec 31, 9999. However, for many database operations, this kind of range and precision is not necessary. The smalldatetime data type is stored as one int, accurate to one minute with a range of Jan 1, 1900 – Jun 6, 2079.
- Many datetime values can be stored as a smalldatetime when the accuracy requirements do not have to be precise to the sec/ms and the dates are generally limited to date ranges within a few decade ranges. The benefits of using smalldatetime are apparent, since at 50% space saved, a table is much more efficient. Datetime data types should not be used within a primary key unless frequent searches are done on date ranges.
- Bit vs. Char(1) – One bit will take up 1 byte, 8 bits will still take up 1 byte, and a char(1) takes 1 byte. It could be argued that using a char(1) to store Y/N is more readable, but it is far more efficient to use a bit column if the table contains more than 1 bit column. You'll also eliminate the possibility of someone trying to put an A/B/C/etc. in the char(1) field, leading to incorrect data/functionality. Generally, it is considered better practice to use the bit data type. Even if the table only contains a 1 bit column, it will allow you to add additional bit columns in the future. There will be no need to modify the data type/data on the existing char(1) column in order to take advantage of the optimal data storage configuration.
- Integer – Integer types consist of tinyint (0-255, 1 byte), smallint (-32,768-32,767, 2 bytes), int (–2,147,483,648-2,147,483,647, 4 bytes) and bigint (–9,223,372,036,854,775,808-9,223,372,036,854,775,807, 8 bytes). These data types are often used in primary keys and non-clustered indexes, so it's important to pick the smallest data type available from the list.
- It's tempting to just pick a large data type in SQL Server, but it's better to plan for immediate needs in the next 5 to10 years. You can always convert to a large data type later when CPUs, I/O and RAM have more bandwidth to support larger types. Also, since this data type is often used in primary keys (and consequently non-clustered index keys), the smaller the data type, the less work SQL Server will have to do during index maintenance.
- UniqueIdentifier – This data type consists of 16 byte storage, so it's very wide and is used to store GUIDs. It is commonly used as a primary key due to its unique nature. It is generally easier to merge two (or more) existing databases that have UniqueIdentifiers as primary keys as opposed to int types. The latter would require every row to be adjusted for a database merge.
With that in mind, be careful with this data type, since often you will have a UniqueIdentifier as a primary key along with several UniqueIdentifier foreign keys, making all the indexes very large, resource intensive and much more difficult to maintain. The increased row/index sizes also increase the overall load on the server. UniqueIdentifiers make it harder for SQL Server to check for "uniqueness" on inserts/updates, and that puts more pressure on the server. Generally, for high-transaction OLTP systems, you are better off avoiding the use of UniqueIdentifiers as primary keys.
When this data type is used as a primary key, you should use the NewSequentialID() rather than the NewID() function. That's because NewSequentialID always grabs a sequential ID, resulting in a clustered index that is "ever-increasing," which eliminates fragmentation on the clustered index. The NewID() function will cause heavy fragmentation on a clustered index due to page splits.
- Decimal – In SQL Server 2005 SP2 and greater (Enterprise, Developer and Evaluation Editions only), you can turn on the vardecimal option (database properties/Options/VarDecimal Storage Format Enabled). The vardecimal option varies the storage required for the decimal data type, depending on the actual data stored in the column. You can only turn it on at the database level rather than the table level.
- In order to determine the space that will be saved on a particular table, run the sp_estimated_rowsize_reduction_for_vardecimal (located in master) system stored procedure. No changes need to be made on the column to use vardecimal and it will have no functional impact on code, since it is merely performance-related functionality.
- (n)Char(n) vs. (n)Varchar(n) – If a text value is always fixed in length, then it's best to use char(n). Varchar data types require an additional 2 bytes of storage to store offset information, so that space is wasted if the text size is fixed.
- (n)Varchar(n), (n)Varchar(Max), VarBinary(n), VarBinary(max), (n)Text, and Image-- When you plan to store large values, it is very important to pick the correct data type. These data types have the potential to be an issue, depending on the sizes chosen. If you expect the text/data size to be fairly small, then it is recommended that you store the data on the data row, meaning you would want to use (n)Varchar(n) or VarBinary(n), which makes it subject to the 8k limit. If the data is potentially large, then (n)Varchar(Max) or VarBinary(max) should be used, as the VarBinary(max), (n)Varchar(Max), (n)Text, and Image data types can be stored "out of row."
Storing large object data "out of row" means that the regular data row merely contains a pointer to another data row that stores the large object data, thus eliminating the 8k limit for the large object and resulting in smaller data row sizes. If a query does not require/return the large object, then it just scans/seeks on the smaller data row returning results much faster. It's able to transverse fewer data pages than if the large object was stored "in row."
If a query does require/return the large object, then it scans/seeks for the smaller data row and uses the pointer on that row to return the large object. You are better off not using the (n)Text or Image data types, replacing them with (n)Varchar(Max) and VarBinary(max), respectively.
If you choose to use the VarBinary(max), (n)Varchar(Max), (n)Text, or Image data type then it is best if the option to store the value "out of row" is chosen. This setting is controlled by calling the system stored procedure sp_tableoption N'MyTable', 'large value types out of row', 'ON'.
The size of the data rows and indexes impacts a wide range of hardware resources. SQL Server performance will increase exponentially as you reduce the size of the rows and indexes. A few fast points to remember: Always pick the smallest data type, use bits and integers (the smaller the int type the better) with lookup tables where possible – trying to avoid the use of (n)varchars.
If you have to use large text/binary values, then consider defining them as (Max) types and setting the options to store the data "out of row." If the text data is fixed length, then always use (n)char(n) data types so 2 bytes is not wasted storing offsets. If your primary concern is performance, avoid using any data types other than ints for primary/foreign keys, as this will significantly reduce your data row and index sizes. SQL Server data types seem like a small issue at design time, but they will have a huge impact on your system as it grows organically over time.
ABOUT THE AUTHOR
Matthew Schroeder is a senior software engineer working on SQL Server database systems, ranging in size from 2 GB to 3+ TB, with between 2k and 40+k trans/sec. Matt currently works for the gaming vendor, IGT, providing services to gaming companies. He also works as an independent consultant, specializing in SQL Server, Oracle and .NET for industries such as gaming, automotive, e-commerce, entertainment, banking and non-profit. Matt specializes in OLTP/OLAP DBMS systems as well as highly scalable processing systems written in .NET. He is a Microsoft certified MCITP: Database Developer, has a masters degree in Computer Science, and has more than 12 years of experience in SQL Server/Oracle. He can be reached at [email protected]