Statistics are a very important and much unappreciated part of Microsoft SQL Server. In basic terms, statistics are metadata about the data within a table and they're created and updated on a column-by-column basis, not on the entire table. The database engine employs statistics when generating execution plans that are used to access data stored within the database. For that reason, having accurate and up-to-date statistics is crucial.
In contrast, if you have invalid or out-of-date statistics, it can lead the SQL Server engine to take the wrong path to the data, and taking the wrong path means that an index scan is made when an index seek would have been appropriate or a seek is performed against the wrong index. Even worse, it would perform a table scan instead of any index operation at all.
Auto create statistics and auto update statistics
Each database has two options related to statistics: auto create statistics and auto update statistics. Typically, both should be left enabled for databases. Statistics are automatically updated when certain thresholds within SQL Server are met, i.e., the number of rows in the table increases or decreases by 10% of the number of rows the statistic was based on. Once the threshold is reached, SQL Server will automatically update the statistic(s) defined in the table.
If you have a table that stays at approximately the same number of rows, but data is added and removed on a regular
Manually updating statistics
There are two ways to manually update statistics on a table. One is via the sp_updatestats system stored procedure. The second is via the UPDATE STATISTICS command. While the sp_updatestats procedure is an easier way to program the command, the UPDATE STATISTICS command is much more flexible and much more powerful. By default, statistics are compiled based on a sampling of the data in the table. The sampling amount varies depending on the size of the table, but it can be as little as just a few percent to as much as 100% of the table. Typically, the default sampling ratio is enough to generate good execution plans. But, there are exceptions to this.
For example, my colleagues and I once had a large table with approximately 12 million records in the table. The number of records in the table was fairly static. However,
approximately 1.2 million records were deleted each morning and throughout the day a new 1.2 million records were added. The primary key for this table was an identity column with a data type of INT. The other indexed column also had a data type of INT. The bulk of the table's queries were done against the secondary index. We noticed that after we deleted the data and did a normal UPDATE STATISTICS command, the statistics would be useful for about three to four hours. At that point, SQL Server would start doing index scans instead of index seeks. The massive number of queries would send SQL Server's CPU to 100% utilization.
We decided to start updating the statistics using the FULLSCAN flag. This gave SQL Server a more realistic profile of the table, and we were able to run the UPDATE STATISTICS command only once per day instead of having to run it throughout the day when using the default sampling rate.
How to update statistics in SQL Server
As I mentioned earlier, there are two ways to update statistics. There's the sp_updatestats procedure that accepts a single input parameter called @resample. The @resample parameter triggers the procedure to get a fresh sample of records from the table, instead of using the prior sample of records. Even if you use the @resample parameter the ratio of records sampled to the total number of records in the table remains the same. In order to change this ratio, you need to use the UPDATE STATISTICS command with the SAMPLE parameter. The sp_updatestats procedure has the same effect as running the UPDATE STATISTICS command using the ALL keyword.
The exception is that the sp_updatestats procedure checks the internal system tables and looks to see if the table needs its statistics updated. If the table does not, then the table is skipped. In addition to updating the statistics on user tables, the procedure also checks system tables for stale statistics and the statistics are updated as needed.
The prefered method for updating SQL Server statistics is the UPDATE STATISTICS command. This command allows much greater flexibility when programming, however it does require more initial programming. When you fine-tune UPDATE STATISTICS to perform within your environment, you will find that it is time well spent.
When fine-tuning the UPDATE STATISTICS command, you can choose from several options. They are the FULLSCAN, SAMPLE (PERCENT, ROWS), RESAMPLE, ALL | COLUMNS | INDEX, and NORECOMPUTE options. The first two are fairly self-explanitory. The FULLSCAN performs a full scan of the index or column in question and uses all values in the column when generating the statistic. The SAMPLE option allows you to recommend the percentage of rows or the number of rows to use as the sample when generating the statistic.
You'll notice that it says recommend in the previous sentence. While SQL Server will attempt to use your recommended setting, if it finds that there are not enough data pages being processed for the statistic, it will automatically increase the number of pages to approximetly 1,000 data pages. If the SAMPLE size is set to zero (either ROWS or PERCENT), an empty statistic is created.
The RESAMPLE switch has the same effect as the sp_updatestats procedure. It tells the UPDATE STATISTICS command to resample the records for the statistic. If you have a table that gets new records often, it is important to include the RESAMPLE flag when forcing statistics to be updated. This ensures that your new records are available for sampling. The ALL | COLUMNS | INDEX option specifies which statistics to update. In some cases, only your column statistics need updating, while in other cases only your index statistics need updating. In some cases, both will need updating.
When using the UPDATE STATISTICS command, you must specify the table or view where statistics need updating. You also have the option to specify the index or statistic name to update. If no index or statistic is named, all statistics are updated (unless restricted with the ALL | COLUMNS | INDEX option).
UPDATE STATISTICS TableName WITH SAMPLE 25 PERCENT, ALL
False SQL Server statistics
You may find that you need to test your execution plans on large tables, but you do not have large tables against which to test them. Microsoft includes a
couple of "undocummented" options for the UPDATE STATISTICS command that you can use to force SQL Server to generate statistics that make SQL Server think there is more data in the table than there actually is. These options are ROWCOUNT and PAGECOUNT. As they suggest, these options allow you to trick SQL Server into thinking that there is a specified number of records in the table or a specific number of pages being scanned for the statistic.
Undocumented options should never be used in a production environment. They should be used in testing only, so you can check what your execution plans look like once the tables are fully populated. Attempting to use these statistics in a production environment can lead to unexpected and unwanted results.
UPDATE STATISTICS TableName WITH ROWCOUNT=100000
In the example above, I take a table with only a few records in it and generate statistics for a table with 100,000 records. Keep in mind that these options are not supported and will produce strange results from time to time.
Viewing SQL Server statistic results
The DBCC SHOW_STATISTICS command will show you the actual data from the statistic, which can be useful when troubleshooting performance problems. When we create our sample table:
CREATE TABLE TableName
(C1 int primary key)
And we load the table with sample records:
insert into TableName
We then update the statistics on this table (you may need to change the name of the statistic):
UPDATE STATISTICS TableName
When we view the statistic with DBCC SHOW_STATISTICS:
We get three result sets back:
The first shows basic information about the statistic:
|Name||Updated||Rows||Rows Sampled||Steps||Density||Average Key Length||String Index|
|PK__TableName__117F9D94||Aug 20 2007 8:28PM||1813||1813||177||1||4||NO|
The second result set gives us the scan density and average data length of the records for each column in the statistic:
|All Density||Average Length||Columns|
The third result set is the most important. It tells us the ranges of the data and the number of rows per range. (The table truncated to a few sample rows for brevity).
The RANGE_HI_KEY shows the higest value for that key range. The RANGE_ROWS shows the number of records between the RANGE_HI_KEY value and the new RANGE_HI_KEY value. The EQ_ROWS is the estimated number of rows matching the RANGE_HI_KEY value. The DISTINCT_RANGE_ROWS value is the estimated number of distinct values between the current RANGE_HI_KEY and the next RANGE_HI_KEY. The AVG_RANGE_ROWS is the estimated number of duplicate values between the current RANGE_HI_KEY and the next RANGE_HI_KEY.
If you have wide ranges with few records, you may need to adjust the scan percentage of your index.
As you may have gathered from this tip, statistic updating and tuning can be a very complex task. When done correctly, it can provide that extra bit of performance that the system may not have provided under the default statistic configuration. These techniques allow you to boost SQL Server performance and get a return on every dollar invested in your SQL Server infrastructure.
ABOUT THE AUTHOR
Denny Cherry has over a decade of experience managing SQL Server, including MySpace.com's over 175 million user installation, one of the largest in the world. Denny's areas of technical expertise include system architecture, performance tuning, replication and troubleshooting. Denny uses these skills on a regular basis in his current role as a Sr. Database Administrator and Architect at Awareness Technologies. Denny is a longtime member of PASS and Quest Software's Association of SQL Server Experts and has written numerous technical articles on SQL Server management.
This was first published in October 2007