Performance tuning is the ultimate responsibility of every valuable SQL Server administrators. Although ensuring data security and availability are our topmost goals, if the applications don't perform to the users' expectations, DBAs often get the credit, or blame, for poor design and implementation. SQL Server 2005 offers numerous improvements in terms of database performance, not the least of which is table partitioning. If you haven't examined the table partitioning feature, do yourself a favor and take time to read this tip.
The concept of partitioning tables is not new; if you've worked as a SQL Server DBA for a while, you've probably had to archive data in a heavy transactional table once the historical data is no longer accessed frequently. For example, if you have a time reporting application, your reports probably query records from the year 1995 very seldom; most budgetary projections will be based on records from more recent years.
With earlier releases of SQL Server, you could create multiple tables. Each table would consist of the same columns and store data from different years. Then you could create a view which queried each of these tables, if accessing historical data was necessary. Having multiple tables is handy because scanning through small tables will be quicker than querying a huge table. But this benefit is only available if you know ahead of time which time period you wish to retrieve to get the data. You would also have to create a new table and move data every so often as data became obsolete.
SQL Server 7 and SQL Server 2000 support the concept of distributed partitioned views (sometimes referred to as materialized views). Distributed partitioned views consist of tables with the same schema on multiple servers, and you have to add linked server definitions to each server and create a view on one of the servers to union the data returned from each server. The idea is that the database engine would exploit the processing power of multiple servers to satisfy queries.
Distributed partitioned views (DPVs) have many limitations, which you can read about in SQL Server online documentation. Although DPVs provide performance benefits in some cases, they aren't particularly widespread. Arguably they don't offer a silver bullet for scaling enterprise-level applications. Furthermore, implementing DPVs is a laborious process, requiring much work on behalf of a DBA.
SQL Server 2005 supports table partitioning, whereby all partitions reside on a single server. Each table partition is associated with a single file that is part of a filegroup. The same file/filegroup can hold multiple partitioned tables.
With this architecture, the database engine determines which partition to poll at query execution time, without scanning the entire table. If data rows requested by the query reside in multiple partitions, SQL Server uses multiple processors to query more than one partition in parallel. You can partition indexes along with the tables on which they are created. Seeking or scanning through a smaller index is considerably faster than scanning the entire table or scanning the clustered index on a large table. So table partitioning can yield significant performance benefits for queries executed against large tables.
Let's look at a quick example of how table partitioning works. I won't go into the details of partitioning syntax in this tip, but you can look it up in SQL Server's online documentation. The example below is based on the data warehouse containing time reporting system's records. I will create seven filegroups in addition to the default filegroup, and each filegroup will contain a single file that will store a subset of data as defined by the partition function:
To test performance benefits of table partitioning, I added 15 million rows to the partitioned table, as well as an identical table that was not partitioned. The INSERT statement running against the partitioned table was considerably faster. Even on my laptop, which has less than a Gigabyte of RAM, the INSERT statement against the partitioned table was three times faster than against a table created on a single filegroup. Of course query execution times will vary depending on hardware resources at hand, so you can see a different level of improvement in your environment.
I further examined the execution plan of a simple SELECT statement returning all rows from partitioned and un-partitioned tables, given a specific date range in the WHERE clause. The same query had different execution plans for the two tables. The query against the partitioned table shows nested loops and a clustered index scan. Essentially, SQL Server treats the two partitions as separate tables and joins them together using a nested loop. The same query executed against the table without partitions uses a clustered index scan to return the same row set. Performance gains are even more significant if you have multiple tables created using the same partitioning scheme, and your queries join such tables.
You can examine the number of rows in each table partition by using the following query:
SELECT $PARTITION.TimeEntryDateRangePFN(time_entry_date) AS Partition,
COUNT(*) AS [COUNT] FROM fact_time_entry
GROUP BY $PARTITION.TimeEntryDateRangePFN(time_entry_date)
ORDER BY Partition
Table partitioning is a great feature for transactional as well as data warehousing environments. One of the main complaints from data warehouse users is that populating the fact table sometimes takes too long. While loading data into the fact table, the performance of user queries (or cube processing queries) will be significantly slower and might not succeed at all. Therefore, loading a large number of rows into a fact table might require downtime. With table partitioning this doesn't have to be the case any longer – you can populate the fact table in a blink of an eye -- literally. To demonstrate how this works, I'll create another table using the same partition function and partition scheme as before, this time called fact_time_entry2. I will seed the identity column at 50 million so the fact_time_entry2 table won't contain any keys that are also present in fact_time_entry:
Next I will populate this table with data for the year 2007. And let's suppose that fact_time_entry table has data for years before 2007. Once fact_time_entry2 table is populated, I can execute the following statement:
ALTER TABLE fact_time_entry2
SWITCH PARTITION 8 TO fact_time_entry PARTITION 8
This statement moves partition number 8, which happens to have data for 2007 from fact_time_entry2 into fact_time_entry and takes 3 milliseconds on my laptop. Within 3 milliseconds, my fact table has 5 million additional records! Yes, I had to populate the intermediate table prior to swapping partitions, but my users never need to worry about that – the fact table is available for querying with zero downtime! Behind the scenes, no data movement takes place – only the metadata for two tables is changed.
I could use a similar query to remove the data no longer needed in the fact table. For example, let's suppose we decided that we no longer care about time entry records from 2004. The following statement moves such records to the work table we created:
ALTER TABLE fact_time_entry
SWITCH PARTITION 2 TO fact_time_entry2 PARTITION 2
Again this statement completes in a matter of milliseconds. At this point I can drop fact_time_entry2 table or move it to another server. My fact table will no longer have any records from 2004. The partition must already exist in the destination table and it must be empty. You cannot switch partitions to a replicated table. The source and destination tables must be partitioned the same way, and the data to be transferred must reside in the same filegroup for both tables. Even with these limitations, the ability to switch partitions and populate fact tables with zero downtime is bound to make SQL Server data warehouse implementations more successful than ever before.
ABOUT THE AUTHOR
Baya Pavliashvili is a DBA manager with HealthStream, the leader in online healthcare education. In this role, he oversees database operations supporting over one million users. His primary areas of expertise include performance tuning, replication and data warehousing. You can reach Pavliashvili at email@example.com.
More on SearchSQLServer.com
- Business intelligence features in SQL Server 2005
- Get techniques to boost data warehouse performance in this tip
- Segregating data for optimum parallelization