When it comes to SQL Server, most people think of parallelization in terms of making use of multiple processors or running queries side by side. In this article, however, I'll tell you how to arrange your data to get the most out of parallelization and how to segregate and partition data so it can be accessed as fast as possible.
Splitting data across physical files and filegroups
The most common way to parallelize access to things in a database is to segregate tables into multiple physical files. If you have a table that is being written to and read from much more aggressively than most other tables, it makes sense to put it into its own file so it can expand independently. One approach is just to place each table in its own physical file and I've seen some people follow this rule with frightening zeal. But if you have dozens or even hundreds of tables this can quickly become counterproductive, especially if some of them have much lower traffic than others.
To further parallelize access, files and groups can be split across multiple physical storage devices. In fact, this is the best step to take if you can manage it. Keep in mind that clustered indexes always live in the same filegroup as their attendant table, so they can't be split off. You can, however, split BLOB data, such as columns with ntext or IMAGE datatypes, into its own physical file. That way, you'll have it live on a separate storage device for better parallelism.
Segregating criteria
First, ask yourself this key question: What criteria is the best to use to split up the data and parallelize it? In the time I've been creating and maintaining databases, I've used two basic criteria to figure out how to split things up and group things together:
- Frequency of access. How often something is read or accessed is a factor. If you have a number of tables that tend to be read together, you can keep them in the same physical file. If the tables are updated often, though, it might make more sense to break them out into their own files.
- Frequency of updates. Tables that get written too frequently should be given their own separate physical files whenever it's practical, and they should have an appropriately large auto-expansion factor. This way, the table can expand or shrink on its own independently of everything else.
I also hinted above that the type of data is also a good segregation criterion, so I usually place text or BLOB data in their own physical files to allow more direct access.
Splitting data across databases
Data can of course also be split across more than one database. The most common use for setting up multiple databases that contain different parts of the same data set is when you're archiving historical data. Anytime you rotate part of a data set out of active use due to aging, put that data in its own database to be accessed separately.
One of the biggest reasons for this segregation is because of backup and restore. If something goes wrong with the historical database and you have reliable backups for it (and you'd better, or Murphy'll get you!), it's a lot easier to restore it independently without breaking symmetry with the rest of your data. Also, if you already have that data in its own database and your system design gracefully takes into account the possibility that such things can go offline, you can recover without having to take everything offline. Better to temporarily pull the plug on your archives than to force everyone to come back later.
Partitioning
Partitioning in SQL Server involves taking data normally located in one table and splitting it across multiple tables in some logical fashion. One obvious example of this (the one cited in SQL Server Books Online, in fact) is to split date-sorted data into 12 subtables, each for one month of the year. A downside of using this kind of partitioning, however, is that you often need to create views to rejoin the data using the UNION operator, which can be a performance killer.
SQL Server 2005 does allow tables to be partitioned in a few native ways that can further parallelize access. The above example -- splitting by months -- would be accomplished using a RANGE FUNCTION. You could split each month into a separate partition and then store it in a separate physical file. Another advantage to partitioning is that indexes (which can also be partitioned like this) can be rebuilt a lot more efficiently, since you can do them piecemeal instead of all at once. For more information on the ins and outs of partitioning in 2005, refer to my article Partition indexes for improved SQL Server 2005 performance.
Conclusion
The biggest reasons for segmenting data for parallelization are of course to improve performance and manageability. The latter is something that usually demonstrates itself over time, but the performance factor can be proven fairly quickly by gathering live performance statistics and comparing them to how things behaved before. Do this with live data whenever possible so you can gauge the degree of improvement you're getting after rearranging your data. The comparisons will indicate whether future similar optimizations will be worth the effort.
About the author: Serdar Yegulalp is editor of the Windows Power Users Newsletter. Check it out for the latest advice and musings on the world of Windows network administrators -- and please share your thoughts as well!
More information on SearchSQLServer.com
Tip: Partition indexes for improved SQL Server 2005 performance
Tip: Use cube partitions to improve Analysis Services performance
Guide: Learning Guide: SQL Server performance tuning A to Z