BACKGROUND IMAGE: iSTOCK/GETTY IMAGES
Because issues of scale often arise in large data sets, most of the new features for scalability and capacity management...
in Microsoft’s upcoming SQL Server 2012 are aimed specifically at huge workloads: data warehousing, business intelligence (BI) and decision-support applications. Putting together apps like these is hard enough, but maintaining their data structures is a whole different ball game.
Column-store indexes are billed as the biggest scalability improvement in SQL Server 2012, now available in a community technology preview (CTP). A column-store index packs data for each column on a separate set of disk pages. This is in contrast with the traditional method of storing multiple rows per page.
Microsoft touts several advantages of using this alternate indexing method. For one, storing columns this way makes retrieving the columns needed for computing a particular query easy and fast. This is because the index compresses data; columns are on average more redundant than whole rows and therefore more compressible. Disk access patterns and caching allow the data to be pulled and retained much more quickly. The more columns you have in a given table or the more columns you plan to add to a given table, the better a columnar index will let that table scale up.
The advantages from a scalability point of view are clear. You can add capacity to a database without worrying about how that might affect the performance of BI queries, where the amount of data can run into billions of rows. Microsoft’s own preliminary work claims orders-of-magnitude speedup for various kinds of queries on such huge numbers of rows, depending on the types of data at hand.
One thing to note is that once you add a column-store index to a table, it becomes read-only. Specifically, you can’t use INSERT, UPDATE, DELETE or MERGE statements or use bulk operations to add data. My guess is this is a way to keep the column-store index from being repeatedly updated, thus killing the performance advantages of having one in the first place.
If you need to run queries without the column-store index (for instance, to determine the net performance gain) you can use the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX option in a Transact-SQL (T-SQL) query -- the very name implies that Microsoft has plans to roll out clustered column-store indexes.
More table partitions
One way to add data to a table with a column-store index -- short of dropping the index, adding the data and recreating the index -- is to switch in a new partition for the table. By default, a table in SQL Server allows up to 1,000 partitions. SQL Server 2012 ramps up the limit to 15,000 by default, which makes it a lot easier to switch in partitions needed for massive operations. Apparently, Microsoft wanted to give users the option of using one partition every day; With 15,000 partitions, you can manage 41 years of data this way.
While adding partitions gives you more scaling flexibility, it doesn’t come without a cost. If you plan on using a lot of partitions, your database server should have at least 16 GB of random-access memory. This is not an outwardly enforced limit; SQL Server won’t explicitly refuse to mount a database with more than 1,000 partitions on a certain machine if it doesn’t detect some minimum amount of memory. But the more partitions you have, the greater the odds you’ll run into performance problems or memory exhaustion on queries.
If you want to use SQL Server 2012 CTP as a test bed for capacity planning with these new features in place, one thing to keep in mind is how the CTP handles statistics. Microsoft says, “Statistics [in the SQL Server 2012 CTP] are not created by scanning all the rows in the table when a partitioned index is created or rebuilt. Instead, the query optimizer uses the default sampling algorithm to generate statistics. … To obtain statistics on partitioned indexes by scanning all the rows in the table, use CREATE STATISTICS or UPDATE STATISTICS with the FULLSCAN clause.” Microsoft claims these changes won’t affect query performance, but they can be problematic if the precision of your statistics affects any capacity or performance planning work.
These features were activated in SQL Server 2012 CTP3 and are not available in the earlier public CTP, which was released last November. Also, if you’re dealing with databases that still use the 32-bit x86 architecture, bear in mind that a table or index with more than 1,000 partitions is possible on those systems but it is not a supported configuration, thanks to Microsoft’s general push toward 64-bit servers.
AlwaysOn is Microsoft’s high-availability and disaster-recovery technology in SQL Server 2012. It uses a variant on the same “availability group” technology the company rolled into recent versions of Exchange.
I’m mentioning AlwaysOn here for two reasons. One, scalability and capacity planning will be affected by how you implement mirroring and clustering. The good news is that clustering in Windows no longer relies as heavily on homogenous hardware, so it’s a little easier to scale up (or out) and keep clustering working. Plus you won’t have to store everything in one place or use the same hardware across the board.
And two, AlwaysOn can be used to offload backups to a secondary replica as a way to ease the I/O and processing load on your main servers. If you haven’t given much thought to altering your backup infrastructure, this is a good excuse to do so. The end result: You can reduce the need to add capacity on the front end and scale out your back end much more efficiently.
FILESTREAM and FileTable
Another feature in SQL Server 2012 that may affect scalability and capacity is an improvement to the FILESTREAM feature introduced in SQL Server 2008. FILESTREAM allowed SQL Server to work directly with storage on the huge volumes that New Technology File System (NTFS) can hold, so binary large objects (blobs) can be stored as files on disk but addressed with T-SQL code. This lets SQL Server leverage NTFS for storing freeform data outside the constraints of the database engine -- a good way to enhance scalability if you previously shoehorned freeform data directly into databases.
SQL Server 2012 expands on FILESTREAM with FileTable, which does something even more interesting. It allows SQL Server applications to see the on-disk file system as a database table, write files to it and have other Windows applications -- not just other SQL Server apps -- recognize those files as well. FileTable uses a standard Windows share to expose the files in question to applications throughout Windows, so to get conventional Windows apps to work with them all you need to do is point them at the new share. FileTable’s even implemented as a Common Language Runtime class, so it can be accessed using .NET framework methods, too.
I should point out that while the new features in the current CTP are undeniably attention-getting, don’t count on using any of them in a production environment until SQL Server 2012 is officially rolled out. If you want to get your feet wet with 2012, do it in an isolated fashion: Create a separate database instance on its own machine (or in a virtual machine) and work on copies of data, not originals. Microsoft has a history of offering remarkably polished technology previews -- the pre-beta editions of Windows 7, for instance, were astonishingly solid -- but that’s no reason to put your data or your database system’s integrity at risk.
About the author:
Serdar Yegulalp has been writing about computers and IT for more than 15 years for a variety of publications, including InformationWeekand Windows Magazine.