Project Apollo casts light on column-store indexes in SQL Server Denali

Project Apollo, part of the bells and whistles that comprise the upcoming SQL Server Denali, is Microsoft’s innovation on the index. Column-store indexes work in a whole new way, and they have the potential to nimbly scan millions of rows in a second.

Last year Microsoft introduced PowerPivot, a new add-on for Excel designed to let less-technical users create business intelligence (BI) applications. In demos at the Professional Association for SQL Server (PASS) 2009 conference and TechEd 2010, PowerPivot selected a data source and then brought all the data into the Excel file itself, and a table with 20 million rows took up only about 50 MB of storage in the Excel file. More amazing was that as the table downloaded table into Excel, PowerPivot performed several data scans of these 20 million rows in about a second.

SQL Server professionals were thrilled, and many predicted that eventually it would make its way into SQL Server itself. Sure enough, at the PASS conference last November, Microsoft announced Project Apollo -- a feature of the upcoming SQL Server Denali -- bringing column-store indexes into the database engine, resulting in some queries running up to several hundred times faster than by using regular indexes. During the demo, queries were run side by side. The query using the traditional indexes took more than a minute, while similar queries running on a column-store index returned results instantaneously from tables with several billion rows.

So what makes column-store indexes so much more efficient? The fuel behind this lightning-fast search capability is a new Microsoft technology called VertiPaq, which has a different way of storing columns than traditional indexes, and it effectively compresses the data in the index. In a regular index, all indexed data from each row is kept together on a single page, and data in each column is spread across all pages in an index. In a column-store index, the data from each column is kept together so each data page contains data only from a single column, as shown in Figure 1.

Figure 1

In addition, the index data for each column is compressed, and since many columns often contain highly repetitive values, the compression ratio can be very high. This architecture reduces the number of pages in the index and, if you are selecting only a few columns, it also reduces the number of pages that need scanning. Because the nature of the data varies, it is difficult to predict how much faster data retrieval will be in your scenarios, but based on comments from Microsoft it could go from several times faster to several hundred times faster. (Note: I haven’t been able to perform benchmarks since Project Apollo technology is not available in the first community technology preview, and CTP2 hasn’t yet been released.)

Building a column-store index is easy. You use the same index creation syntax and just specify the keyword COLUMNSTORE:

                 ON MyTable(colum1, column2, …)

 Follow the same rules you use for building a regular index and include only the columns you need. There are several things to keep in mind. In CTP1 It takes a lot longer to build a column-store index than a regular index, though that is likely to improve in future CTPs and its release-to-manufacturing version. Also, once you add a column store on a table, the table itself becomes read-only, so no inserts, updates or deletes. If you need to insert new rows or update existing ones, you can disable the index, do data modifications and rebuild the column-store index. Because of this limitation, this feature for now is more suitable for data warehouse tables that contain static data, where it’s acceptable for the data to be refreshed during only scheduled intervals. You can, however, use partitioning to avoid having to rebuild the index. For example, you can create a daily, weekly or monthly partition, load the data into a new table, build all indexes, and then switch in the table into the partitioned table.

There are three reasons why column-store indexes provide faster querying. First, since the column data in the index is compressed, it takes up less space, and SQL Server has to scan fewer pages.

And second, since fewer pages are used, it is more likely that SQL Server will be able to keep them in memory, greatly improving the likelihood of this data remaining in the memory buffers. So on systems with a lot of memory, you should see a big difference in performance, especially when the index is scanned multiple times and corresponding row store index does not fit into the memory.

Third, SQL Server only needs to retrieve index pages for the columns used in the query, as opposed to scanning regular index pages with additional columns that are not used in the query. So again, you get the same benefit -- since fewer pages need to be processed, it will get the results faster, and SQL Server is more likely to keep the columns cached in memory buffers.

Project Apollo column-store technology can provide significant benefits, improved user experience and reduce the times users have to wait to get back results from data warehouse tables. And though the limitations in this version make this technology unsuitable for tables in online transaction processing, or OLTP, environments, you’ll a huge difference in your data warehouse implementations. But who knows? In future versions of SQL Server, Microsoft could lift these limitations, and column-store indexes will be used the same way traditional indexes are today.


Roman Rehak is principal database architect at MyWebGrocer in Colchester, Vt. He specializes in SQL Server development, database performance tuning, ADO.NET and writing database tools. He contributes to Visual Studio Magazine, SQL Server Magazine and other publications and presents at user groups and conferences in the U.S. and Canada.

Dig Deeper on SQL Server Business Intelligence Strategies