Many companies today are building data warehouses to consolidate data from various sources and make it presentable. The size, complexity, and implementation vary, but one thing is constant: Managing and administering data warehouse tables requires more planning and thinking ahead than online transaction processing (OLTP) systems do. Let’s take a look at several challenges database administrators (DBAs) face as they handle ever-larger data marts.
When working with a data warehouse, you’ll find that operations you’re used to completing quickly on small tables aren’t possible without planning and downtime. Table-locking operations such as re-indexing or adding a column can take hours instead of seconds. A data warehouse may give you a longer maintenance window than an OLTP system would, but once your tables grow to billions of rows, it’s an entirely different game.
Luckily, SQL Server offers features that make managing very large tables much easier, though many of them are available only in Enterprise Edition. For example, online index rebuilds let you rebuild an index without locking the table, so it remains available for queries and updates while the index is built.
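As a minimal sketch, an online rebuild looks like this (the table and index names here are hypothetical):

```sql
-- Rebuild an index online (Enterprise Edition only). The table stays
-- available for reads and writes for the duration of the rebuild.
ALTER INDEX IX_FactSales_OrderDate ON dbo.FactSales
REBUILD WITH (ONLINE = ON);
```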
Another Enterprise feature worth looking into is table partitioning. Loading data into a large table with many indexes is slow because every index must be updated for each inserted row. If you drop the indexes first, the load runs faster, but then you need to rebuild them. When the table is huge and you are adding data, say, daily, re-creating the indexes means doing the same work over and over.
Table partitioning solves that problem. You partition the table on a suitable boundary (for example, by day, month, or week) and load each new batch of data into an empty staging table. Once the load is done, create the same indexes you have on the main table and “switch” the staging table in as a partition. On the back end, SQL Server just modifies the metadata, and the staging table becomes part of the main table. If you have tables larger than 30 GB to 40 GB, take a serious look at partitioning.
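The switch-in workflow above might be sketched as follows, assuming a fact table partitioned by month; all object names, dates, and the partition number are hypothetical:

```sql
-- 1. Partition the fact table by month (RANGE RIGHT: each boundary value
--    starts a new partition).
CREATE PARTITION FUNCTION pfMonthly (date)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- 2. Bulk load the new month into an empty staging table with the same
--    schema, on the same filegroup, then build the same indexes on it.
--    A CHECK constraint proving the rows fall inside the target partition's
--    range is required before the switch will succeed.
CREATE CLUSTERED INDEX IX_Staging_OrderDate
ON dbo.FactSales_Staging (OrderDate) ON [PRIMARY];

ALTER TABLE dbo.FactSales_Staging ADD CONSTRAINT CK_Staging_Feb2024
CHECK (OrderDate >= '2024-02-01' AND OrderDate < '2024-03-01');

-- 3. Switch the staging table in as partition 3 (February 2024 in this
--    scheme). SQL Server only updates metadata, so the switch is nearly
--    instantaneous regardless of row count.
ALTER TABLE dbo.FactSales_Staging
SWITCH TO dbo.FactSales PARTITION 3;
```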
Just as you can switch a partition in, you can switch it out and remove millions of rows from a table in milliseconds. If you’ve ever tried deleting millions of rows in small batches, you know it takes hours and bloats your transaction log and log backup files. Switching out a partition is, again, just a matter of modifying the table metadata and internal pointers.
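Switching a partition out is a one-statement operation; in this sketch the table names and partition number are hypothetical, and the archive table must be empty with a schema identical to the main table:

```sql
-- Move partition 2 out of the fact table into an empty archive table.
-- The rows move as a metadata change, so no large delete is logged.
ALTER TABLE dbo.FactSales
SWITCH PARTITION 2 TO dbo.FactSales_Archive;

-- The switched-out rows can then be discarded cheaply.
TRUNCATE TABLE dbo.FactSales_Archive;
```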
Page compression can also deliver significant query-performance improvements, and it is especially beneficial in data warehouse tables that contain many repeated values. Compressing your data pages brings two benefits: Your data takes up less space, and queries run faster because fewer pages have to be read. The price you pay is higher CPU usage, so make sure to test this and monitor the CPU. Most database servers have CPU to spare, since disk I/O tends to be the first bottleneck.
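A sketch of enabling page compression on a hypothetical fact table; SQL Server also ships a stored procedure that estimates the space savings before you commit to the rebuild:

```sql
-- Estimate how much space page compression would save.
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'FactSales',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Enable page compression by rebuilding the table.
ALTER TABLE dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = PAGE);
```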
Adding a non-nullable column to a large table can take a long time because SQL Server must expand each row, append the column, and populate it with a value. The table remains locked the whole time, and on large tables that could take hours. You are better off adding a column that allows nulls, running a process to populate the new column with the desired values, and then altering the column so that it no longer allows nulls.
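The three-step approach above might look like this; the table, column, batch size, and backfill value are all hypothetical:

```sql
-- 1. Add the column as nullable: a quick, metadata-only change.
ALTER TABLE dbo.FactSales ADD LoadBatchId int NULL;

-- 2. Backfill in small batches to keep transactions and locks short.
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) dbo.FactSales
    SET LoadBatchId = 0
    WHERE LoadBatchId IS NULL;

    IF @@ROWCOUNT = 0 BREAK;
END;

-- 3. Once every row has a value, disallow nulls.
ALTER TABLE dbo.FactSales
ALTER COLUMN LoadBatchId int NOT NULL;
```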
Since data warehouse servers need power and often multiple processors, per-processor Enterprise licensing costs are pretty steep. In some cases you can license your data warehouse with SQL Server client access licenses (CALs), which allow a small number of users to access the data directly. Once you complete your data loads and processing on the CAL-licensed server, transfer the aggregated data and reporting tables to a cheaper SQL Server instance (say, a Standard or Workgroup edition) that is licensed per processor. That way, you can serve that data to a Web application.
DBAs face several challenges when building a data warehouse, but managing large tables is certainly doable. Strategies such as online index rebuilds, table partitioning, and page compression improve performance and increase availability for such tables. Remember, these are extraordinarily large data volumes, so think ahead and plan your data load strategies carefully. Your organization is counting on you to keep those data warehouse tables available.
ABOUT THE AUTHOR
Roman Rehak is principal database architect at MyWebGrocer in Colchester, Vt. He specializes in SQL Server development, database performance tuning, ADO.NET and writing database tools. He contributes to Visual Studio Magazine, SQL Server Magazine and other publications and presents at user groups and conferences in the U.S. and Canada.