Microsoft Analysis Services (MSAS) is a great tool for enabling business stakeholders to analyze data from various
perspectives -- and its growing popularity can be primarily attributed to its exceptional performance.
Using pre-calculated summary value (aggregations), MSAS helps you generate reports within seconds whereas running similar reports from a relational database management system (RDBMS) can take hours. But even with a multitude of aggregations, MSAS can struggle to provide optimal performance when dealing with huge volumes of data. In this tip, I will discuss the advantages of using cube partitions – a powerful mechanism for improving query performance and reducing downtime associated with cube processing.
Partitions offer many potential benefits summarized here:
Shorter downtime: A portion of a typical data warehouse is relatively volatile and needs to be processed often. Other portions are relatively static. For example, in a sales cube, you're likely to change the current year's data nightly (or perhaps even hourly), but sales from previous years might change only occasionally to account for merchandise returns and exchanges. You can create a partition for current year sales and another partition containing all sales from previous years. If your warehouse tracks sales that occurred in the past 10 years, processing only the current partition may be 10 times quicker than processing the entire cube.
Better query performance: Queries that summarize data over 10 years could take considerably longer than those that only search through the current year data. If you have appropriate partitions, Analysis Services only has to scan a small subset of all data to return query results, which can translate into dramatic performance improvements compared to queries running against a cube with a single (default) partition.
More focused aggregations: Both MSAS 2000 and MSAS 2005 allow you to have different sets of aggregations for each partition. The partition that is queried frequently could benefit from additional aggregations, which in turn could warrant better performance. Partition(s) that aren't used heavily can be processed less frequently -- perhaps monthly or quarterly -- with considerably fewer aggregations. Usage-based optimization wizard can be executed for each partition to determine any additional aggregations that can benefit performance of queries actually executed against a given partition.
Customized storage and processing settings: Frequently accessed partitions might benefit from proactive caching and ROLAP storage. On the other hand, other forms of storage and processing might be better for less frequently queried partitions.
Distributed query and processing load across servers: MSAS allows you to create remote partitions; a remote partition resides on a server different from its parent cube. This way the queries that affect the remote partition are processed on a server separate from its parent cube, allowing you to take advantage of additional processing power.
Partitions processed in parallel: With MSAS 2005, you can process multiple partitions in parallel on a server that has multiple processors. This can further reduce the total cube processing time.
I mentioned that MSAS 2000 allows you to include the same data in multiple partitions, which may skew calculations on your reports. Fortunately, Microsoft fixed that in MSAS 2005. Perhaps what's more important is that partitions require more work. You must determine the appropriate data sets typically represented on reports and then partition data accordingly. Also, you have to decide which storage mode and aggregation scheme makes sense for each partition. More partitions mean additional maintenance overhead because you may have to merge partitions over time.
Let's suppose most of your analytical reports query current month's data; you could have a partition for the current month that you refresh daily (or perhaps even hourly), a partition for all previous months of the current year and a partition for all previous years. Once the current month is over, you'll have to merge the "current" partition with the current year's partition and create a new partition for the current month. As I mentioned earlier, MSAS only allows you to merge two partitions with identical aggregations -- so you might have to rebuild your aggregations before merging.
Even with the drawbacks, partitions present an invaluable tool for improving the performance of your analytical queries. And, more importantly, they reduce cube processing time and increase cube availability.
Figuring how many aggregations you need
The number of aggregations you create for your cubes depends on your users' analytical needs. More aggregations typically translate into better query performance, but they also use more disk space and extend cube processing time.
Disk space, typically, is not a huge concern for two reasons:
- Storage costs decrease daily and costs of building a data warehouse will far outweigh storage costs.
- MSAS offers the efficient multidimensional OLAP (MOLAP) storage option: A relational warehouse typically measured in a few gigabytes would only require several hundred megabytes to store aggregations.
On the other hand, long cube processing times present a more serious limitation. Full cube processing requires downtime, meaning no one can execute queries against cubes being processed. Longer processing time means lengthier downtime. As your warehouse grows larger, so will the cube processing time. An eight-hour downtime window (during non-business hours) may be acceptable for some environments, but users won't be happy when that window opens to 16 hours. Furthermore, downtime is not feasible at all if your business provides analytical solutions to customers across the globe, 24 hours a day, seven days a week.
MSAS also reaches a point of diminishing performance improvements with additional aggregations. Creating one-third of all possible aggregations typically provides the best performance; any aggregations beyond that achieve marginal performance improvement at best.
So how do you keep cube processing time to a minimum and boost analytical query performance? One option is to partition cubes based on certain criteria commonly used to filter queries. A partition is simply a portion of the full cube that can be limited by some criteria.
For example, if analytical queries normally limit output by customers' marital status, you could create two partitions -- one for married customers and one for single customers. Cubes are more commonly partitioned by time dimension members, as is the case when you create a separate partition for each year or each quarter.
Partitions are available in both MSAS 2000 and MSAS 2005, but they are only supported in the Enterprise Editions. This fact shouldn't come as a surprise because partitions typically make sense with large-scale data warehouses that would be more likely to use Enterprise Edition. Note also that partitions are seamless to cube users: They execute queries against the cube and don't have to specify partition names that contain particular data sets. Instead, MSAS determines which partition to poll at query runtime.
By default, each cube has a single partition containing all data in the fact table. To create a new partition, you can invoke the Partition Wizard and specify a portion of the data you wish to include in the partition. For example, if you have a time dimension that spans years 2000 through 2006, you can create a separate partition for each year.
You can also partition your cubes using multiple dimensions. For instance, the sales cube may be partitioned by promotion category, customers' education level and yearly income. Note that you can create different partitions based on the same fact table or different fact tables. For example, you could have a fact table for each year of sales, or perhaps a view for each year of sales data based on a single fact table, that can be used to build each partition.
One limitation of MSAS 2000 is that it allows you to create a partition containing data also included with other partitions. Suppose we create a partition for customers who make more than 150K per year; customers who make more than 150K will be counted twice -- once in the default partition and second in the new partition. MSAS doesn't automatically remove such data from the default partition, nor do you get a warning that you may be counting some data twice. To avoid incorrect calculations, you must edit the default partition. In addition, there is no way to exclude dimension members when designing partitions; you couldn't advise the Partition Wizard to include all income levels except those greater than $150K.
MSAS 2005 allows you to create partitions based on fact tables found in your data source views. You can also write a SQL query that can be used as the source for your partitions. MSAS 2005 warns you if you are using a fact table for an existing partition, and prevents you from counting the same data twice.
Analysis Services allows you to merge partitions, which is helpful when the current partitioning scheme is no longer effective. Suppose you have a partition for year 2006. In 2007, you may wish to combine this partition with the "historical" partition and create a new partition for 2007. You can merge two partitions if both have the same aggregations, same structure and belong to the same cube. In order to ensure that two partitions have the same aggregations with MSAS 2000, you can copy the aggregation design from an existing partition when you create a new partition. MSAS 2005 lets you copy aggregation design from one partition to another after partitions have been created.
Note that Analysis Services calculates the total number of dimension members and fact table size when dimensions and cubes are first created. Initial size of the dimension can be considerably smaller than the size of the same dimension six months later. However, MSAS 2000 never recalculates the size of the dimensions or facts. You should modify dimension level member counts and fact table size properties manually within the dimension editor and cube editor. Furthermore, MSAS 2000 doesn't have a way of determining the total number of members per partition; and you cannot specify the total number of rows for the partition. If the member counts are incorrect, chances are your aggregations will not be optimal. You should use the partition aggregation utility to specify each partition's fact table size.
About the author
Baya Pavliashvili is a DBA manager overseeing database operations supporting over one million users. Baya's primary areas of expertise include performance tuning, replication and data warehousing.