Data warehouses are typically the largest databases at any organization. Building and managing systems is a large...
undertaking and these projects can quickly become unwieldy with numerous users providing incompatible input. Boosting the system's query performance is possible but takes planning, followed by insightful design and development phases. In this tip we will outline techniques to capture and plan for performance needs and then we'll boost your data warehouse performance on SQL Server.
Performance cannot be an after thought for any system let alone systems supporting hundreds of gigabytes of data to multiple terabytes. As you gather the data warehouse requirements, drill into the user performance requirements. Based on the expectations consider a variety of options to deliver the data to them.
Once you've established users' needs, determine estimates for systems' growth in six-month increments over the next three to five years. Determine the number of users that will be using the system over the same time period. Next, determine which business functions will be leveraging the data and the impact of slow performance to their productivity. These estimates will give you a feel for how the users will want to use the data as well as their performance expectations. Now it's time to develop a strategy for distributing the data.
It is important to understand how the users want to use and access the data. If they are primarily looking for daily reports with some ad-hoc reports, then you have the luxury of doing the majority of the heavy lifting after hours. This would include the data load from the operational systems, then querying the data to build and deliver the reports. This is a much different paradigm than having to support a near real-time data warehouse where the data must be updated from the operational system on an hourly basis in order for decisions to be made by individual departments or across the entire organization.
Once you have the system estimates and data delivery, start to think about the infrastructure that supports the data warehouse. This will not only be the hardware and SQL Server version for the data warehouse, but also the client side application and client hardware. You do not want to be in a situation where you have a powerful data warehouse SQL Server with wimpy client machines that cannot support the volume of data returned in the reports. Another consideration is the front-end application, which depends on the vendor and can range from a simple interface to sophisticated features customized for your business.
SQL Server hardware design
Once you configure your hardware, it is difficult to make changes to the system without a significant investment of time across the entire IT organization. Plan for the future. Either purchase a system that allows you to upgrade the CPUs, memory, disk drives, controller and so on, and you are confident you can purchase equipment in the next one to three years, or purchase a server that is over-powered initially, but will support the needs in the long term. When it comes to configuring disk drives to properly support SQL Server, check out my tip on Optimizing SQL Server hardware performance. Depending on the volume of data, number of users and budget, you may want to consider a SAN to support the I/O needs.
Life of the system
Based on the hardware, applications and user requirements, tell your users the life of the system from an IT perspective. Let them know the current system will support a specific amount of user needs for a specific period of time. For instance, this system could support 50 concurrent users for three years and at that point the system would need to be re-evaluated.
Data warehouse database design
Based on the reports and the team's data warehouse experience, you might have a single or multiple databases. In a multiple database design, you can set up specific databases for the following purposes:
- Data loads from operational systems
- Cleansed raw data from the data loads database
- Aggregated/summarized data for the users to query
- Archived data
These databases offer the developers a means to import data from the operational systems, massage the data to ensure it is free of errors and then to build reports or
summarize data for the user's consumption. From a performance perspective, it may be beneficial to split these databases on different disk drives. That may prevent I/O contention from limiting the performance between the systems.
From an underlying database design perspective, the database to support the data loads may match the operational system. The cleansed raw data may either be in a dimensional or relational design to meet the data warehouse needs. Finally, the aggregated\summarized database design may be a combination of dimensional or a relational design. It may also simply be a table to support data for a single report.
I will explain archived data in the next section.
Base the design of the database on the data type, while keeping in mind the growth rate for the data set. For example, consider logically splitting the data into smaller tables based on the date, such as a table per quarter or month. If you can functionally split the data, determine if it makes sense. For example, if you have phone data, does it make sense to have all of the data in one table or have separate tables for cell phone data, house phone data, corporate data, long distance, local, etc.?
Another consideration is to negotiate the frequency for archiving data with your users. Depending on the data or industry, you may have some legal restrictions. The volume of data is going to have the largest impact on performance. Determine if your raw databases can support a finite period of time and then move the older data to an archive database with the same database design. If research and reporting are needed, this data can still be accessed, but it will not impact the primary database supporting 99 percent of the queries.
SQL Server features and configurations
Many of the items outlined in this tip are really independent of SQL Server and could potentially be used with any database platform to support a data warehouse. Now we will jump into opportunities specific to SQL Server that boost a new or existing data warehouse's performance. These items include:
SQL Server 2005 partitions
Partitions are a new SQL Server 2005 feature. With a partition, a column or columns are selected in addition to a range of values to split the table horizontally in a predetermined number of underlying file groups. A common partition column is a date column or a logical grouping for the table such as a region, time zone, etc. This has the potential to improve query performance for large tables that are common to data warehouses because file groups can be split across a number of underlying physical disks, which can improve the I/O. You can also create partitions for indexes that follow the same performance improvement paradigm. For additional information about partitions, reference Planning Guidelines for Partitioned Tables and Indexes.
If partitioned tables are new to you, consider using the Database Engine Tuning Advisor as a means to gain some insight into your system and the partition features. The Database Engine Tuning Advisor assists you in making recommendations for a number of performance improvements to include partitions. Just save the results from a SQL Server Profiler session, and Database Engine Tuning Advisor can analyze the results and provide potential partition recommendations. It is prudent to take those results and calculate meaningful performance metrics related to the partition recommendations. Next, test the metrics with the existing configuration and the recommended configuration to determine the benefits.
SQL Server 2000 local or distributed partitioned views
Along the same lines as the partition features are local or distributed portioned views. Local views include tables on the same server and distributed views have tables across many servers referred to as a federation of servers. These views can be very beneficial for querying static data housed in multiple locations. The view serves as a means to bring together the horizontal data stored in the underlying tables. For additional information on partitioned views, reference Creating Partitioned Views. In addition, check out the information on Federated Database Servers.
The max degree of parallelism is a SQL Server configuration that enables DBAs to fin>e tune how the optimizer uses CPUs. With max degree of parallelism enabled, which is the default for SQL Server, all of the CPUs can be used in parallel to process query requests. This may be beneficial for a finite number of query requests that are resource intensive. If concurrency is a need in the data warehouse, consider a configuration where a portion of the CPUs is used for parallel processing and the remainder is used for nonparallel processing. This configuration will improve the CPUs' capacity for higher levels of concurrency. To limit queries from using parallel query plans, set the Cost Threshold for Parallelism configuration based on reviewing the query plan cost for representative queries in your environment. For additional information about max degree of parallelism, reference max degree of parallelism Option.
SQL Server database maintenance
Data warehouses need maintenance just like any other relational database engine application. As you begin the project, be sure to schedule regular maintenance windows on a per-load, weekly or monthly basis. A small amount of maintenance such as index rebuilds and update statistics can go a long way toward performance improvements. The design of the underlying tables will determine maintenance time. You can perform the process in short maintenance windows with a number of concurrent maintenance processes or you may require much larger timeframes with very large tables and indexes. You can find more helpful information on SQL Server 2005 database maintenance in SQL Server Integration Services: Simplify database maintenance.
Boosting the performance of your data warehouse is possible at any stage of the project. You can have the greatest gains over the life of the system by first properly gathering the requirements and then building a system to meet the user needs. SQL Server 2005 offers additional benefits to improve currently implemented systems, so be sure to check out those capabilities to benefit your application. Happy coding…
About the author:
Jeremy Kadlec is the principal database engineer at Edgewood Solutions, a technology services company delivering professional services and product solutions for Microsoft SQL Server. He delivers frequent presentations at regional SQL Server users groups and nationally at SQL PASS. He is the author of the Rational Guide to IT Project Management. Kadlec is the SearchSQLServer.com Performance Tuning expert.