Data warehousing (DW) has been around for approximately 20 years. Even so, many people do not have a good understanding of what DW is for, what it is supposed to do, and what it takes to build one. DW has become a popular term, largely due to the increasing number of companies that would like to take advantage of their historical data to make better decisions. Some credit must also go to Microsoft because SQL Server has made building data warehouses easier than ever. Let's talk about DW in general first.
The goal of DW is to provide your company with an easy and quick look at its historical data. Advanced OLAP (on-line analytical processing) tools let DW users generate reports at a click of a mouse and look at the company's performance from various angles. How far back you look at the data depends on the company. If you're in a manufacturing plant producing thousands of parts a day, it might make sense to build a DW to look at performance each week. On the other hand, if you have a handful of customers each day in a rental business you might wish to look at longer periods of time--for example, the past six months or so at a time. If you're into doing trends analysis you'll most likely want your reports spanning at least several years.
The process of building a DW can be fairly complicated or straightforward. A lot depends on where your data currently reside. Regardless of your data sources, building any DW will involve the following steps:
- Extracting the transactional data from the data sources into a staging area
- Transforming the transactional data
- Loading the transformed data into a dimensional database
- Building pre-calculated summary values to speed up report generation
- Building (or purchasing) a front-end reporting tool
The first step, extracting the data itself, can be the most difficult one to accomplish. Companies that have been in business for a long time (and therefore could benefit the most from a DW) most likely have kept their data in some type of a mainframe or proprietary system. Depending on the type of system, extracting these transactional data can be very difficult. Data Transformation Services (DTS), which is part of SQL Server 7.0 and 2000, lets you import and export data from any OLE DB or ODBC compliant database as long as you have an appropriate provider. However, the reality is that most of the time you won't have the luxury of having such providers in place.
An important follow-up step after data extraction is transforming and relating the data from multiple data sources. The IT environment is not homogenous. Most companies will have their data spread out in a number of various database management systems: MS Access, MS SQL Server, Oracle, Sybase, etc. Many companies will also have much of their data in flat files, spreadsheets, mail systems and other types of data stores. When building a data warehouse you need to relate data from all of these sources and build some type of a staging area that can handle data extracted from any of these source systems. The staging area can be a group of files--especially if you're getting extracts from mainframe systems--or, better yet, it will be a semi-relational database residing on the same server where the dimensional database will be hosted. I call the staging database "semi-relational" because it does not have to be completely normalized.
The third step is building a dimensional database and populating it with data from the source systems. Most of the modern transactional systems are built using relational database models. A relational model is optimized for tracking a large number of transactions and storing the transactional data in an efficient format. However, the relational format is not very efficient when it comes to building reports with summary and aggregate values. The dimensional approach, on the other hand, provides a way to improve query performance without affecting data integrity. However, the query performance improvement comes with a storage space penalty--a dimensional database will generally take up much more space than its relational counterpart. These days, storage space is fairly inexpensive and most companies will be able to afford large hard disks with a minimal effort.
Once you have built a dimensional model you have to find a way to fit the relational data into the format you've just built. This step is not as trivial as it might sound. It might mean combining several fields together, looking up values in one data source depending on the values in another data source, building business logic code to derive values for certain columns, and much more.
The data transformations can be done at one of the two stages: while extracting data from the sources and moving it to the staging area, or while loading data into the dimensional database.
The next step--generating the pre-calculated summary values--has been tremendously simplified by SQL Server Analysis Services (or OLAP Services as it was referred to in SQL Server 7.0). If you've populated your dimensional database and you've built your dimensions correctly, SQL Server Analysis Services does all the aggregate generation work for you. Keep in mind however, that depending on the number of dimensions you have in your DW, building aggregations can take a long time. This is when you need to decide on the dimensional model you wish to use: ROLAP (relational OLAP), MOLAP (Multidimensional OLAP), or HOLAP (Hybrid OLAP). The ROLAP model builds additional tables for storing the aggregates--but this takes much more storage space than a dimensional database, so be careful! The MOLAP model stores the aggregations as well as the data in multidimensional format, which is far more efficient than ROLAP. However, there are some "gotchas" that you need to be aware of when using MOLAP and I'll try to cover these in subsequent articles. The HOLAP approach keeps the data in the relational format, but builds aggregations in multidimensional format, so it's a combination of ROLAP and MOLAP.
Once you've built the dimensional database and the aggregations you can decide how sophisticated your reporting tools need to be. The cost of building a custom reporting (and OLAP) tool will usually outweigh the purchase price of a third party tool. There are several major vendors on the market that have top-notch analytical tools. Consider purchasing one of these suites before delving into the process of developing your own software because reinventing the wheel is not always beneficial or affordable. Building OLAP tools is not a trivial exercise.
This article only covered the very tip of the iceberg. In future articles I'd like to provide more details about each of these steps. Considering the fact that the smallest DW project takes at least six months to complete, you need to do your homework and learn as much as possible about data warehousing before going ahead with such a costly project.
About the Author
Baya Pavliashvili is a MCSE, MCSD and MCDBA with three years experience with SQL Server.
For More Information
- What do you think about this tip? E-mail the editor at email@example.com with your feedback.
- Have a tip to offer your fellow DBAs and developers? The best tips submitted will receive a cool prize--submit your tip today!
- Ask your technical DBMS questions--or help out your peers by answering them--in our live discussion forums.
- Check out our Ask the Experts feature: Our SQL, database design, Oracle, SQL Server, DB2, metadata, and data warehousing gurus are waiting to answer your toughest questions.