In my previous article, I described the steps you need to go through to gather data from various data sources....
After the extraction is completed, it's time to load the data in a format that will be most beneficial for reporting purposes.
If you've modeled relational databases before, you have some advantage, because you're aware of the complex thought process it requires. However, the normalized relational model will not work when designing a data warehouse. The normalized model tries to make the data entry as efficient as possible, since that's what the transactional databases are supposed to be used for. When normalizing, you try to get rid of any redundancies and attempt to store each data element in only one place.
The model you need to build for a data warehouse is dimensional, rather than relational. It is based on the measures and dimensions your users would like to see on their reports. Once you know what dimensions and measures your warehouse needs to support, you're three-quarters of the way through modeling your application.
Dimensional modeling, like relational modeling, is not a trivial exercise and should not be taken lightly. Many organizations have failed to allocate appropriate resources and time for modeling and ended up rewriting their systems several times. On the other hand, laying a good groundwork when modeling your databases will pay high dividends down the road when you get ready to deploy your applications.
A successful dimensional modeler has to have the following qualifications:
- A very good understanding of the business rules associated with the data.
- A good understanding of the data. That means, you need to know and document what data fields are required on each report, as well as where you get each piece of data; be sure to document what those fields are referred to in the original data source.
- Knowledge of the principles of the dimensional modeling.
- Knowledge of the particular database software utilized for deploying the data warehouse.
Let's consider each one of these points individually. The majority of the business rules will be built into the transactional systems, not the data warehouse. However, if you don't know the business rules, you can't relate various data elements and you'll have a hard time figuring out what it is that users are trying to accomplish with the data warehouse. For instance, suppose you're designing a data warehouse for a chain of retail stores. You will need to know their business cycle (such as where and how often they get their products), how they define their pricing structures, how they keep their records (that is, when their fiscal year starts and ends), how the data moves from each retail location to the central data store, etc.
The modeler has got to know every little detail of the data coming into the warehouse from transactional sources. Perhaps the most helpful exercise for this point is to build a data dictionary, documenting each data element that comes from the transactional systems and how it should be represented on the reports. This is where you need to be very detail oriented if you want to get it right. If there are any abbreviations used in the transactional systems (and there will be many), you need to know exactly what they stand for and what they mean to the end users of the data warehouse. The abbreviations used in the transactional systems might mean a lot to the folks that have designed those systems and the managers who have worked with that data for years. But what about the new employees? How about the employees that might be transferred from other departments? In general you should avoid presenting the abbreviated data elements on the reports if at all possible.
Also be careful when generating derived values. These might be common in the transactional system, but moving them to the warehouse will not be beneficial. For example, a retail store chain might use a combination of a region identifier and a store number to determine the type of goods sold in each store. Would it be a good idea to hard code this type of logic in your data transformations? Absolutely not. What if the chain adds new stores or even new regions--then what? If you hard code these types of rules, you'll have to modify the transformation logic each time the retail chain adds a new store or a region. Instead, you should advise the owners of the transactional systems to add a data element to their tables that will inform your code about the types of goods sold in each store. As an alternative, you can include such columns in your staging database.
Once you have a solid grasp of your data and the business rules involved it's time to generate a dimensional model. The heart of each dimensional model is the star schema, which gets its name from the way it looks. The star schema is composed of a giant fact table and multiple small dimension tables. The fact table is joined to the dimension tables through the key of each dimension. In addition, the fact table will contain measures--the numeric columns that present the count or summation of particular values that data warehouse users would like to see on their reports.
Let's think of the retail chain example again. The users might wish to see how much sales were generated by each store per each quarter of each year of operation for the last three years. They would also like to know who their best salespeople are and how their cost compares to the revenues they generate. In such case the measures would include the count of sales, dollars of revenue and dollars of cost. The dimensions will include store, employee, and time. So the components of the star schema for this simple example would include the following:
Fact_table Employee_key Time_key Store_key Dollars_of_revenue
Count_of_sales Dollars_of_cost (Other columns) Employee_dimension
Employee_key Last_name First_name (Other columns) Store_dimension
Store_key Region_name Store_number (Other columns) Time_dimension
Time_key Year Quarter Month Day (other columns)
Each dimension table might be as fine-grained or as coarse-grained as your data sources allow and as your users demand. For instance, if you have the cost data per hour, per employee you can include a column for each hour of the day in the time dimension. Whether sales per each productive hour really make much sense for reporting is another question. Similarly, if your business users wish to see the store phone number, address, and the store manager's name on their reports you can include those data elements in the store dimension. Of course this example is over-simplified for the sake of the introductory article. You'll have to do much more thinking and analyzing about your business rules before you can come up with the model. For instance, in some cases it might make sense to include employees as part of the store dimension. On the other hand, each salesperson might be allowed to work in multiple stores--that's where it pays to know the business rules.
The key of the fact table is usually the combination of all of the dimension keys. In the above example the key would be composed of the store_key, employee_key and time_key columns. However, this might not always be the case, again, depending on the business rules. For instance, if the same employee is allowed to work in multiple stores at once and the lowest level of the time dimension is a day, then you'd end up with Mr. Smith producing $400 worth of sales in store number 2 and $200 of sales in store number 4 on February 18th of 2001. If that is the case, then you'll have to keep looking for a unique identifier of some kind in the source systems--after all, that's where you get the data for your warehouse. If all else fails you can always generate your own key, but consider that as a last resort solution.
There are many other "gotchas" with dimensional modeling; however, I can't take time to cover them all in an introductory article. Just keep in mind that identifying your dimensions and measures is the biggest portion of the modeling task. The dimensions represent ways to "slice and dice" the data. In the example you just saw the managers wanted to see the data broken down by time, employee and store--those were the dimensions. The measures are numeric (or monetary) facts that the data warehouse users would like to see on the reports: the managers wanted to know who is generating the most sales (dollars_of_revenue), who has the highest number of sales (count_of_sales) and how much cost is associated with each (dollars_of_cost).
Last, but definitely not least, the modeler needs to have a good background of the database software he or she is to use for building a data warehouse. Although sometimes overlooked and underestimated, Microsoft SQL Server is fine database software when it comes to designing data warehouses. Microsoft provides Analysis Services as part of SQL Server 2000 at no additional cost. Analysis Services and SQL Server will make your job much easier than any other tool when designing pre-calculated values and aggregations for your data warehouse, as well as when automating cube-building tasks. Therefore, if you haven't already decided which database software to use, take a good look at what Microsoft SQL Server has to offer.
There are number of excellent books and on-line resources that will help you in learning SQL Server. You might also consider attending a class or two, depending on your budget and time constraints. However, most of your knowledge will have to come from experience, you just have to have the attitude and aptitude for learning while working with this product. The particular SQL Server skills that you might need for data warehousing will include a solid background in Transact SQL, knowing how SQL Server stores, uses and maintains indexes and data, as well as database maintenance and administration tasks. If most of your transformation logic is enclosed in Transact SQL (which is SQL Server's specific flavor of the Structured Query Language), some background in performance tuning will also be beneficial.
In this article I've introduced you to dimensional modeling concepts and the skills you need to acquire to be a successful data warehouse modeler. In the next article I'd like to discuss the details of building cubes and aggregations with SQL Server 2000.
About the Author
Baya Pavliashvili is a MCSE, MCSD and MCDBA with three years experience with SQL Server.
For More Information
- What do you think about this tip? E-mail the editor at firstname.lastname@example.org with your feedback.
- Have a tip to offer your fellow DBAs and developers? The best tips submitted will receive a cool prize--submit your tip today!
- Ask your technical DBMS questions--or help out your peers by answering them--in our live discussion forums.
- Check out our Ask the Experts feature: Our SQL, database design, Oracle, SQL Server, DB2, metadata, and data warehousing gurus are waiting to answer your toughest questions.