In my previous article, I gave you a quick crash course on the steps you need to take to build a data warehouse. This article will dive into the details of the first step: extracting data from various transactional systems.
So your company or your client needs a data warehouse (DW). Where do you start? Believe it or not, the users of your data warehouse will be the ones that do most of the work. As a database architect, you're there to provide the technical expertise--and to do this, you'll need a good understanding of the business. In fact, by the time the DW project is completed, you'll probably know more about the business than the business owners themselves (whether you want to or not). If you have a luxury of having a business analyst (BA) on the project, that's where they come in handy. However, keep in mind that as a database architect and administrator, you'll need to have as good a knowledge of business as any BA.
First, determine who your DW users will be. Most of the time, these will be upper-level managers. Let's be honest: these types of folks can often be intimidating to work with. However, if you can demonstrate that you can help them manage their business better, they will gladly find a good portion of their precious time to talk to you. The types of questions to ask depends on the project goals. However, with any DW project you need to know the following:
Let's stop here for a minute and think of the tasks you might have to undertake to extract the data.
First and foremost, you'll have to speak with folks who are responsible for each source system (remember step 2?). Most of the time, after explaining your mission, they'll be happy to grant you access to their systems. However, at times you might have to rely on upper management to get such permissions.
Next, you will have to relate data in multiple systems and try to come up with some sort of mapping of data fields. For instance, the
To continue reading for free, register below or login
To read more you must become a member of SearchSQLServer.com
');
// -->

sales information might be kept in one source and the inventory data might be in another data store. If the company needs to analyze their expenses against their revenues you'll have to grab data from both systems and come up with a way to compare inventory costs to the revenue that was generated against that cost.
At this point it is helpful to build an entity relationship diagram (ERD). If you've ever built a relational database before, you should be familiar with the ERD concept. However, the ERD for a DW differs in that it is not really concerned with primary and foreign key relationships. More important are cardinality rules: for example, each store might receive inventory supplies from multiple offices, or each office provides supplies to multiple stores. Therefore, you'll have to build a mapping table that tracks which stores receive supplies from each office. More importantly, this table will tell you which costs to compare to revenues from each store.
When analyzing reporting requirements, you're very likely to find that some of the requested data is not in the systems that you've been advised to utilize for the DW. At this point the list of all transactional systems utilized (from step 1) within the company is your best friend. Go back through this list and find a system that contains the necessary data elements. If you cannot find what you're looking for, you have no choice but to go back to step 1 and interview the business owners again.
When you do find a system that contains the fields you need, try to get at least a small sample data extract from that system. Understanding the data format is crucial and will save you much time when coding your own extract routines and the dimensional data model.
Once you've identified all the source systems--this is not a trivial task and might take anywhere from several weeks to several months, depending on the size of the project--you're ready to build data extract utilities. If you are a strictly database person, not knowing much about programming languages other than SQL, it's time to call your fellow developers. Data Transformation Services with SQL Server 2000 (and 7.0), for example, will help you in many instances, but more often than not, you will have to build your own application that connects to the third party system and extracts certain portions of the data.
Most of the time you'll have to extract data from a mainframe system or a client-server application. However, it's not unusual to find much of company's data located in spreadsheets, desktop databases, or mail systems. If you were looking for a developer to write your extraction routines, experience with one of the source systems would be a huge plus.
Keep in mind that the DW only needs those data elements that need to appear on the reports--nothing more, nothing less. The implication here is that if you extract all of the data from each source system you won't build anything but a very large, unmanageable mess. Conversely, if you don't get all the important data elements, then two weeks after your DW is in production, the business owners will ask for the missing fields and you'll have to rebuild your data extraction utilities. That is why performing step 4 thoroughly is so important. Be sure to allow sufficient time for this step, since not having all data elements will end up costing much more than extra week or two of requirements gathering.
When you get your hands on the extracted data, you have to have a way of storing it in some kind of a staging area. It is best to use the same database engine for the staging area as you will use for the final DW. This staging area should be a relational database, which has all the appropriate constraints (primary and foreign keys) and indexes defined appropriately. However, don't spend much time normalizing the staging database--after all, you won't ever need to worry about multiple users inserting and updating this data. Instead, the staging database should be de-normalized and optimized for querying.
Some data warehousing projects use the staging database as a base for building cubes. Generally, this is not a good idea, because the data in the staging db is in the same shape as in the source systems, which is not optimized for report generation. After data from the source systems is in the staging area you can build your dimensional model and populate it with data. The dimensional model is what the cubes need to be based on. I will discuss the ABCs of dimensional modeling in the next article.
So in a nutshell, the first step of any data warehousing project is a combination of business analysis, staging database architecture, and development of data extraction utilities. Laying some good groundwork at this point will save you many headaches and dollars at later stages of your project.
Part 1: The basics
Part 2: Data extraction
Part 3: Dimensional modeling
About the Author
Baya Pavliashvili is a MCSE, MCSD and MCDBA with three years experience with SQL Server.
For More Information