In my previous article, I gave you a quick crash course on the steps you need to take to build a data warehouse. This article will dive into the details of the first step: extracting data from various transactional systems.
So your company or your client needs a data warehouse (DW). Where do you start? Believe it or not, the users of your data warehouse will be the ones that do most of the work. As a database architect, you're there to provide the technical expertise--and to do this, you'll need a good understanding of the business. In fact, by the time the DW project is completed, you'll probably know more about the business than the business owners themselves (whether you want to or not). If you have a luxury of having a business analyst (BA) on the project, that's where they come in handy. However, keep in mind that as a database architect and administrator, you'll need to have as good a knowledge of business as any BA.
First, determine who your DW users will be. Most of the time, these will be upper-level managers. Let's be honest: these types of folks can often be intimidating to work with. However, if you can demonstrate that you can help them manage their business better, they will gladly find a good portion of their precious time to talk to you. The types of questions to ask depends on the project goals. However, with any DW project you need to know the following:
- What are the transactional systems used by the company? Whether or not the management wishes to use any of these systems for the DW effort does not matter. Often they don't realize where the most valuable data of the company resides. After all, that's your job.
- Who owns the data? This is a tough one since some of the transactional systems could have been purchased from a third party vendor who does not wish to give out their data structure or the data format. Besides, there might be several people responsible for data upkeep within an organization. When they see a data warehouse architect, the first thing they do is update their resumes--they wonder if you're a sign of things to come. In any case, this is a very important steps so don't take it lightly. Be sure to get not only the names, but also the job titles and duties of each person or organizational unit responsible for data.
- What types of reports are expected to be derived from a data warehouse? This could be by far the most exciting step. This is where you get to interview all the future users of DW and get to know all of their expectations. Do not make a mistake of telling them what a DW could do for them! You're just gathering requirements, so the final result of the project is not yet determined. Be sure to document all requirements with the name and title of the person that requested the report.
- After you've got the data sources and types of reports you need, it's time to go after the data At this point, it's a good idea to generate a list of all the data fields (columns) that you will need to satisfy reporting requirements.
- Once you have a list of fields, look at the source systems. In some cases, if you're lucky, extracting data will be a piece of cake--all you have to do is some formatting and the data is ready to be loaded into your staging area. But approximately 95% of the time this won't be the case.
Let's stop here for a minute and think of the tasks you might have to undertake to extract the data.
First and foremost, you'll have to speak with folks who are responsible for each source system (remember step 2?). Most of the time, after explaining your mission, they'll be happy to grant you access to their systems. However, at times you might have to rely on upper management to get such permissions.
Next, you will have to relate data in multiple systems and try to come up with some sort of mapping of data fields. For instance, the sales information might be kept in one source and the inventory data might be in another data store. If the company needs to analyze their expenses against their revenues you'll have to grab data from both systems and come up with a way to compare inventory costs to the revenue that was generated against that cost.
At this point it is helpful to build an entity relationship diagram (ERD). If you've ever built a relational database before, you should be familiar with the ERD concept. However, the ERD for a DW differs in that it is not really concerned with primary and foreign key relationships. More important are cardinality rules: for example, each store might receive inventory supplies from multiple offices, or each office provides supplies to multiple stores. Therefore, you'll have to build a mapping table that tracks which stores receive supplies from each office. More importantly, this table will tell you which costs to compare to revenues from each store.
When analyzing reporting requirements, you're very likely to find that some of the requested data is not in the systems that you've been advised to utilize for the DW. At this point the list of all transactional systems utilized (from step 1) within the company is your best friend. Go back through this list and find a system that contains the necessary data elements. If you cannot find what you're looking for, you have no choice but to go back to step 1 and interview the business owners again.
When you do find a system that contains the fields you need, try to get at least a small sample data extract from that system. Understanding the data format is crucial and will save you much time when coding your own extract routines and the dimensional data model.
Once you've identified all the source systems--this is not a trivial task and might take anywhere from several weeks to several months, depending on the size of the project--you're ready to build data extract utilities. If you are a strictly database person, not knowing much about programming languages other than SQL, it's time to call your fellow developers. Data Transformation Services with SQL Server 2000 (and 7.0), for example, will help you in many instances, but more often than not, you will have to build your own application that connects to the third party system and extracts certain portions of the data.
Most of the time you'll have to extract data from a mainframe system or a client-server application. However, it's not unusual to find much of company's data located in spreadsheets, desktop databases, or mail systems. If you were looking for a developer to write your extraction routines, experience with one of the source systems would be a huge plus.
Keep in mind that the DW only needs those data elements that need to appear on the reports--nothing more, nothing less. The implication here is that if you extract all of the data from each source system you won't build anything but a very large, unmanageable mess. Conversely, if you don't get all the important data elements, then two weeks after your DW is in production, the business owners will ask for the missing fields and you'll have to rebuild your data extraction utilities. That is why performing step 4 thoroughly is so important. Be sure to allow sufficient time for this step, since not having all data elements will end up costing much more than extra week or two of requirements gathering.
When you get your hands on the extracted data, you have to have a way of storing it in some kind of a staging area. It is best to use the same database engine for the staging area as you will use for the final DW. This staging area should be a relational database, which has all the appropriate constraints (primary and foreign keys) and indexes defined appropriately. However, don't spend much time normalizing the staging database--after all, you won't ever need to worry about multiple users inserting and updating this data. Instead, the staging database should be de-normalized and optimized for querying.
Some data warehousing projects use the staging database as a base for building cubes. Generally, this is not a good idea, because the data in the staging db is in the same shape as in the source systems, which is not optimized for report generation. After data from the source systems is in the staging area you can build your dimensional model and populate it with data. The dimensional model is what the cubes need to be based on. I will discuss the ABCs of dimensional modeling in the next article.
So in a nutshell, the first step of any data warehousing project is a combination of business analysis, staging database architecture, and development of data extraction utilities. Laying some good groundwork at this point will save you many headaches and dollars at later stages of your project.
About the Author
Baya Pavliashvili is a MCSE, MCSD and MCDBA with three years experience with SQL Server.
For More Information
- What do you think about this tip? E-mail the editor at email@example.com with your feedback.
- Have a tip to offer your fellow DBAs and developers? The best tips submitted will receive a cool prize--submit your tip today!
- Ask your technical DBMS questions--or help out your peers by answering them--in our live discussion forums.
- Check out our Ask the Experts feature: Our SQL, database design, Oracle, SQL Server, DB2, metadata, and data warehousing gurus are waiting to answer your toughest questions.