Building a data warehouse in SQL Server: Eight tips to get started
Baya Pavliashvili, Contributor
When building a data warehouse using SQL Server technologies, there are some challenges you
might face. I'll offer you some suggestions for overcoming them. Naturally, each environment is
likely to add its own twist to the challenges of building a data warehouse, but most of these
issues are generic enough to apply to many companies embarking on architecting business
intelligence (BI) applications.
- Management support. This is the first and foremost challenge you must overcome. It is
annoying to technical people (including yours truly) because it has to do with politics and not
software. Let's keep in mind who signs your paycheck, however. If management isn't convinced they
need a data warehouse all your technical skills will be in vain.
Unfortunately
Premium Access
Register now for unlimited access to our premium content across our network of over 70 information Technology web sites.
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States.
Privacy
Dig Deeper
-
People who read this also read...
-
This was first published in July 2007
- upper management often thinks of a data warehouse as yet another system they have
to invest in without any immediate return. Your job is to convince the powers that be they're not
paying for bells and whistles – data warehouse will help them make better, more informed decisions
by correlating data scattered across the enterprise. It is not uncommon to totally change the way
an organization operates after a data warehouse application is implemented. Sometimes it helps to
build a proof-of-a-concept (POC) project to demonstrate the power of BI. POC data warehouse will
only contain a small portion of all data
 |
| More on data warehousing: |
|
|
|
|
 |
 |
and will only be exposed to a few users for testing. Even if you
don't have an approval for a POC you should demonstrate parts of the warehouse as they become
available to your users; doing so keeps them interested and assured that the system will help them
be more effective in their work.
- Data availability. Data warehouse usually combines data from multiple data sources,
which aren't necessarily obvious or easily accessible. Depending on the company size, you might
have to travel to multiple offices and speak to numerous employees to figure out where you can get
the data that is necessary to build analytical views desired by your users. Again, depending on the
political climate of your enterprise, you might find it difficult to get your hands on each
necessary data element. Data you work with might contain confidential and highly sensitive details.
Furthermore, some data might only be accessible through reports provided by an outside entity who
will not grant you access to their data source.
How do you overcome this challenge? First, be sure to convince everyone that you're not taking
their job. Data warehouse is supposed to supplement and not replace any existing systems. If you
have management on your side you shouldn't have too many issues with internal employees. But if you
cannot get access to the data source – for unforeseen reasons – then you may have to get creative.
If data is only available on report screens or on paper you might have to find a way to capture
this data perhaps through scraping report screens or scanning documents.
- Complexity of data sources. Sometimes you get lucky and all data elements that you need
reside in the database management system (DBMS) of your choice. More often, data is dispersed
across multiple DBMS, spreadsheets, email systems, electronic documents and even on paper. Yes,
we're in 21st century, but take my word - there are still companies out there who keep certain
information only on paper. It's your job to figure out how to get data into your warehouse from all
disparate sources and give it a common shape.
- Data quality. Many transaction processing applications are thrown together using rapid
development techniques, often by people who have very limited knowledge or experience with the
tools they're using. This is not a derogatory statement; one must start somewhere and often junior
programmers are most affordable resources that a company can use. The trouble is that if the
application does not validate data you're likely to encounter string values that are abbreviated,
misspelled or completely omitted. For transaction level reports, this might not be a huge deal, but
when you're trying to group data and empower your users with the ability to make decisions, data
quality is essential.
For example, take a look at the following customer names:
a. ACME Boot Outlet
b. ACME bt outlt
c. A.C.M.E
d. Boots (ACME)
e. A c m e boot store (outlet)
Human eyes can easily spot that each of the above refers to the same business entity. But to the
computer program, each of these values represents a separate customer. Unfortunately there is no
easy way to correct all bad data using Integration Services (or Data Transformation Services)
packages. SQL Server 2005 Integration Services offers the Fuzzy Lookup Transformation which may
greatly simplify your data cleansing efforts. Even so, you might also have to enlist the help of
clerical workers to correct the data.
- Multitude and technical savvy of your users. Helping a handful of users get to the data
they need in order to make decisions is not a small fish to fry. Historically, data warehouse was
reserved for upper-level management's use. However, data warehousing has slowly made its way to the
masses. It's not uncommon to see data warehousing and BI applications used by users that don't
necessarily make strategic decisions.
Once your users see the power of data warehousing, they'll want to use it for everything from
data entry to retrieving transactional level reports. The larger your user community, the tougher
it becomes to keep everyone happy and to educate them on appropriate uses of your application. BI
applications perform great when used appropriately, but they are not suitable for every business
requirement. Often, you must be the bearer of bad news that data warehouse is not the tool of
choice for the functionality your users desire.
- Slowly changing dimensions. Your application must reflect data changes that occur in
data sources populating the warehouse. Data within dimension tables is particularly prone to
change. But how do you maintain the history of such changes?
- The first and easiest method is to overwrite the existing record without tracking changes.
Fortunately this method is acceptable for many dimensions. For example, if a department name
changes from "finance" to "finance and accounting" you probably won't be required to maintain the
history of such change. However, in a customer or student dimension it is often necessary to keep
track of changing names, marital status, education level, and other attributes - your application
must be able to retrieve current as well as historical values.
- The second method for managing slowly changing dimensions is to create a new record when the
value changes and tag the older record as obsolete.
- The third and final method is to maintain historical value(s) of the changing domain in
separate columns of the same row within the dimension table.
- So far we've covered the handling of
changing values in the data warehouse dimension. But what about Analysis Services dimensions? To
reflect changes in attribute values you must re-process dimensions. But if you process dimensions
fully you'd have to also reprocess cubes, which might be impractical if your cube contains large
volumes of data. Fortunately you can often use the incremental update (process update) option to
avoid reprocessing cubes.
- Changing facts. Normally one would think that records in the fact tables are static –
once the record gets to the warehouse your job is finished, right? Unfortunately the answer is "it
depends." In some cases like in a data warehouse tracking patients' hospital stays, all records are
generally static. If your hospital stay lasted from January 1 to January 5, that record isn't
likely to change.
But consider the retail industry; all sales aren't final – I'm sure you know people who often
take the goods they purchased back to the store for various reasons. Some companies would manage
such transactions as a series of credits and debits that balance each other out. But in other cases
you must either update or delete fact table records, even after they're added to the data
warehouse. For example, if a stock trade is recorded incorrectly, balancing it with a corresponding
negative trade might not be acceptable. Here's another thing to consider: you might not want your
customers to know about problems in your transactional systems. Rather, you want them to see the
data only after it is corrected.
One method for dealing with changing facts is to keep the records in the staging area until the
data is examined for quality and then migrate it to the warehouse. Even the most thorough testing
won't catch all errors in the data sources, however. Occasionally, you might have to refresh cubes
by processing those partitions that contain incorrect data. This is why it's essential to keep your
Analysis Services partitions small so that processing will complete relatively quickly.
Another way to deal with this challenge is through a write-back partition. With cube write-back,
you're not really changing data in the underlying relational warehouse; instead you're adding a
record to a separate partition. When a user queries a particular measure group Analysis Services
combines data from read-only partition with the data in write-back partition and exposes the
result. Not surprisingly, performing such calculations at query execution time imposes additional
overhead on Analysis Server and can cause performance degradation.
- Implementing security. Much like with any other business application, it's imperative to
secure access to your data warehouse and cubes. Each user might only be permitted to see a small
portion of the warehouse, depending on their job title or area of responsibility. Analysis Services
implements security through Windows accounts which you can associate with Analysis Services roles.
You can secure data at a very granular level, all the way down to individual cube cells.
Unfortunately creating a separate role for each user is not only tedious and cumbersome but
could also instigate performance problems – this is particularly true of Analysis Services versions
prior to 2005. The advice is to group your users into a handful of roles, if possible. If every
single user must have her own set of permissions, you'll have to implement a security dimension.
Once you have such dimension every MDX query will have to filter the output based on the identity
of the current user.
Naturally I can't cover every data warehousing challenge in a single tip. Nor can any one person
claim to be familiar with every difficulty involved in building data warehousing solutions.
However, if you're contemplating an attempt to implement such solutions, you should carefully
consider what you're up against. Do your homework and be prepared for a battle.
ABOUT THE AUTHOR
Baya Pavliashvili is a database consultant helping his customers develop highly
available and scalable applications with SQL Server and Analysis Services. Throughout his career he
has managed database administrator teams and databases of terabyte caliber. Baya's primary areas of
expertise are performance tuning, replication and data warehousing. He can be reached at
baya@bayasqlconsulting.com.
Disclaimer:
Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.
Join the conversationComment
Share
Comments
Results
Contribute to the conversation