Change data capture in SQL Server 2008 improves BI reporting accuracy
A common challenge of working on large business intelligence systems or data warehouse projects
is improving timeliness of data. This tip introduces you to change data capture (CDC), a BI feature
in SQL Server 2008 that will help you update data in your data warehouse or data mart in near real
time, based on notification of data changes in the source transactional systems. This is a fairly
narrowly defined problem in data warehouse and business intelligence scenarios, but it's also a
very common one.
In the past, SQL Server administrators and developers would typically rely on solutions such as
custom T-SQL coding or using SQL replication to achieve similar results. With change data capture
in SQL Server 2008, Microsoft offers a native out-of-the-box feature that helps solve
Premium Access
Register now for unlimited access to our premium content across our network of over 70 information Technology web sites.
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States.
Privacy
Dig Deeper
-
People who read this also read...
-
This was first published in August 2008
data latency
issues in ETL processes. As with many features of data management and ETL in data warehousing,
Microsoft is a late entrant into the area of CDC. Other products on the market today include
Attunity Integration Suite, Informatica PowerExchange CDC and DataMirror, recently acquired by IBM.
Using SQL Server 2008 CDC for improving data timeliness from data source to data mart for these
examples will be based on the features as they exist in SQL Server 2008.
Why implement the change data capture feature?
Here is a common scenario: You have implemented a data mart or data warehouse by loading data
from data sources in SSIS (SQL Server Integration Services). The business analysts have built cubes
on that data for business reports. You've negotiated with the business users that loading data into
the warehouse nightly will suffice for all of their reporting needs. But once you have begun
previewing the solution to beta users, the requirements have changed. The business now has visions
of running hourly reports off the cubes.
So, now you have to take your ETL design and refresh the data hourly. The good news is that this
means your data warehouse, business intelligence and reporting solutions are having positive
effects and are being well received. The bad
 |
| More on SQL Server BI and data warehousing: |
|
|
|
|
 |
news is that perhaps you did not design or architect the
system for frequent updates. Now you're faced with the challenge of providing data updates closer
and closer to real time. These challenges range from system performance to scalability concerns to
user concurrency worries.
A key concept in CDC is to push data from a data source to your data warehouse based on
notification from the data source – typically a transactional database system – when the data is
updated at the source. This means that not every type of data source is an ideal candidate to
subscribe to CDC notifications. Very high-transactional workloads from a stock trading database or
order entry system may not be ideal candidates. Better candidates would be data sources such as CRM
customer data, product data or other dimensional reference data.
How to enable change data capture
The way to make change data capture work in SQL Server 2008 is to first configure the database
to enable the feature. Next, identify a filegroup to store the change data and then decide which
tables are to participate in the CDC notifications.
You can then query for the changed data with new CDC functions
cdc.fn_cdc_get_all_changes_ and
cdc.fn_cdc_get_net_changes_. The main difference between these two functions is
that the "get all changes" function will return every change that occurred per row of the
corresponding table in the log sequence number (LSN) range. The "get net changes" function will
return one row that may reflect multiple changes for the data, so you don't receive a distinct row
for every change that occurred since the last query. In addition to these new functions, you must
first decipher the proper LSN for your query by mapping the LSNs to the time parameters with this
CDC system function: sys.fn_cdc_map_time_to_lsn.
To use these functions, replace with the value of the capture instance
you configured to capture source table changes. The usage would look something like this:
SELECT * FROM
cdc.fn_cdc_get_all_changes_HR_Department(@from_lsn, @to_lsn,
'all');
In this example, I have chosen to bring back "all" records with the option in the functional so
all changes within that LSN range are included. Your other option is "all update old" which will
return all changes. It also returns both the row containing the column values before the update and
the row containing the column values after the update. To get the from and to LSN
values, you should first find the proper LSNs from the map functions described above, which would
be:
DECLARE @from_lsn binary(10), @to_lsn binary(10);
SELECT @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal',
GETDATE()-1);
SELECT @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal',
GETDATE());
Note the argument in single tick quotes. This tells SQL Server which greater than or
less than method to use and to include or not include records from the time query that
matches the specified value. Here are the options: largest less than, largest less than or equal,
smallest greater than or smallest greater than or equal.
In the end, there really aren't many steps to enable change data capture from your SQL Server
2008 databases and to add this capability into your data collection for your data warehouse. In
fact, they are summarized very well in Microsoft SQL Server Books
Online.
Choosing the proper data source candidates for enabling CDC is a critical decision when
optimizing your
 |
| Visit our IT Knowledge Exchange forum: |
|
|
|
|
 |
 |
data warehouse or BI solution from a data collection
perspective. Other CDC tools have their own nuances of setup and configuration. But as you migrate
SQL Server sources to the SQL Server 2008 version, this capability can be added natively. Then you
can collect just the changed data from your ETL process using built-in functions from SQL Server
and SSIS. Collecting only changed data on shorter intervals is becoming more and more common and
will increase data availability to your reporting solution business users.
ABOUT THE AUTHOR
Mark Kromer is an internationally recognized leader in the business intelligence and
database communities, having authored articles and blogs for TDWI, Microsoft, MSDN and Technet.
Mark has more than 15 years' experience in IT, focusing on database technologies and solutions,
including spending the past 2.5 years with Microsoft as the lead product manager for business
intelligence customer solutions. Check out his blog at http://blogs.msdn.com/makrom.
MEMBER FEEDBACK TO THIS TIP
Do you have comments on this tip? Let
us know.
Disclaimer:
Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.
Join the conversationComment
Share
Comments
Results
Contribute to the conversation