A common challenge of working on large business intelligence systems or data warehouse projects is improving timeliness of data. This tip introduces you to change data capture (CDC), a BI feature in SQL Server 2008 that will help you update data in your data warehouse or data mart in near real time, based on notification of data changes in the source transactional systems. This is a fairly narrowly defined problem in data warehouse and business intelligence scenarios, but it's also a very common one.
In the past, SQL Server administrators and developers would typically rely on solutions such as custom T-SQL coding or using SQL replication to achieve similar results. With change data capture in SQL Server 2008, Microsoft offers a native out-of-the-box feature that helps solve data latency issues in ETL processes. As with many features of data management and ETL in data warehousing, Microsoft is a late entrant into the area of CDC. Other products on the market today include Attunity Integration Suite, Informatica PowerExchange CDC and DataMirror, recently acquired by IBM. Using SQL Server 2008 CDC for improving data timeliness from data source to data mart for these examples will be based on the features as they exist in SQL Server 2008.
Why implement the change data capture feature?
Here is a common scenario: You have implemented a data mart or data warehouse by loading data from data sources in SSIS (SQL Server Integration Services). The business analysts have built cubes on that data for business reports. You've negotiated with the business users that loading data into the warehouse nightly will suffice for all of their reporting needs. But once you have begun previewing the solution to beta users, the requirements have changed. The business now has visions of running hourly reports off the cubes.
So, now you have to take your ETL design and refresh the data hourly. The good news is that this means your data warehouse, business intelligence and reporting solutions are having positive effects and are being well received. The bad
news is that perhaps you did not design or architect the system for frequent updates. Now you're faced with the challenge of providing data updates closer and closer to real time. These challenges range from system performance to scalability concerns to user concurrency worries.
A key concept in CDC is to push data from a data source to your data warehouse based on notification from the data source – typically a transactional database system – when the data is updated at the source. This means that not every type of data source is an ideal candidate to subscribe to CDC notifications. Very high-transactional workloads from a stock trading database or order entry system may not be ideal candidates. Better candidates would be data sources such as CRM customer data, product data or other dimensional reference data.
How to enable change data capture
The way to make change data capture work in SQL Server 2008 is to first configure the database to enable the feature. Next, identify a filegroup to store the change data and then decide which tables are to participate in the CDC notifications.
You can then query for the changed data with new CDC functions cdc.fn_cdc_get_all_changes_
To use these functions, replace
SELECT * FROM
In this example, I have chosen to bring back "all" records with the option in the functional so all changes within that LSN range are included. Your other option is "all update old" which will return all changes. It also returns both the row containing the column values before the update and the row containing the column values after the update. To get the from and to LSN values, you should first find the proper LSNs from the map functions described above, which would be:
DECLARE @from_lsn binary(10), @to_lsn binary(10);
SELECT @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', GETDATE()-1);
SELECT @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE());
Note the argument in single tick quotes. This tells SQL Server which greater than or less than method to use and to include or not include records from the time query that matches the specified value. Here are the options: largest less than, largest less than or equal, smallest greater than or smallest greater than or equal.
In the end, there really aren't many steps to enable change data capture from your SQL Server 2008 databases and to add this capability into your data collection for your data warehouse. In fact, they are summarized very well in Microsoft SQL Server Books Online.
Choosing the proper data source candidates for enabling CDC is a critical decision when optimizing your
data warehouse or BI solution from a data collection perspective. Other CDC tools have their own nuances of setup and configuration. But as you migrate SQL Server sources to the SQL Server 2008 version, this capability can be added natively. Then you can collect just the changed data from your ETL process using built-in functions from SQL Server and SSIS. Collecting only changed data on shorter intervals is becoming more and more common and will increase data availability to your reporting solution business users.
ABOUT THE AUTHOR
Mark Kromer is an internationally recognized leader in the business intelligence and database communities, having authored articles and blogs for TDWI, Microsoft, MSDN and Technet. Mark has more than 15 years' experience in IT, focusing on database technologies and solutions, including spending the past 2.5 years with Microsoft as the lead product manager for business intelligence customer solutions. Check out his blog at http://blogs.msdn.com/makrom.
Do you have comments on this tip? Let us know.