Most database administrators have been there. There’s been a massive change to many rows or to the schema of a...
table, and there’s a real need to look at what happened to the data before and after those changes. A well-written database app may have its own journaling function, or it simply may be set up in such a way that data is never explicitly overwritten. If not, you can break out backups and compare them to the current version, with all the headaches that implies. But what about native SQL Server functions for finding out what’s different?
Several technologies can be used for SQL Server data preservation -- change tracking, change data capture, replication and database mirroring -- each with its own use and best suited for certain scenarios. I’m not talking here about backup techniques, but rather ways to preserve and access historical versions of data in a given object, such as earlier versions of a row or table.
SQL Server change tracking is simple enough. When activated, it makes a note of which rows in what tables were changed whenever a data manipulation language (DML) action -- insert, update or delete -- takes place, what columns were modified in an update (if column tracking has been turned on), as well as the time and date of the transaction. The tracked information is available via the sys.dm_tran_commit_table system object, which can in turn be used to get table-specific information from specially named internal tables created whenever you enable change tracking on a given table.
By default, changes are tracked for two days. You can modify the length of time that the changes are recorded via the ALTER DATABASE command or the Database Properties pane in SQL Server Management Studio. The minimum time is one minute, and while in theory there’s no maximum, I’d be leery of setting it to more than two days on highly trafficked databases, since there's a possibility the system objects could grow out of control.
The biggest drawback to change tracking is that a lot of information about the changed data -- and the data schema -- is not preserved. For instance, change tracking doesn’t preserve any information about what the value of a given row was before you made changes to it. Only the most recent data for the row is available, and that's in the table itself. If you drop a table entirely, all change tracking information for that table is also dropped.
Another drawback, albeit a minor one, is that change tracking will only work on databases with a compatibility level set to 90 or higher. If it’s less than 90, you can still set the options for change tracking, but trying to obtain tracking information through CHANGETABLE function will produce errors.
To that end, change tracking isn’t useful as a data-preservation measure. It is useful as a low-impact way of gathering statistics about changes to tables. This is handy if you’re trying to get real-world information about the most or least commonly updated columns or tables in a given database and you want to optimize your database design based on that.
Change data capture
SQL Server change data capture differs from change tracking in that it preserves both the DML changes made and the data that was modified. Unlike change tracking, the captured data is stored in a format that duplicates the structure of the table being captured (barring some additional columns for storing metadata). The results are published through a series of table-valued functions, and changes are tracked asynchronously via the transaction log, so the tracking process has little impact on system performance. By default, three days’ worth of data is retained, although this can be ramped up.
Change data capture also makes a slightly more robust attempt to store information about schema changes during the capture process. Any columns added during change data capture are not recorded; any dropped columns will continue to be captured with NULL values. But changes to the schema are also tracked in a separate table, so it’s possible to create more than one capture table for a given table if its structure has changed during the capture process. So if columns have been added or dropped, for example, you'll have capture instances for both the old and new structure of a given table. But you can't have more than two capture instances active on a table at any one time.
Change data capture’s utility as a data-preservation system is much more useful than mere change tracking, since you have access to the modified data through the change data capture stored procedures. But it doesn’t really work as a shadow-copy version of the whole database, which lets you specify a point in time and see a snapshot of past data. (In theory someone could write an application to do that using the change data capture application planning interfaces, but it’s not something you can do out of the box.)
Also, there are some limits on how this works with other methods discussed here. Change data capture stores its information as system tables, which can’t be replicated -- the change-tracking information is only accessible on the original system.
One thing change data capture is useful for is incremental loading and synchrony of data -- something Microsoft makes a case for. To that end, change data capture is also useful when you need to have a continuous stream of changes analyzed by some other resource, on top of being able to see a single point-in-time state of a given piece of data.
One of the benefits of replication is that it lets you consolidate data from multiple sources. So if you’re trying to preserve data states from more than one database or data provider, replication allows you to cover some of that territory. Its value as a SQL Server data-preservation methodology, though, varies depending on the type of replication you’re using.
Transactional replication pushes changes to the target database (the "subscriber") whenever a transaction is concluded in the source database (the "publisher"). This allows for low-latency updates to the subscriber and access to intermediate changes (such as the state of the data, pre- and post-triggers). Microsoft advises that transactional replication is best when you require read-only copies of a database.
Merge replication is the same as transactional replication except that the replication is a two-way street: Changes are tracked from both publishers and subscribers, and conflicts between the two are resolved automatically. However, merge replication doesn’t track intermediate data states; it only tracks the final value for a given row.
As far as SQL Server data preservation goes, both of these modes are best suited for when you need to have an incrementally updated copy of a whole object kept somewhere. They’re not quite as useful for preserving the individual state of an object at a designated moment in time. For that you need snapshot replication. As the name implies, snapshot replication takes a snapshot of an object and replicates its state at that moment in time, providing a complete refresh of the data. It’s good way to do moment-in-time data preservation, but at the cost of the overhead required to take the snapshot and push it to the subscriber. You’re essentially making a copy of the whole object, so that incurs a cost, and may be best for individual tables that aren’t too large to begin with. Also, there is also little in the way of a history of object changes, as per change tracking or change data capture; the snapshot is static. The only way to derive any historical information is to compare one snapshot with another, which is potentially quite unwieldy.
From the outside, SQL Server database mirroring resembles replication. Two copies of a database are maintained on separate computers, and updated in synchrony. The difference is that the mirrored copy is used in the event of a database failure, so it’s best used if you are most concerned with preserving availability of data against possible losses of connectivity incurred by downtime if a server goes offline. In this case, the availability goes both ways: it’s not just about getting data from the database in question, but being able to publish changes back to it. (The data is being preserved, but not in the sense that row-level changes are tracked.)
One advantage of having all these technologies on hand is they can be mixed and matched depending on your SQL Server data-preservation needs. Change data capture can be used to itemize changes to particular rows of data -- what was overwritten and when -- while replication can be used to ensure that the most current copy of all data is available in a uniform way. The main tradeoff with all these methodologies is between having a single, consistent view of the data and a continuous stream of information about its changes. The key is knowing what level of preservation you need and in which contexts.
ABOUT THE AUTHOR
Serdar Yegulalp has been writing about computers and IT for more than 15 years for a variety of publications, including InformationWeek and Windows Magazine.