
Data cleansing techniques in SQL Server

Data that is collected and used for decision-making purposes must be accurate from the start in your SQL Server environment. Refer to this checklist of data cleansing techniques and ensure data accuracy.

When it comes to decision making, the value of data is in its accuracy. If the data is inaccurate, decision makers can make flawed assumptions, which could have disastrous consequences. In this tip, I will outline techniques to ensure data accuracy, including:

  • Collection validation
  • Referential integrity
  • Lookup table usage
  • Leverage the application logic
  • Cross checks
  • Mapping tables
  • Summary validation
  • Spot check validation
  • Document the norms
  • NULL value management
  • Validation logic
  • Exception handling and remediation
Checklist: Data Cleansing Techniques
    Collection validation
    One of the easiest ways to ensure data is accurate for reporting and decision-making purposes is to collect accurate data from the start. This makes life much easier for downstream systems because they will not need extensive validation logic during the data-import process. You may encounter some difficulty depending on the business, so consider these options:
  • Validate all input data. Make sure it meets a reasonable set of rules.
  • Limit the data that is entered in free form. Use a check box and drop down box paradigm.
  • Determine how you can offer employee incentives for accurate data input.
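As a minimal sketch of validating input at collection time, CHECK constraints can reject bad values the moment they are entered. The table and column names below are hypothetical:

```sql
-- Hypothetical order-entry table: constraints reject bad data at collection time
CREATE TABLE dbo.OrderEntry
(
    OrderID    INT IDENTITY(1,1) PRIMARY KEY,
    OrderDate  DATETIME NOT NULL DEFAULT GETDATE(),
    Quantity   INT NOT NULL CHECK (Quantity > 0),
    StatusCode CHAR(1) NOT NULL CHECK (StatusCode IN ('N', 'S', 'C'))  -- New, Shipped, Cancelled
);
```

An INSERT with a zero quantity or an unknown status code fails immediately, so the bad row never reaches downstream systems.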
    Referential integrity and constraints
    I am surprised by the number of systems that do not use the native referential integrity and constraint features of SQL Server. These techniques, by default, are a reasonable means to ensure data is accurate between the tables. They also impose rules so only accurate data is provided within a set of boundaries. This is a simple method to see that your data is not orphaned and meets a business rule. By using the native referential integrity and constraints, the downstream systems can pay less attention to validating dependent data and correcting business rules that have been violated.
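A brief sketch of the native features described above, using hypothetical parent and child tables: the foreign key prevents orphaned detail rows, and the check constraint enforces a simple business rule.

```sql
-- Parent table: every detail row must reference an existing order
CREATE TABLE dbo.OrderHeader
(
    OrderID INT PRIMARY KEY
);

CREATE TABLE dbo.OrderDetail
(
    OrderDetailID INT IDENTITY(1,1) PRIMARY KEY,
    OrderID       INT NOT NULL
        CONSTRAINT FK_OrderDetail_OrderHeader
        FOREIGN KEY REFERENCES dbo.OrderHeader (OrderID),
    LineTotal     MONEY NOT NULL CHECK (LineTotal >= 0)  -- business rule: no negative line totals
);
```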
    Lookup table usage
    Lookup tables ensure a known value is used in an application or an ad-hoc script instead of a hard-coded value. Lookup tables can prevent the application or user from entering an incorrect value, which downstream systems would consequently need to handle. Lookup tables offer a standardized data value and simplify the downstream data-handling approach.
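A minimal sketch of this idea, with hypothetical table names: the lookup table holds the valid values, and the foreign key forces any referencing table to use one of them.

```sql
-- Hypothetical lookup table holding the only valid status values
CREATE TABLE dbo.OrderStatus
(
    StatusID   TINYINT PRIMARY KEY,
    StatusName VARCHAR(20) NOT NULL UNIQUE
);

INSERT INTO dbo.OrderStatus (StatusID, StatusName) VALUES (1, 'New');
INSERT INTO dbo.OrderStatus (StatusID, StatusName) VALUES (2, 'Shipped');

-- Any table that stores a status must reference the lookup table
CREATE TABLE dbo.CustomerOrder
(
    OrderID  INT PRIMARY KEY,
    StatusID TINYINT NOT NULL
        FOREIGN KEY REFERENCES dbo.OrderStatus (StatusID)
);
```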
    Leverage the application logic
    A common occurrence in some systems is a mass update or bulk data load process. These processes are typically much more efficient than having users enter the data. But the business logic cannot be overlooked, so these processes must take the same steps the application would to ensure the business logic is applied. If not, the downstream systems will need to recognize that this data has not been processed with the same business rules as the direct application data, and it needs to be validated.
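One common pattern for applying the application's business rules to a bulk load is to land the data in a staging table first, then move only the rows that pass the same rules. A hedged sketch with hypothetical staging and lookup tables:

```sql
-- Bulk-load lands in dbo.Customer_Staging; only rows that satisfy the
-- application's business rules are promoted to the base table
INSERT INTO dbo.Customer (CustomerName, Region)
SELECT s.CustomerName, s.Region
FROM dbo.Customer_Staging AS s
WHERE s.CustomerName IS NOT NULL
  AND s.Region IN (SELECT Region FROM dbo.RegionLookup);
```

Rows that fail the WHERE clause remain in staging for review instead of silently entering the system.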
    Cross checks
    Depending on the data and platform, a natural means may be in place to perform checks and balances on the data. For example, if the system is financially driven, credits must equal debits. For manufacturing systems, orders placed and orders shipped should be equal. Based on your system, determine which checks and balances you'll need to reconcile the data.
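The financial example above can be sketched as a single query against a hypothetical ledger table:

```sql
-- Cross check: total credits should equal total debits
SELECT
    SUM(CASE WHEN EntryType = 'C' THEN Amount ELSE 0 END) AS TotalCredits,
    SUM(CASE WHEN EntryType = 'D' THEN Amount ELSE 0 END) AS TotalDebits
FROM dbo.LedgerEntry;
-- If the two totals differ, drill into the detail rows to find the discrepancy
```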
    Mapping tables
    Many systems that are responsible for decision making gather their data from a number of source systems. Source systems typically have a variety of data values that map to a single value. To manage this, consider building mapping tables in the reporting system between the original values and the final values. This should improve the distinction between valid values and values that are out of scope.
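A minimal sketch of a mapping table, with hypothetical names: many source spellings resolve to one standard reporting value, and rows with no mapping fall out for remediation.

```sql
-- Many source values resolve to one standard reporting value
CREATE TABLE dbo.StateMapping
(
    SourceValue   VARCHAR(50) PRIMARY KEY,  -- e.g. 'MD', 'Md.', 'Maryland'
    StandardValue VARCHAR(50) NOT NULL
);

-- Resolve incoming values; rows with no match are excluded by the inner join
SELECT c.CustomerID, m.StandardValue AS StateName
FROM dbo.Customer_Staging AS c
INNER JOIN dbo.StateMapping AS m
    ON c.StateName = m.SourceValue;
```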
    Summary validation
    Validate the data at a summary level. If the summary values are in sync, no additional validation or remediation is needed; otherwise, execute another process to identify the inaccurate detail data. Validating data at a high level can be a simple and effective way to verify data accuracy.
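A hedged sketch of a summary-level comparison, assuming hypothetical source and reporting databases:

```sql
-- Compare a source total to the reporting total at a summary level
SELECT
    (SELECT SUM(Amount) FROM SourceDB.dbo.Sales) AS SourceTotal,
    (SELECT SUM(Amount) FROM ReportDB.dbo.Sales) AS ReportTotal;
-- If the totals match, skip detail validation; otherwise run a row-level comparison
```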
    Spot check validation
    When it comes to good old-fashioned data cleansing, not much beats performing a spot check of the data. Although this is a manual process, a trained eye can find errors quickly and point out opportunities to correct the data at the source and avoid the manual process in the future. This can be the best approach and should be considered from time to time.
    Document the norms
    To help determine whether a problem is occurring or whether logic has changed in upstream systems, document the normal values and compare the current values against them. If they diverge, determine the accurate values and correct the system at fault.
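One way to sketch this comparison, assuming hypothetical tables that store the documented norms and the current metrics:

```sql
-- Flag metrics that drift from their documented normal values
SELECT n.MetricName, n.ExpectedValue, c.CurrentValue
FROM dbo.MetricNorms AS n
INNER JOIN dbo.CurrentMetrics AS c
    ON n.MetricName = c.MetricName
WHERE ABS(c.CurrentValue - n.ExpectedValue) > n.Tolerance;  -- Tolerance defines the acceptable drift
```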
    NULL value management
    If there is any value that can cause havoc, it is NULL. Have a firm understanding of how the systems use NULL and make sure everyone understands the meaning of NULL. If you use NULL in different contexts across the business or in specific processes, then consider building lookup tables to support business needs. In addition, as a DBA or developer, be sure you understand how NULL will potentially impact calculations or logic in your code.
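Two short examples of the impact on calculations. The first shows NULL propagating through arithmetic; the second uses ISNULL to substitute a default (the Product table is hypothetical):

```sql
-- NULL propagates through calculations: this returns NULL, not 100
SELECT 100 + NULL;

-- ISNULL (or COALESCE) substitutes a default so the calculation behaves predictably
SELECT ISNULL(Discount, 0) AS Discount,
       Price * (1 - ISNULL(Discount, 0)) AS NetPrice
FROM dbo.Product;
```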
    Validation logic
    If you could not implement some or all of the earlier processes, then it comes down to building accurate validation logic for the system responsible for the final reporting. The process should start with analyzing the data from the various systems and agreeing on a single set of values. One approach is to map the known values to a consolidated set of values, which should streamline the data cleansing process. Next, work toward correcting the unknown or incorrect values, which can be handled separately. Once you address the problem from a business and process perspective, you can address it from a technology perspective with SQL Server Integration Services or Data Transformation Services and custom logic to meet your needs.
    Exception handling and remediation
    The reality is that some exceptions will occur during the data cleansing process -- especially if you work with a number of disjointed systems. Expect and handle these exceptions properly. During the data cleansing processes, create a separate set of tables for the errors or exceptions. Based on the data, build a process to analyze this data and update the data cleansing code in SQL Server Integration Services or Data Transformation Services to meet these needs. Keep in mind that if a condition was initially an exception, the final result may be a typical business case, which was unexpected.
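A hedged sketch of routing exceptions to a separate table during cleansing, using hypothetical staging, mapping, and exception tables:

```sql
-- Rows whose state value has no mapping are routed to an exception table
-- for analysis instead of being silently dropped or loaded incorrectly
INSERT INTO dbo.Customer_Exceptions (CustomerName, StateName, ErrorReason, LoadDate)
SELECT s.CustomerName, s.StateName, 'Unmapped state value', GETDATE()
FROM dbo.Customer_Staging AS s
LEFT JOIN dbo.StateMapping AS m
    ON s.StateName = m.SourceValue
WHERE m.SourceValue IS NULL;  -- LEFT JOIN with NULL test finds the unmatched rows
```

Reviewing the exception table over time shows which "errors" are actually recurring business cases that the cleansing code should be updated to handle.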

    Jeremy Kadlec is the Principal Database Engineer at Edgewood Solutions, a technology services company delivering professional services and product solutions for Microsoft SQL Server. He has authored numerous articles and delivers frequent presentations at regional SQL Server Users Groups and nationally at SQL PASS. He is the author of the Rational Guide to IT Project Management. Jeremy is also the Performance Tuning expert.
    Copyright 2006 TechTarget
