SQL Server’s Data Quality Services is a data-cleansing feature that will be introduced with the upcoming SQL Server 2012 but is now available in a release candidate edition. The application lets database administrators set up a standard set of values for an existing table. It then checks all values entered by users against that set, changing the nonstandard ones.
By submitting your personal information, you agree that TechTarget and its partners may contact you regarding relevant content, products and special offers.
As seen in Figure 1, the various accepted values for the U.S., including the standard numeric value 840 as well as “US,” “USA,” and “United States of America” are all converted to “United States” when the Data Quality Services client is run against a table in the database.
Figure 1. The Domain Values tab of the SQL Server Data Quality Services client application shows the specific values stored within the domain.
Data Quality Services, also known as DQS, can process a variety of values right of out of the box. It can convert country names and abbreviations into either two- or three-letter abbreviations and can correct the names of the counties in the U.S. DQS can also work with much larger data sets as well, such as people’s last names and U.S. place names, a large data set that includes cities, towns, villages, Air Force bases and more.
Microsoft supplies a number of data sets, which it calls domains, but the application isn’t limited to using just those. The system can be fully extended with new domains if the pre-existing ones don’t fit your needs. Adding new domains into the system is easy. You can populate the domain by typing the data into the DQS client application or loading it into an Excel sheet.
When you run data through Data Quality Services, the system actually learns more about your data, adding new matching values from the data set into the values stored within the domain. That way, as the data set is processed again and again, fewer mismatches will be found.
Adding a new domain in Data Quality Services
To create a new domain, you’ll need a knowledge base, which is a set of domains grouped together. Open an existing knowledge base, such as the included default DQS Data, or create a new one. To edit a knowledge base, hit the Open Knowledge Base button from the home screen, select the knowledge base you want to edit and click on Next.
Once the Domain Management screen opens, click on the new-domain button, circled in Figure 2.
Figure 2. On the Domain Management screen, click on the new-domain button.
When the Create Domain window opens, as shown in Figure 3, name the domain and give it a description. You can also specify the data type that will be used and whether the string should be normalized. The domain can also be configured to store the data as entered, in upper case, lower case or with each word capitalized. You can also specify the language in which the data will be entered.
Figure 3. The Create Domain window allows users to create a new data set.
Now you’re ready to configure the domain. New domains can be configured from a set of base reference data on the Reference Data tab. This requires having a DataMarket account key from the Microsoft Azure online store. Once you’ve added data to your online profile, you can download the reference data into the domain. On the Domain Rules tab you can set up specific rules that determine which data should be processed through the domain. The rules define which rows are processed by the Data Quality Services and can be as simple as specifying a length or as complex as matching against a regular expression.
You can enter specific domain values on the Domain Values tab. Import data from an Excel sheet as shown in Figure 4 or enter the data manually using this tab. One thing to note is that when using a 64-bit installation of Excel, XLSX files are not supported for import; only XLS and CSV files are supported.
Figure 4. This screenshot of an example Excel document shows the data that will be loaded into Data Quality Services for matching.
After an Excel sheet is imported, the values will show on the Domain Values tab, as shown in Figure 5. At this point the domain can be pushed into SQL Server for storage by clicking on Finish.
Figure 5. The Data Quality Services client displays values after the data is imported.
Using the Data in Data Quality Services
You can now use the server to process data. First create a Data Quality Project on the home screen of the Data Quality Services client application and select the data you want to process. You can store in a SQL Server database, as shown in Figure 6, or within an Excel sheet. Figure 6 shows mapping data from the Adventure Works 2008 R2 sample database using the Person.CountryRegion table. (Note that the Table/View drop-down menu doesn’t include the table schema.)
Figure 6. This screenshot shows the mapping of columns from the table that will be matched with columns in the DQS domain.
Clicking Next will take you to the Cleanse screen. On this screen the application will download all the values from the table and run the data through the cleaning process. When the process is complete the screen will tell you how many values are correct, how many were corrected and how many need to be corrected by confirming the suggested values or specifying new values, as shown in Figure 7.
Figure 7. Data Quality Services shows the results of the data-cleansing process.
On the next screen, you can review all of the recommended and completed changes and approve or reject the changes. You can then export the values to another database for use within the database application, as shown in Figure 8.
Figure 8. The export screen gives a list of recommended and complete changes. Here you can approve or reject the changes.
The data-cleansing process in SQL Server Data Quality Services is pretty easy. It gives you nice, standardized output that can be used throughout your application. Once the exported data is back in the SQL Server database, it can be moved back into your production tables using traditional Transact-SQL. This process gives users a lot of flexibility -- they can use any method they like -- while the data can be easily cleaned without having to write your own algorithms.
ABOUT THE AUTHOR
Denny Cherry is an independent consultantwith more than a decade of experience working with SQL Server, Hyper-V, vSphere and enterprise storage systems. A Microsoft Certified Master and MVP, Cherry has written books and technical articles on SQL Server management and integration. Check out his blog at SQL Server with Mr. Denny.