If you want to use your company's vast amount of data to design better business intelligence (BI) strategies, you need to understand the nature and context of the data source that generates this information. Before you can create computations, summaries and time-phased aggregates, it's important to spend the time up front by using data-profiling tools to analyze your data, better understand your data source and prevent data-quality problems from emerging later on.
Generally, the best BI architects and the most valuable BI systems are those where 75% or more of the project time is spent on data collection, data cleansing and all aspects of extract, transform and load (ETL). This level of commitment means observing approved common practices, including using a data-profiling tool to examine data sources, identifying patterns in data fields and the best candidates for keys. In particular, data profiling in SQL Server is an important method to analyze information.
Moving beyond building your own data profiling tools
Prior to SQL Server 2008, only companies such as DataFlux and IBM offered complete data-profiling options. You could also, however, hand-code your own data profiler in T-SQL and .NET. In fact, resourceful database administrators have created their own processes to perform identification and profiling of SQL Server data sources. Many T-SQL programmers have scripted personalized routines to identify null counts in address tables, out-of-sequence patterns in zip codes and deduplication routines.
A library post on data quality in SQL Server 2005 is one of my favorites. It was written before Microsoft increased focus on data profiling, and it provides a great example of the complexities involved in hand-coding your own solution.
The Data Profiling task in SQL Server 2008, by contrast, is a Microsoft-produced product that increases efficiency and offers myriad capabilities. It comes as part of SQL Server Integration Services (SSIS), and it eliminates the need to worry about hand-coding specifics. In fact, the pre-built profiles in SSIS are quite similar to the constraints in the manually created expression builder:
@[User::TotalCount] > 0 && @[User::NullCount] == 0 && @[User::DuplicateKeyCount] == 0 && @[User::InvalidDateCount] == 0
As is the case with most SSIS capabilities, the Data Profiler task works only with SQL Server data sources. It is still, however, an effective tool to reduce the number of tedious processes involved in the data collection stages of building a BI solution. Without profiling data sources and providing this level of data inspection, you leave your business solution open to incorrect data, invalid reporting and bad decision making.
Using the SSIS 2008 Data Profiling task and Data Profile Viewer
Once you grasp the importance of profiling data from your SQL Server data sources, it's clear how the SSIS 2008 Data Profiling task and associated Data Profile Viewer can help. Below is an example from the sample database AdventureWorks, where the Column Value Distribution Profiles tab is highlighted:
SSIS includes this group of profiles out-of-the-box, and it proves useful to examine data contents in table fields. In this example from the Data Profile Viewer, I highlighted the City field from the Address table. The top-right-hand panel on the window indicates that the SSIS data profile task found 575 unique values in the City field, and the bottom pane demonstrates the number of occurrences of each unique value. Having this information easily accessible frees time that you would otherwise spend figuring it out manually. In turn you can spend more time comparing the value distributions to determine the best source for your staging tables.
To use the Data Profiling task with your databases, create a SSIS project. You need SQL Server 2008, although the task can be run against data stored in older versions of SQL Server. Following the sample screenshot below, connect your data source from a data flow task to a data profiling task:
The Data Profiling task wizard prompts you to enter information about the source table, and then you will have the option to select from the list of profiles. In the sample figures for this tip, I asked SQL Server to execute each of the profiles available. You can then direct the output to an XML file, which can be opened from the Data Profile Viewer shown below.
You can execute the Data Profile Viewer from the list of SQL Server client tools, but it is also accessible from the SQL Server Integration Services program group. You may notice that the screen shows the executed profiles and the table. The Viewer tool does so by parsing the results of the XML file that has been created by running the SSIS package shown above. For this sample data, I ran the package from the Business Intelligence Development Studio in debug mode and in just a few seconds, the AdventureWorks table's results displayed.
Improvement trend for Microsoft BI features and functionality
This example demonstrates that the Data Profiler task in SSIS 2008 is Microsoft's effort to provide a complete, native toolkit to BI professionals. Microsoft's advances, however, mainly serve to strengthen its own capabilities, so do not expect SSIS- and its BI-centric additions to emerge as a viable replacement for Informatica, DataStage and other preferred options for managing large, heterogeneous data centers. Nevertheless, a feature such as native data profiling does make it possible to build stronger data warehouse and BI solutions in SQL Server without using a third-party technology for filling gaps in functionality.
ABOUT THE AUTHOR
Mark Kromer is an internationally recognized leader in the business intelligence and database communities, having authored articles and blogs for TDWI, Microsoft, MSDN and Technet. Mark has more than 15 years' experience in IT, focusing on database technologies and solutions, including spending the past 2.5 years with Microsoft as the lead product manager for business intelligence customer solutions. Check out his blog at http://blogs.msdn.com/makrom.