Integration Services in SQL Server 2008 provides three new features for better performance and ease of use for administrators and developers. In this tip, SQL Server BI expert Mark Kromer discusses how data integration has been kicked up a notch with improved scripting capabilities, data profiling and data flow.
SQL Server has become a very large and all-encompassing enterprise database package over the years, particularly with the addition of the business intelligence features: Analysis Services, DTS and now SSIS (SQL Server Integration Services). With SQL Server 2008, you get off-the-shelf developer tools (BIDS), administrator tools (SSMS), performance tools, systems for data warehousing, business intelligence, report-building services and so on.
That's good and bad news for those whose everyday task is to administer and manage a SQL Server environment. In this tip, I'll focus on the data integration tool, SQL Server Integration Services (SSIS), that ships with SQL Server 2008 and three new exciting features. These new features in SSIS 2008 are improved scripting, data profiling and data flow. There are other improvements in SSIS 2008, including change data capture (CDC) and merge capabilities in SQL Server 2008, that I will not have the space to detail here. To find out more about CDC, visit my previous tip on how the change data capture feature improves BI reporting accuracy.
Let's start with the new C# scripting capabilities in SSIS 2008. In the SQL Server 2005 release of SSIS, scripting tasks were only available in Visual Basic as part of the VSA (Visual Studio for Applications) component in SSIS. After a lot of customer requests for making C# an optional task scripting language, Microsoft now makes that available by fully incorporating Visual Studio Tools for Applications (VSTA) into SSIS as a script task (see Figure 1). Now you have access to the full .NET Framework and can perform much more complex actions through the scripting task. Drop the scripting task onto the Control Flow design surface and you will be prompted to select either Visual C# or Visual Basic as your language of choice. So now SSIS programmers and administrators are able to move away from VB scripts like this:
'Set new internal variable to DTS/SSIS package variable
Dim fromDate As Date = Me.Variables.FromDate
Do While fromDate <= Me.Variables.ToDate
fromDate = DateAdd(DateInverval.Day, 1, fromDate)
…to the somewhat more elegant new C# variant:
DateTime fromDate = this.Variables.FromDate; // From SSIS
While (fromDate <= this.Variables.ToDate)
Figure 1: New Script Task in SSIS 2008 is VSTA and includes both Visual Basic as well as Visual C# scripting.
One of the most exciting new components in SSIS 2008 is the data profile task. With this new feature, you can sample SQL Server data and create profiles of that data to look for patterns, look for NULLs and duplicate data, among other capabilities. A major drawback of this first data quality tool to be made available by Microsoft is that the only data source it works with is SQL Server -- at this time.
There are many uses for data profiling, including analyzing statistics from the output and identifying candidate keys, but for this tip, let's focus on data movement in business intelligence scenarios.
A data profiler is something that is very helpful in the time-consuming and tedious exercise of mapping data from one SQL Server database to another or when looking for the primary candidate for fields such as address, name or product in building dimensions for your data warehouse.
Many standalone ETL products on the market today include similar functionality and it is almost a necessity in the toolbox of data integration specialists. It is good to see Microsoft include data profiling in SSIS. But do not look for SSIS to become a standalone powerhouse ETL tool on its own anytime soon. This is likely to remain a tool attached to SQL Server and part of the overall SQL Server package value.
What you see in Figure 2 is the result of a data profiler task that has collected data samples from a database. When the task has completed in SSIS, you open up this viewer to display the results. The output from the package will be an XML file, and the best way to view and make sense of the output is to execute this viewer. The viewer is accessed outside of Visual Studio as a standalone application that is installed as a client tool with Integration Services. This sample output in Figure 2 is a look at the column length distribution profile.
Next is an enhancement to existing functionality. Microsoft has made buffer and caching improvements to its SSIS data flow components. Additional asynchronous capabilities in data flow components will easily provide low-hanging fruit performance gains in your SSIS pipeline. If you're familiar with SSIS in 2005 and have used data flow components in large, complex packages, then you've undoubtedly run into situations where performance quickly degrades when piping data. Most likely you were forced to make design decisions on your pipeline based on emphasizing parallelism. Well, the good news with SSIS 2008 is you don't need to worry as much about these issues and you can begin eliminating intricately designed performance-gain tricks.
Most of the bottlenecks I've seen in past SSIS routines occur when it's performing too many row-level operations or blocking components such as aggregates and sorting or when data needs to be copied to a new buffer instead of passing the same buffer onto the next transformation. In SSIS 2008, there is now the cache transform, which you can use to persist data throughout your dataflow. This allows you, for example, to reuse this cache in multiple lookups without requiring the SSIS engine to repopulate the cache each time. Figure 3 shows setup screens for the cache transform, which I have attached to my data source:
This technique can be used as a nice timesaver and the improvements in data flow components are significant, but you still must test and benchmark your packages for performance.
SQL Server 2008 and SSIS 2008 are not revolutionary releases for Microsoft. Instead, they expand upon the breakthrough foundation of SQL Server 2005 and continue to move the product toward becoming a more complete enterprise-ready, industrial-strength database option for businesses. As Microsoft moves further away from the old Sybase/SQL Server 7.0 years, SQL Server perceptions continue to improve, and those in the data warehouse and business intelligence areas will benefit from the increasing functionality, ease of use and performance that SSIS 2008 has for data integration. SSIS 2008 is still not a standalone ETL option because it does not include critical functionality such as managing metadata or high-speed connectors for the plethora of heterogeneous data sources in the market today.
But another real advantage that SQL Server administrators and developers should be aware of is that migrating from SQL Server 2005 SSIS packages to SQL Server 2008 packages is going to be nowhere near the pain involved for those that went through the fun of migrating from DTS in SQL Server 2000 to SSIS 2005. The fundamentals of the server and engine are, thankfully, the same this time around.
ABOUT THE AUTHOR
Mark Kromer is an internationally recognized leader in the business intelligence and database communities, having authored articles and blogs for TDWI, Microsoft, MSDN and Technet. Mark has more than 15 years' experience in IT, focusing on database technologies and solutions, including spending the past 2.5 years with Microsoft as the lead product manager for business intelligence customer solutions. Check out his blog at http://blogs.msdn.com/makrom.
Do you have comments on this tip? Let us know.