Handle slowly changing dimensions with SSIS 2005 wizard
Baya Pavliashvili, Contributor
In previous versions of SQL Server, you had to manage the process of implementing slowly
changing dimensions through custom logic, often embedded in Data Transformation Services (DTS)
packages. Within a DTS package, you had to manually compare each relevant column's value between
the dimension table and the corresponding data source. If any differences were detected, you'd have
to decide whether you wanted to create another row or overwrite the existing row to manage slowly
changing dimensions (SCD). SQL Server 2005 Integration Services has a transformation that makes
implementing SCD a matter of running a wizard.
Slowly changing dimensions are a part of almost every data warehouse. Rarely does one see
dimensions whose member values remain constant. For example, customer
Premium Access
Register now for unlimited access to our premium content across our network of over 70 information Technology web sites.
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States.
Privacy
Dig Deeper
-
People who read this also read...
-
This was first published in October 2007
dimension attributes such as
last name, occupation, home address and others are likely to change from time to time. Some changes
are relatively minor. For example, if the department name changes from "marketing" to "marketing
and sales," you probably won't need to track the history of changes. However, the adjustments
history of customers' income level or income-to-debt ratio can be very significant for financial
organizations that sell consumer loans. In such cases, data warehouse architects must come up with
a way of tracking the changes.
There are multiple methods for maintaining changes in dimensions member values, but here are the
most commonly used methods:
- Overwrite the existing dimension member values. This method does not provide a way of tracking
the history of changes. This method is commonly referred to as "type 1 SCD."
- Create a new row with the current dimension members. Tag the existing row as expired. This
method is commonly referred to as "type 2 SCD."
-
Here's how the Slowly Changing Dimensions Wizard works in SSIS and what it can do for a data
warehouse architect:
- Before you run the SCD transformation wizard, you should configure your data source and
destination. For this example, I used a SQL Server relational database as the data source and a
data warehouse (which also happens to be a SQL Server database) as the destination. You could also
use the SCD transformation for data sources and destinations other than SQL Server.
- SSIS projects are normally developed through Business Intelligence Development Studio (BIDS).
First, create an SSIS project (File→New→Project, and choose Integration Services Project template),
then add a data flow task to the control flow. Activate the data flow tab, then identify your data
source and destination within BIDS. Next drag the SCD transformation to the Data Flow designer and
drag the green arrow, which denotes output from your data source, to the SCD transformation. Then
double-click SCD transformation to activate the SCD wizard.
- The initial screen simply welcomes you and informs you of the Slowly Changing Dimensions
wizard's capabilities in SQL Server Integration Services 2005. You can choose to omit this screen
in the future. The next screen allows you to pick one or multiple business keys. A business key
uniquely identifies each record in the table that is used to populate your dimension table.
Business keys are presumed to be static – once the key is assigned to the record it should not
change. Each dimension table could have one or multiple business keys.
Typically your dimension will also have a surrogate key – a column not found in the data source,
but rather added to the dimension table during extraction, transformation and loading (ETL)
process. Typically, surrogate keys are implemented as identity columns – these columns have no
business meaning, but they uniquely identify each dimension record.
- Once you have identified the business key it's time to choose the columns (referred to as
attributes) you wish to maintain within your SCD. The wizard gives you three options:
-
- Fixed attribute – values in this column should not change. If SSIS detects a
change in the value of a column tagged as a fixed attribute, it could raise an exception and fail
the execution of the package. You'll get an option to configure your package as such on the next
screen. Detecting changes in values of a fixed attribute isn't really part of slowly changing
dimensions implementation, but this functionality is useful in case you want to identify problems
in your source data while your SSIS package is executing.
- Changing attribute – is a type 1 SCD; the values of a changing attribute are simply
overwritten. History of changes is not recorded.
- Historical attribute – this is type 2 SCD. The modified value is saved in a new record
and the existing record is tagged as expired.
For example, in Figure 1 (Slowly Changing Dimension Columns), the screenshot tags
AccountDescription column as a changing attribute, AccountType as a historical attribute and
Operator as a fixed attribute.
Figure 1. Change types: fixed attribute, changing attribute and historical attribute.
In Figure 2 (Historical Attribute Options), you can configure details of your implementation.
You can advise SSIS to fail the package when the changes in fixed attributes are detected, and you
can change all matching records, including the expired records when
modifications are detected in the changing attribute. See my brief discussion earlier regarding the
first option.
The second option is useful for attributes that will have duplicate values repeated across
multiple dimension members. For example, the same account type will apply to multiple accounts. If
the Account Type is modified from "Liabilities" to "Current Liabilities," you might wish to apply
this change to all accounts that have this type.
- Next, the wizard allows you to choose how you want to identify current and expired records. You
have two options: Tag the record as expired (or obsolete, outdated or another adjective of your
choosing) or identify the expired records by adding the date of expiry to such records. If you use
the latter option, you can use one of SSIS global variables to determine the date and time value
for updating the expired dimension record. I prefer using a flag to identify the current record.
All other records are considered expired. However, you can choose from a couple of options.
-
Figure 2. Configuring details of your implementation.
- The screen in Figure 3 (Data Flow Task) allows you to configure support for inferred dimension
members. Inferred members are created when you load a record into the fact table and it has no
corresponding dimension record. For correct analyses, each fact table record must be associated
with a record in a dimension table.
-
There are multiple ways of handling this in a data warehouse. The SCD wizard allows you to
create an inferred member record with all dimension attributes set to NULL. It is often more
appropriate to have a special "unknown" member in your dimension (perhaps with the dimension
surrogate key equal to -999 or another value that stands out) as opposed to creating inferred
members. In this example we're only concerned with type 1 and type 2 SCD and not with the inferred
members. So let's uncheck the default selection and click next.
That is all the information the wizard needs in order to come up with the entire data flow
required to maintain the SCD. As you can see from the screenshot, the wizard does quite a bit of
work for you:
Figure 3. Use the Data Flow Task to configure support for inferred dimension members.
- You can customize the data flow the wizard generated to suit your application's needs. But
first, let's examine the newly created transformations: The OLE DB command transformation updates
the record_expired attribute if AccountType value has been altered. This transformation receives
input from the Derived Column transformation that determines the value to insert into the
record_expired column. If the value in the data source changes, SSIS creates a new record so the
Union All transformation combines this record with the existing data source records. The OLE DB
Command 1 transformation updates the value of AccountDescription – the changing attribute based on
the business key (AccountCodeAlternateKey) with a command similar to the following:
UPDATE [dbo].[DimAccount] SET [AccountDescription] = ?
WHERE [AccountCodeAlternateKey] = ?
Finally, the Insert Destination transformation populates the destination table DimAccount.
- Now, let's update a few members in the transactional database and then run the package to see
if the SCD wizard maintains the changing dimensions correctly. Indeed we can confirm that changing
AccountType attribute value from "assets" to "wonderful assets" triggers the creation of a new
record in the account dimension and tags the newly created record as current:
| |
| AccountCodeAlternateKey |
AccountDescription |
AccountType |
Operator |
record_expired |
| 1120 |
Receivables |
Wonderful Assets |
+ |
Current |
| 1120 |
Receivables |
Assets |
+ |
Null |
Changes in AccountDescription column, on the other hand, do not create a new row – they simply
overwrite the existing value:
| |
| AccountCodeAlternateKey |
AccountDescription |
AccountType |
Operator |
record_expired |
| 10 |
Active Asset |
Assets |
+ |
Null |
As I mentioned at the beginning of this tip, there are other methods for maintaining slowly
changing dimensions in your data warehouse that are not available through the wizard. However, the
ability to implement type 1 and type 2 slowly changing dimensions by virtue of a few clicks within
SSIS certainly could speed up ETL development.
ABOUT THE AUTHOR
Baya Pavliashvili is a database consultant helping his customers develop highly
available and scalable applications with SQL Server and Analysis Services. Throughout his career he
has managed database administrator teams and databases of terabyte caliber. Baya's primary areas of
expertise are performance tuning, replication and data warehousing. He can be reached at baya@bayasqlconsulting.com.
Disclaimer:
Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.
Join the conversationComment
Share
Comments
Results
Contribute to the conversation