Manage Learn to apply best practices and optimize your operations.

Creating data mining objects: Data sources

To perform data mining, you must indicate and describe your source data and then create mining structures and models.

After you open your database or project, the operations performed inside an Analysis Services project are similar regardless of the operational mode you chose. To perform data mining, you need to indicate and describe your source data and then create mining structures and models.

Setting Up Your Data Sources

Two objects in Analysis Services act as interfaces to your data, the Data Source and the data source view (DSV). The Data Source is essentially a connection string indicating data location, whereas the DSV is an abstraction layer that enables you to modify the way you look at data sources, or even define a schema and switch the actual source at a later time. In this section, we talk about how to set up these objects for use in data mining.

Data Source

A data source is a rather simple object. It consists of nothing more than a connection string, plus some additional information indicating how to connect. However two aspects of data sources can easily trip you up and cause general frustration if you don't understand them. Both issues will manifest by allowing you to create objects on the client that will fail to work properly when deployed.

The first issue is data location. Unlike most data mining products, SQL Server Data Mining is a server based solution. This means that when you set up your data sources, the data source has to be accessible not only to the client where you used the tools to build the model, but also to the server where the model will be processed. For example, if you built your model based on an Access database in C:My DocumentsNorthwind.mdb, your model would fail to process unless the file was located on the same place on the server. Even if the file happened to be present, this would be a bad thing, as you would have no way of knowing if such a file actually contained the same data or even the same schema as your local version. In general, when mining on local data, you should move the data to a SQL Server database using SQL Server Integration Services (SSIS) before building your models. The second issue is security. It is important to understand the user credentials that are used to access data from Analysis Services. When you set up your connection, you generally choose to use integrated security, that is, use account credentials to access the data, or simple security, using a specified username and password. Microsoft recommends always using integrated security if supported by the source database. Regardless of the method the source database supports, a data source object contains additional information telling Analysis Services exactly which security credentials to use, or impersonate, when connecting to the database. A data source object can be created with four different impersonation options:

1. Impersonate Current User

This method in general is the most secure for data sources accessed through query statements. It causes Analysis Services to use the current user credentials to access the remote data. However, in many implementations, Analysis Services has to delegate credentials to the source database machine. For delegation to work correctly it must be configured by your domain administrators. Consult the Windows Server documentation regarding Kerberos for more information on this topic. This method cannot be used in data sources that are used for processing Analysis Services objects, because it can result in different data being accessed, depending on who processes an object.

2. Impersonate Account

When Impersonate Current User isn't an option, Impersonate Account is the second best choice. Impersonate Account allows you to specify the account credentials that will be used to access the data source. The credentials, consisting of a username and password, are stored with Analysis Services, and all access to that data uses those credentials. This method is the most secure if delegation is not an option.

3. Impersonate Service Account

Impersonate Service Account causes all data access to occur under the account which Analysis Services itself is running. This method exists mainly for testing purposes and is discouraged for production use. In general, Analysis Services should be running under an account with the most limited privileges as possible.

4. Default

Default causes different credentials to be used, depending on how the data source is accessed. If accessed for processing, the service account credentials will be used to access data. If accessed for querying, the current user credentials will be used, if possible.

Creating the MovieClick Data Source

To create a data source to the MovieClick database:

1. Download the MovieClick database and install it as described in Appendix A.
2. Right-click the Data Sources folder in the Solution or Object Explorer and select New Data Source to launch the Data Source Wizard.
3. Skip the introductory page and click the New Connection button on the second page to launch the Data Link dialog box.
4. Enter the Server name where you installed the database, specify your security settings, and select the MovieClick database.
5. Click OK to exit the Data Link dialog box; click Next to advance the wizard to the Impersonation Information page.
6. Enter the username and password you want Analysis Services to use to access the MovieClick database and click Finish to close the wizard.

Using the data source view

The data source view (DSV) is an abstract client-side view of your data. This is where your modeling begins. The DSV is where you select, organize, explore, and, in a sense, manipulate the data in the source. In essence, the DSV tells Analysis Services how you want to see the data on the source. Since the object exists on your Analysis Services server and not your relational source, you can perform such manipulations even if you only have read access to the relational server.

When creating a DSV for data mining purposes, the most important table to identify is your case table. This is the table that contains the cases you want to analyze. Additionally, you need to bring in any related tables, such as nested or lookup tables, which provide additional information about your cases.

Creating the MovieClick Data Source View

To create a DSV from the MovieClick data source:

1. Right-click the Data Source Views folder in the Solution or Object Explorer and select New Data Source View to launch the Data Source View Wizard.
2. Skip the introductory page and the MovieClick data source on the Select Data Source wizard page will be selected; click Next to continue.
3. The Name Matching page performs no function for the data source as imported, so click Next to continue. This page appears only when relationships are not specified in the source database and allows you to choose how to automatically create relationships based on the key columns of the tables. Since the MovieClick database, as imported, has neither relationships nor keys, the algorithm it uses does not work.
4. On the Select Tables and Views wizard page click the >> button to move all tables from the Available Objects list to the Included objects list.
5. Click Next and then Finish to exit the DSV Wizard and display the DSV Designer.
6. Drag a relationship to the CustomerID column in the Customers table from the CustomerID table in each of the other tables. After each connection you can right-click in the designer and select Arrange Tables to make the tables easier to see. When finished, your DSV should look like Figure 3.5.

The DSV Designer initially displays a diagram of the tables in your data source and the relationships between them. If you already know your data, and it is in the proper shape for mining, you can begin creating your mining models at this point. However, if this is not true, which is generally the case, you can use the DSV Designer to explore the data and alter it to the shape you need for your models. Initially, you can simply annotate the tables and columns in your schema to make them easier to understand and more supportable. You can add descriptions to these objects and even change the names to be something more readable and understandable. For example, if you had a table named NWSFYO3 you could rename it to Fiscal Year 03 Northwest Sales or rename column 014 to Quantity. You do this by selecting the object and typing the description and the friendly name in the Properties window. Additionally, any relationships between tables that were not specified in the source database can be indicated here simply by dragging from the foreign key column in one table to the primary key column in another. All of these modifications occur only in the DSV and do not affect your original data in any way.

Note: If the relationship between a case table and a nested table does not exist in the database, you must specify it in the DSV or you will not be able add the nested table to your model.

On top of annotating your schema, the DSV allows you to create named calculations, Named Queries, and views and interactively explore your data using pivot charts.

Click here to return to the complete list of book excerpts from Chapter 3, 'Using SQL Server 2005 data mining,' from the book Data Mining with SQL Server 2005.

Dig Deeper on SQL Server Business Intelligence Strategies

Start the conversation

Send me notifications when other members comment.

Please create a username to comment.