Once you have organized, modified, selected, and understand the data you want to analyze, you can start to create data mining objects. The first step is to run the Data Mining Wizard. After completing it, you can refine the results in the Data Mining Designer.
Structures and Models
SQL Server Analysis Services has two major objects that deal with data mining: mining structures and mining models.
A mining structure defines the domain of a mining problem, whereas a mining model is the application of a mining algorithm to the structure in a mining structure. A mining structure contains a list of structure columns that have data and content types, bindings to the data source, plus some optional flags that control how the data is modeled. Additionally, a mining structure contains a list of mining models that use the columns from the structure.
The definition of a mining model contains an algorithm with its associated parameters, plus a list of columns from the mining structure. Each model in a structure can use a different algorithm or the same algorithm with different parameters, and/or a different subset of the columns in the structure. For each column in the model, you can assign how it is to be used in that model and algorithm-specific modeling flags. This feature allows you to easily test different hypotheses on the same data set.
Using the Data Mining Wizard
The Data Mining Wizard creates two objects for you: the mining structure that describes the columns and training data you will use for mining, and a mining model, which takes those columns, applies an algorithm, and defines the usage of each column for that algorithm. The wizard wraps the creation of these two objects into one simple set of steps.
The steps of the wizard are to select your algorithm, select the source tables and specify how they are used, select the columns from those tables and specify how they are used, name the model, and you're done. At that point, you can process and analyze the results of your model without further adieu. Analysis Services makes it that simple to get started. The wizard also allows you to create models from multidimensional, that is, OLAP, sources. This topic is covered in Chapter 11, so we will focus only on relational sources for the time being.
Using the wizard is simple because it performs several steps automatically, based on the input you provide. As a data miner, it is important that you understand these steps and how and when decisions that impact your model are made.
On the first page in the wizard, you choose whether you are creating a model from a relational or multidimensional source, as shown in Figure 3.10. Although in the end a model created from one source appears identical to those created from another, the creation process is slightly different, so there are different wizard paths for each option. Also, a particular mining algorithm may not support creating models from OLAP sources, so this question is asked first.
The next page asks you which algorithm to use to create your initial mining model. The list of algorithms is determined by the capabilities of your target server and may contain more or less than the list of algorithms covered in this book. The reasons for and process by which this occurs are described in Chapter 13. If you cannot connect to a server at the time the wizard is run, you get the default list of algorithms provided with SQL Server Data Mining, as shown in Figure 3.11. Choosing which algorithm you are going to use is dependant on the business problem you are trying to solve. The application of each algorithm is described in its respective chapter.
On the next two pages, you indicate the data you will be mining. You choose the DSV containing the tables, then you specify the actual tables themselves. When choosing the tables, you have to specify whether each table is the case table or if it is a nested table, as shown in Figure 3.12. As described in Chapter 2, the case table is the case that contains the entities you want to analyze, and a nested table contains additional, usually transactional, information about each case.
Tip: Sometimes determining which table is the case table can be a bit confusing. For example, if you want to analyze how products are purchased together, you may naïvely choose products as the case table. However, you are actually analyzing the groups of those products that were purchased by a single customer. In this case, the customer becomes the case with the transaction table containing the product purchases as a nested table.
When you have only a transaction table, the table can be used as both the case table and the nested table by specifying the transaction ID as the case-level key and the other columns as columns in the nested table.
Figure 3.10 Select Method screen of the Data Mining Wizard
Figure 3.11 Select Data Mining Technique page of the Data Mining Wizard
Figure 3.12 Specifying table types in the Data Mining Wizard
On the next two pages, you indicate which columns you are using plus how you want the mining algorithms to interpret each one. First, you specify which columns are used in the model, plus whether they are key, input, and/or predictable.
Then you specify the data and content types for each of the columns. You must specify a key for the case table and each nested table in your model, as shown in Figure 3.13. Remember that the key of a nested table in DMX is not the foreign key that relates the nested table to the case table, rather it is the key in the context that you have it nested. The wizard enforces this relationship by not presenting the foreign key as a choice and warning the user if a key is not specified. For example, a nested table representing a customer's shopping cart comes from a table that may have a row ID as a key, plus transaction ID, product name, quantity, and price. The nested table in our model would only have the product, quantity, and price columns, because the row ID isn't of interest in our model, and the transaction ID is the foreign key to the case table. In this reduced context, you can see that the quantity and price relate to the product, which becomes the key of the nested table. Sequence clustering and time series models have special rules regarding the specification of keys. See Chapters 8 and 10, respectively, for specific details.
TIP: One thing to consider when determining the correct column to be a nested key is that data mining finds patterns by examining similarities and differences between cases. If you chose a column as a nested key such that the values in that column would only show up in a single case, the data mining algorithms would find no patterns relative to that column. This logic summarily dismisses the use of transaction IDs or row IDs as nested keys.
Figure 3.13 Indicating column usages in the Data Mining Wizard, showing the specification of nested keys
Which columns you specify as input and which as predictable depends on your business problem, the hypothesis you are trying to test, and the algorithm you chose. In general, specifying a column as Input indicates that the algorithm will use that column to determine the columns marked as Predictable, or an output. The exact way that each algorithm uses this information varies somewhat, so you should familiarize yourself with the specific semantics detailed in each algorithm chapter. One fact that remains constant among all algorithms is that if you want to be able to select a column from the model in a PREDICTION JOIN statement, the column must be predictable. To predict a nested table, check the box in the Predict column next to its key.
TIP: If you have many columns in your table, it can be difficult to know which to choose as inputs. You can always use all the columns, but this involves additional processing power and, depending on the algorithm, may make your model difficult to interpret.
The Suggest button on the Specify Column Usage page of the wizard performs a quick entropy-based analysis to indicate which columns are likely to provide information toward a selected output, thereby reducing the number of columns in your final model. Note that this feature only considers case-level columns in its analysis and is not a guarantee that the selected columns will impact or that the nonselected columns will not impact your target variable.
Next, you are presented with the list of columns you have chosen and their respective data and content types, as shown in Figure 3.14. Indicating the correct content type is crucial to the performance and accuracy of your model. If you had a field such as Income marked as DISCRETE, for instance, the algorithm would assume that each possible income value was a distinct category and would likely spend extra processing power to learn absolutely nothing. On the flip side, if you had a categorical column where the categories were indicated by integers (for example, 1–Blue, 2–Yellow, 3–Red, 4–Green, and so on) marked as CONTINUOUS, the algorithm would assume that it could compare them and measure distances between points, in this case creating the bizarre logic that Green(4) – Red(3) = Blue(1)! Luckily, the Data Mining Wizard has the ability to automatically detect whether a numeric column is categorical (discrete) or continuous. Clicking the Detect button on this page causes the wizard to sample and analyze the source data and choose an appropriate content type. If a continuous type is determined and your selected algorithm does not support continuous columns, the content type will be specified as DISCRETIZED. You can set discretization parameters in the designer, as specified in the next section. Before moving on with the wizard, you should verify that the content types were assigned correctly and modify any that were not.
The final page of the wizard, shown in Figure 3.15, allows you to specify the names of the structure and model and enable the drill-through feature if it is supported by the algorithm. When completed, the wizard creates a mining structure containing a mining model and launches the Data Mining Designer.
Figure 3.14 Specifying content and data types in the Data Mining Wizard
Figure 3.15 Naming objects in the Data Mining Wizard
Creating the MovieClick Mining Structure and Model
We will use the Mining Model Wizard to create a mining structure and model to predict the number of bedrooms for customers who own their homes, based on all of the information we know about them.
1. Right-click the Mining Models folder in the Solution Explorer or the Object Explorer, and choose New Mining Model.
2. Skip the description page and go to the Select the Definition Method page. As we are building our model from relational data, leave the default option selected and click Next.
3. On the Select the Data Mining Technique page, you can view the list of available algorithms if you are connected to your server. Leave Microsoft Decision Trees selected, and click Next.
4. On the Select Data Source View page, you are presented with the single DSV in this project or database. Click Next to continue.
5. On the Specify Table Types page, we need to indicate our case and nested tables. Since we only want to analyze homeowners, click the check box in the Case column next to the Homeowners named query. We want to add all of the nested tables, so click the check box in the Nested column next to the Actors, Channels, Criteria, Directors, Hobbies, Movies, and Technology tables. Click Next to continue.
6. On the Specify the Training Data page, we indicate which columns we are going to use in our model and how they are to be used. Since the key column for the Homeowners table was indicated in the DSV, it is already selected as the key of the model. We are predicting Bedrooms, so click the check box in the Predictable column next to the named calculation Bedrooms.
7. We want all other case-level columns to be inputs except for Num Bedrooms from which Bedrooms is derived. Check the check box in the Input column next to all case columns except Customer Id and Num Bedrooms. Note — you can use the SHIFT key to check multiple columns at the same time.
8. For each nested table, we need to indicate the nested key. The wizard automatically filters out the foreign key column, so each nested table is left with a single column. Click the check box in the Key column for each column in each nested table. Check the input column as well. Click Next to continue.
9. On the Specify Columns' Content and Data Type page, click the Detect button to automatically assign the correct content types. Click Next to continue.
10. On the completion page of the wizard, type MovieClick as the name of the structure and Movie Trees as the name of the model, then check the Allow Drill through box and click Finish to end the wizard.
Upon completing the wizard, the Data Mining Designer opens to the Mining Structure Editor.
Click here to return to the complete list of book excerpts from Chapter 3, 'Using SQL Server 2005 data mining,' from the book Data Mining with SQL Server 2005.