
Once you have organized, modified, selected, and understand the data you
want to analyze, you can start to create data mining objects. The first step is to
run the Data Mining Wizard. After completing it, you can refine the results in
the Data Mining Designer.
Structures and Models
SQL Server Analysis Services has two major objects that deal with data mining:
mining structures and mining models.
A mining structure defines the domain of a mining problem, whereas a mining
model is the application of a mining algorithm to the structure in a mining
structure. A mining structure contains a list of structure columns that have data
and content types, bindings to the data source, plus some optional flags that
control how the data is modeled. Additionally, a mining structure contains a
list of mining models that use the columns from the structure.
The definition of a mining model contains an algorithm with its associated
parameters, plus a list of columns from the mining structure. Each model in a
structure can use a different algorithm or the same algorithm with different
parameters, and/or a different subset of the columns in the structure. For each
column in the model, you can assign how it is to be used in that model and
algorithm-specific modeling flags. This feature allows you to easily test different
hypotheses on the same data set.
Using the Data Mining Wizard
The Data Mining Wizard creates two objects for you: the mining structure that describes the columns and training data you will use for mining, and a mining model, which takes those columns, applies an algorithm, and defines the usage of each column for that algorithm. The wizard wraps the creation of these two objects into one simple set of steps.
The steps of the wizard are to select your algorithm, select the source tables
and specify how they are used, select the columns from those tables and specify
how they are used, name the model, and you're done. At that point, you can
process and analyze the results of your model without further adieu. Analysis
Services makes it that simple to get started. The wizard also allows you to create
models from multidimensional, that is, OLAP, sources. This topic is covered
in Chapter 11, so we will focus only on relational sources for the time being.
Using the wizard is simple because it performs several steps automatically,
based on the input you provide. As a data miner, it is important that you
understand these steps and how and when decisions that impact your model
are made.
On the first page in the wizard, you choose whether you are creating a model
from a relational or multidimensional source, as shown in Figure 3.10. Although
in the end a model created from one source appears identical to those created
from another, the creation process is slightly different, so there are different wizard
paths for each option. Also, a particular mining algorithm may not support
creating models from OLAP sources, so this question is asked first.
The next page asks you which algorithm to use to create your initial mining
model. The list of algorithms is determined by the capabilities of your target
server and may contain more or less than the list of algorithms covered in this
book. The reasons for and process by which this occurs are described in Chapter
13. If you cannot connect to a server at the time the wizard is run, you get
the default list of algorithms provided with SQL Server Data Mining, as shown
in Figure 3.11. Choosing which algorithm you are going to use is dependant on
the business problem you are trying to solve. The application of each algorithm
is described in its respective chapter.
On the next two pages, you indicate the data you will be mining. You choose
the DSV containing the tables, then you specify the actual tables themselves.
When choosing the tables, you have to specify whether each table is the case table
or if it is a nested table, as shown in Figure 3.12. As described in Chapter 2, the case
table is the case that contains the entities you want to analyze, and a nested table
contains additional, usually transactional, information about each case.
Tip: Sometimes determining which table is the case table can be a bit
confusing. For example, if you want to analyze how products are purchased
together, you may naïvely choose products as the case table. However, you are
actually analyzing the groups of those products that were purchased by a single
customer. In this case, the customer becomes the case with the transaction
table containing the product purchases as a nested table.
When you have only a transaction table, the table can be used as both the case
table and the nested table by specifying the transaction ID as the case-level key
and the other columns as columns in the nested table.

Figure 3.10 Select Method screen of the Data Mining Wizard

Figure 3.11 Select Data Mining Technique page of the Data Mining Wizard

Figure 3.12 Specifying table types in the Data Mining Wizard
On the next two pages, you indicate which columns you are using plus how
you want the mining algorithms to interpret each one. First, you specify which
columns are used in the model, plus whether they are key, input, and/or predictable.
Then you specify the data and content types for each of the columns.
You must specify a key for the case table and each nested table in your
model, as shown in Figure 3.13. Remember that the key of a nested table in
DMX is not the foreign key that relates the nested table to the case table, rather
it is the key in the context that you have it nested. The wizard enforces this
relationship by not presenting the foreign key as a choice and warning the user
if a key is not specified. For example, a nested table representing a customer's
shopping cart comes from a table that may have a row ID as a key, plus transaction
ID, product name, quantity, and price. The nested table in our model
would only have the product, quantity, and price columns, because the row ID
isn't of interest in our model, and the transaction ID is the foreign key to the
case table. In this reduced context, you can see that the quantity and price
relate to the product, which becomes the key of the nested table. Sequence clustering
and time series models have special rules regarding the specification of
keys. See Chapters 8 and 10, respectively, for specific details.
TIP: One thing to consider when determining the correct column to be a
nested key is that data mining finds patterns by examining similarities and
differences between cases. If you chose a column as a nested key such that the
values in that column would only show up in a single case, the data mining
algorithms would find no patterns relative to that column. This logic summarily
dismisses the use of transaction IDs or row IDs as nested keys.

Figure 3.13 Indicating column usages in the Data Mining Wizard, showing the specification of nested keys
Which columns you specify as input and which as predictable depends on
your business problem, the hypothesis you are trying to test, and the algorithm
you chose. In general, specifying a column as Input indicates that the
algorithm will use that column to determine the columns marked as Predictable,
or an output. The exact way that each algorithm uses this information
varies somewhat, so you should familiarize yourself with the specific semantics
detailed in each algorithm chapter. One fact that remains constant among
all algorithms is that if you want to be able to select a column from the model
in a PREDICTION JOIN statement, the column must be predictable. To predict
a nested table, check the box in the Predict column next to its key.
TIP: If you have many columns in your table, it can be difficult to know which
to choose as inputs. You can always use all the columns, but this involves
additional processing power and, depending on the algorithm, may make your
model difficult to interpret.
The Suggest button on the Specify Column Usage page of the wizard performs a
quick entropy-based analysis to indicate which columns are likely to provide
information toward a selected output, thereby reducing the number of columns
in your final model. Note that this feature only considers case-level columns in
its analysis and is not a guarantee that the selected columns will impact or that
the nonselected columns will not impact your target variable.
Next, you are presented with the list of columns you have chosen and their
respective data and content types, as shown in Figure 3.14. Indicating the correct
content type is crucial to the performance and accuracy of your model. If
you had a field such as Income marked as DISCRETE, for instance, the algorithm
would assume that each possible income value was a distinct category
and would likely spend extra processing power to learn absolutely nothing.
On the flip side, if you had a categorical column where the categories were
indicated by integers (for example, 1–Blue, 2–Yellow, 3–Red, 4–Green, and so
on) marked as CONTINUOUS, the algorithm would assume that it could compare
them and measure distances between points, in this case creating the
bizarre logic that Green(4) – Red(3) = Blue(1)! Luckily, the Data Mining Wizard
has the ability to automatically detect whether a numeric column is categorical
(discrete) or continuous. Clicking the Detect button on this page causes the
wizard to sample and analyze the source data and choose an appropriate
content type. If a continuous type is determined and your selected algorithm
does not support continuous columns, the content type will be specified as
DISCRETIZED. You can set discretization parameters in the designer, as specified
in the next section. Before moving on with the wizard, you should verify
that the content types were assigned correctly and modify any that were not.
The final page of the wizard, shown in Figure 3.15, allows you to specify the
names of the structure and model and enable the drill-through feature if it is
supported by the algorithm. When completed, the wizard creates a mining
structure containing a mining model and launches the Data Mining Designer.

Figure 3.14 Specifying content and data types in the Data Mining Wizard

Figure 3.15 Naming objects in the Data Mining Wizard
Creating the MovieClick Mining Structure and Model
We will use the Mining Model Wizard to create a mining structure and model
to predict the number of bedrooms for customers who own their homes, based
on all of the information we know about them.
1. Right-click the Mining Models folder in the Solution Explorer or the
Object Explorer, and choose New Mining Model.
2. Skip the description page and go to the Select the Definition Method
page. As we are building our model from relational data, leave the
default option selected and click Next.
3. On the Select the Data Mining Technique page, you can view the list of
available algorithms if you are connected to your server. Leave Microsoft
Decision Trees selected, and click Next.
4. On the Select Data Source View page, you are presented with the single
DSV in this project or database. Click Next to continue.
5. On the Specify Table Types page, we need to indicate our case and
nested tables. Since we only want to analyze homeowners, click the
check box in the Case column next to the Homeowners named query.
We want to add all of the nested tables, so click the check box in the
Nested column next to the Actors, Channels, Criteria, Directors, Hobbies,
Movies, and Technology tables. Click Next to continue.
6. On the Specify the Training Data page, we indicate which columns we
are going to use in our model and how they are to be used. Since the
key column for the Homeowners table was indicated in the DSV, it is
already selected as the key of the model. We are predicting Bedrooms,
so click the check box in the Predictable column next to the named calculation
Bedrooms.
7. We want all other case-level columns to be inputs except for Num Bedrooms
from which Bedrooms is derived. Check the check box in the
Input column next to all case columns except Customer Id and Num
Bedrooms. Note — you can use the SHIFT key to check multiple
columns at the same time.
8. For each nested table, we need to indicate the nested key. The wizard
automatically filters out the foreign key column, so each nested table is
left with a single column. Click the check box in the Key column for
each column in each nested table. Check the input column as well. Click
Next to continue.
9. On the Specify Columns' Content and Data Type page, click the Detect
button to automatically assign the correct content types. Click Next to
continue.
10. On the completion page of the wizard, type MovieClick as the name of
the structure and Movie Trees as the name of the model, then check the
Allow Drill through box and click Finish to end the wizard.
Upon completing the wizard, the Data Mining Designer opens to the Mining
Structure Editor.
Click here to return to the complete list of book excerpts from Chapter 3, 'Using SQL Server 2005 data mining,' from the book Data Mining with SQL Server 2005.