Once you have created and processed your models, you need to be able to
explore, query, and compare them so that you can understand and apply the
information they provide.
Understanding the Model Viewers
Each algorithm provided with Analysis Services for data mining has its own
associated viewers. Each algorithm chapter explains in detail how you can use
its viewer to interpret the models you create. However,
the viewers have some common functionality that is better described outside
the context of a specific algorithm.
The data mining Viewer pane provides a drop-down control that allows you
to select which model you want to view. When you select a model, it is loaded
into an algorithm-specific viewer. All of the viewers provided allow you to
view multiple aspects of your models, which are indicated by tabs on the top
of the viewer.
The actual views come in two basic types: diagrams and tables. Each diagram
view has the basic zoom and size-to-fit buttons on its embedded toolbar.
Copying a bitmap of the entire diagram or just the displayed portion is
supported through the toolbar or the pop-up menu that appears when you
click the right mouse button. Additionally, there are some special mouse handling
abilities available for all diagram views. Rolling the mouse wheel will
cause the diagram to zoom in and out, and pressing the mouse wheel like a
button will bring up a mini-navigator, as shown in Figure 3.19; this will allow
you to quickly and easily move to any portion of your view.
Tabular views support a variety of features. Every tabular view supports
Copy functionality that copies the table contents in HTML format so that they
can be pasted into Word, Excel, FrontPage, or any other application that supports
HTML. Headers in many views contain informative tooltips and can be
clicked to sort the view by the information in that column. Columns can be
resized by dragging the edges between column headers. Some views also support
the rearranging of columns by dragging and dropping column headers.
You can also view the rowset form of any model by setting the Viewer control
to Microsoft Mining Content Viewer. This is the raw data form of the
model content. The Viewer control is also used if you install custom visualizations
provided by third parties for the algorithms.

Figure 3.19 The mini-navigator window in the Dependency Network view
TIP: If you don't like the colors the viewers use to display graphs, you can
always change them. Selecting the Options item from the Tools menu brings up
the Visual Studio Options Dialog. Drilling down in the tree control to Business
Intelligence Designers/Analysis Services Designers/Data Mining Viewers
provides a panel where you can customize the color of pretty much any aspect
of any data mining chart.
Changing the color will not affect currently open visualizations. Close the mining
viewer and reopen it, or switch to a different model, to notice the change.
Many viewers show statistics about the currently selected item in the Mining
Legend. The Mining Legend is a dockable window that automatically
appears when a viewer requiring it is displayed.
Using the Mining Accuracy Chart
The Mining Accuracy Chart pane provides tools to help gauge the quality and
accuracy of the models you create. The accuracy chart performs predictions
against your model and compares the result to data for which you already
know the answer. The profit chart performs the same task, but allows you to
specify some cost and revenue information to determine the exact point of
maximum return. The classification matrix (also known as a confusion matrix)
shows you exactly how many times the algorithm predicts results correctly,
and what it predicts when it is wrong. In practice, it is better to hold some
data aside when you train your models, to use for testing. Using the same data
for testing that you trained your models with may make the model seem to
perform better than it actually does.
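As a rough illustration of holding data aside, the following Python sketch splits a list of cases into a training set and a testing set. It is not part of Analysis Services; the function name split_holdout, the 30% test fraction, and the fixed seed are assumptions chosen only for this example.

```python
import random

def split_holdout(cases, test_fraction=0.3, seed=42):
    """Shuffle the cases and hold back a fraction of them for testing."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (training, testing)

cases = list(range(1000))                      # stand-in for 1,000 customer rows
training, testing = split_holdout(cases)
print(len(training), len(testing))
```

The models would then be trained on the first list only, and the accuracy chart would be pointed at the second.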
To use the accuracy chart, you need to select source tables from your DSV or
other data sources and bind them to your mining structure. If the columns
from the tables have the same name, this step is done automatically upon table
selection. Once you have selected the case and nested tables and performed
the binding, you can optionally filter the cases — this can be done when you
have a specific column that indicates if a case is for training or testing, or simply
to verify how the model performs for certain populations: for example,
does the model perform differently for customers over 40? Last, you choose
which target you are testing and, optionally, the value you are testing for. By
default, the accuracy chart selects the same column and value for each model
in the structure. However, you can also test different columns at the same time.
This is useful, for instance, if you have different discretizations in different
models; you might want to see how well predicting Age with five buckets
compares to doing the same with seven buckets.
The type of chart you receive depends on whether the target you chose is
continuous or discrete, and whether or not you chose a target value to predict.
The latter case is the most common, so we will explain that first. When you
select a discrete target and specify a target value, you receive a standard lift
chart. A standard lift chart always contains a single line for each model you
have selected, plus two additional lines, an ideal line and a random line. The
coordinates at each point along the line indicate what percentage of the target
audience you would capture if you used that model against the specified percentage
of the audience.
For example, in Figure 3.20 the top line shows that an ideal model would capture
100% of the target using 36% of the data. This simply implies that 36% of
the data indicates the desired target — there is no magic here. The bottom
line is the random line. The random line is always a 45-degree line across the
chart. This indicates that if you were to randomly guess the result for each
case, you would capture 50% of the target using 50% of the data — again, no
magic here. The other lines on the chart represent your mining models. Hopefully,
all your models will be above the random line. When you see a model's
line hovering around the random guess line, this means that there wasn't sufficient
information in the training data to learn patterns about the target. In the
chart in Figure 3.20, both models are about equal, and we can get about 90%
of our target using only 50% of our data. In other words, if we had $5,000 to
hold a mailing campaign, each mailing cost $10, and we had 1,000 customers
on our list, we could afford 500 mailings, or 50% of the list. Using the model,
we could get 90% of all the possible respondents, whereas we would get only
50% if we randomly sent the mailings. Note that
this does not mean that 90% of the people we send to will respond. We have
already determined that only 36% of the population matches the target. Using
the model, we will get 90% of that 36%, or 32.4% of the total population to
respond. Randomly guessing would net us only 18% of the total.
Figure 3.20 Standard lift chart
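To make the lift chart coordinates concrete, here is a minimal Python sketch of how such a curve could be computed from scored test cases. It is our own illustration, not the code the accuracy chart runs; the function name lift_curve and the eight sample cases are hypothetical. The second half simply repeats the mailing arithmetic from the text.

```python
def lift_curve(scored_cases, target):
    """Return (percent of population, percent of target captured) points.

    scored_cases: list of (probability_of_target, actual_value) pairs.
    """
    # Consider the cases the model thinks are most likely to hit the target first.
    ranked = sorted(scored_cases, key=lambda c: c[0], reverse=True)
    total_targets = sum(1 for _, actual in ranked if actual == target)
    points, captured = [(0.0, 0.0)], 0
    for i, (_, actual) in enumerate(ranked, start=1):
        if actual == target:
            captured += 1
        points.append((100.0 * i / len(ranked), 100.0 * captured / total_targets))
    return points

# Hypothetical scored test cases: (probability of "Weekly", actual value).
scored = [(0.9, "Weekly"), (0.8, "Weekly"), (0.7, "Monthly"), (0.6, "Weekly"),
          (0.4, "Monthly"), (0.3, "Monthly"), (0.2, "Weekly"), (0.1, "Monthly")]
points = lift_curve(scored, "Weekly")
print(points[4])    # the point reached after using half of the population

# The mailing example from the text.
budget, cost_per_mailing, customers = 5000, 10, 1000
mailed_share = (budget / cost_per_mailing) / customers   # share of the list we can afford
target_rate, model_capture = 0.36, 0.90
with_model = model_capture * target_rate    # share of the population that responds
at_random = mailed_share * target_rate
print(round(with_model, 3), round(at_random, 3))
```

With this sample, half of the population captures three of the four weekly moviegoers, which is the kind of point the model lines on the chart plot.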
By changing the chart type to Profit Chart, you get a much better idea about
the quality of the model. This chart prompts you to enter the initial cost, cost
per item, and revenue per successful return, and plots a chart of the profits you
will receive, using the models you've created. This can help you decide which
model to use and how many people to send mail to. For example, from the
chart in Figure 3.21, you can tell that if you only had enough money to mail to
less than 25% of the population, you should go with the Movie Bayes model. If
you had enough to go up to about 37%, the Movie Trees model would be your
best bet. Most importantly, it tells you that regardless of how much money you
can spend, you will maximize your profits by sending mail to about 50% of the
population, using the Movie Bayes model. Additionally, it tells you how to
determine the people to send to. Clicking on the chart causes a vertical line to
appear, and statistics about each model's line at that point are displayed in the Mining Legend.
In this case, the model says that sending a mailing to everyone with a
propensity to buy of 10.51% or better using the Movie Bayes model will maximize
your profit.
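The profit calculation can be sketched in a few lines of Python. This is an assumption about the general shape of the computation (revenue from captured respondents, minus the initial cost and the per-item cost), not the exact formula the Profit Chart uses; the function name profit_curve, the sample cases, and the cost figures are hypothetical.

```python
def profit_curve(scored_cases, target, initial_cost, cost_per_item, revenue_per_return):
    """Return (number mailed, probability threshold, profit) for each cutoff."""
    ranked = sorted(scored_cases, key=lambda c: c[0], reverse=True)
    points, responders = [], 0
    for mailed, (probability, actual) in enumerate(ranked, start=1):
        if actual == target:
            responders += 1
        profit = revenue_per_return * responders - initial_cost - cost_per_item * mailed
        points.append((mailed, probability, profit))
    return points

# Hypothetical scored test cases: (probability of "Weekly", actual value).
scored = [(0.9, "Weekly"), (0.8, "Weekly"), (0.7, "Monthly"), (0.6, "Weekly"),
          (0.4, "Monthly"), (0.3, "Monthly"), (0.2, "Weekly"), (0.1, "Monthly")]
points = profit_curve(scored, "Weekly", initial_cost=5, cost_per_item=10,
                      revenue_per_return=25)

# The peak of the curve gives both how many to mail and the probability cutoff.
best = max(points, key=lambda p: p[2])
print(best)
```

The probability stored at the peak plays the same role as the propensity threshold read from the Mining Legend: mail to everyone scored at or above it.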
Another type of accuracy chart is produced when you select a discrete target
variable and you do not specify which value of the target you are looking
for. In this case, you get a modified lift chart that looks a bit like an upside-down
standard chart. This chart shows the overall performance of the model
across all possible target states. In this version, a line coordinate indicates how
many guesses you would have gotten correct had you used that model. The
ideal line here is at a 45-degree angle, indicating that if you had used 50% of
the data, you would have been correct about 50% of the population, or if you
had used 100% of the data, you would have been correct all the time. The random
guess line is based on the most likely state discovered in the training set.
For example, if you were predicting gender and 57% of your training data was
female, you could presume that you would get the best results by guessing
female for every case. The random guess line will end at the percentage of the
target that was equal to the most likely state in the training set. That is, for our
gender example, if the testing set had the same distribution as our training set,
the line would end at 57%, but if only 30% of the testing set was female, the
line would end at 30%.
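The end point of the random guess line can be reproduced with a short Python sketch. It assumes, as the text describes, that the baseline always guesses the most common state seen in training; the function name random_guess_endpoint and the sample data are hypothetical.

```python
from collections import Counter

def random_guess_endpoint(training_values, testing_values):
    """Guess the most common training state for every test case.

    Returns that state and the percentage of test cases it gets right.
    """
    most_likely_state, _ = Counter(training_values).most_common(1)[0]
    hits = sum(1 for value in testing_values if value == most_likely_state)
    return most_likely_state, 100.0 * hits / len(testing_values)

training = ["Female"] * 57 + ["Male"] * 43

same_mix = ["Female"] * 57 + ["Male"] * 43     # testing set distributed like training
skewed   = ["Female"] * 30 + ["Male"] * 70     # testing set with only 30% female

print(random_guess_endpoint(training, same_mix))
print(random_guess_endpoint(training, skewed))
```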
The Classification Matrix tab shows you how many times a model made a
correct prediction and what answers were given when the answers were wrong.
This can be important in cases where there is a cost associated with each wrong
decision. For example, if you were to predict which class of member card to
assign to a customer based on his or her creditworthiness, it would be less costly
to mistakenly issue a normal card to someone who should have received a bronze
card than it would be to issue that person a platinum card. This view simply
shows you a matrix per model, illustrating counts of each pairwise combination of
actual and predicted values.
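A classification matrix is just a count of (actual, predicted) pairs, which the following Python sketch makes explicit. The function name classification_matrix and the six sample cases are hypothetical and are used only to show the structure.

```python
from collections import Counter

def classification_matrix(actual_values, predicted_values):
    """Count each (actual, predicted) pair; each pair is one cell of the matrix."""
    return Counter(zip(actual_values, predicted_values))

actual    = ["Bronze", "Bronze", "Normal", "Normal", "Platinum", "Bronze"]
predicted = ["Bronze", "Normal", "Normal", "Normal", "Platinum", "Platinum"]

matrix = classification_matrix(actual, predicted)
correct = sum(n for (a, p), n in matrix.items() if a == p)   # the diagonal cells

states = sorted(set(actual) | set(predicted))
for a in states:                      # one row per actual value
    print(a, [matrix[(a, p)] for p in states])
print("correct:", correct, "of", len(actual))
```

The off-diagonal cells are where the costs differ: here one bronze customer was predicted as Normal and another as Platinum, and the second mistake would be the expensive one.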
The last type of accuracy chart is strictly for continuous values. This chart is
a scatter plot, comparing actual values versus predicted values for each case.
In a perfect model, each point would end up on a perfect 45-degree line, indicating
that the predicted values exactly matched the actuals. On any other
model, the closer the points fall to the 45-degree line, the better.
Figure 3.22 shows a scatter accuracy plot. You can see that this model performed
well for most cases, with only one point that was significantly off. The
scatter accuracy plot is automatically displayed instead of the lift chart when a
continuous target is selected.
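One way to put a number on how close the points fall to the 45-degree line is to look at the difference between each predicted and actual value. The Python sketch below does this; the mean absolute error is our own choice of summary and is not something the scatter plot itself reports, and the sample values are hypothetical.

```python
def residuals(actuals, predictions):
    """Signed distance of each prediction from its actual value."""
    return [predicted - actual for actual, predicted in zip(actuals, predictions)]

actuals     = [20, 35, 50, 65]
predictions = [22, 33, 50, 90]

errors = residuals(actuals, predictions)
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)

# Index of the case that sits farthest from the 45-degree line.
worst_case = max(range(len(errors)), key=lambda i: abs(errors[i]))
print(errors, mean_absolute_error, worst_case)
```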

Figure 3.21 Profit chart with legend

Figure 3.22 Scatter accuracy plot
Creating a Lift Chart on MovieClick
Here, we will create a lift chart targeting those customers who go to the theater
weekly.
1. Switch to the Mining Accuracy Chart pane by clicking the Mining
Accuracy Chart icon.
2. Click Select case table on the Select Input Tables window in the Column
Mapping pane.
3. Select the Homeowners table in the dialog box that appears, and click OK.
Note: In practice, you should select a table that has data held out from
training. The source table is being used only to exemplify use of the control.
4. In the lower part of the pane, choose the column Theater Freq in the
Predictable Column Name column.
5. In the Predict Value column, choose Weekly.
6. Click the Lift Chart tab on the top of the pane to switch to the Chart view.
At this point a query is sent to the server and a chart similar to that in Figure
3.20 is displayed.
Note: You may find cases where a model provides significant lift, yet rarely or
possibly never classifies your specified target correctly. This is because the
standard lift chart doesn't actually care if the model predicts correctly. The lift
chart sorts the predictions by the highest probability that the prediction hits the
target. If the maximum probability for the target in the model is 25%, then the
model may never actually predict the target. The plot of the lift chart consists of
the number of targeted cases that were captured by that ordering. Since the
result of the lift and profit chart is simply a probability threshold indicating
where you should stop considering customers, it doesn't actually matter if the
final prediction was actually correct.
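The following Python sketch illustrates the situation in the note with made-up numbers. It assumes a two-state target in which the model predicts the target only when its probability exceeds 50%; the sample cases are hypothetical.

```python
# Hypothetical scored test cases: (probability of "Weekly", actual value).
# The probability of the target never rises above 25%.
scored = [(0.25, "Weekly"), (0.22, "Weekly"), (0.20, "Monthly"), (0.15, "Weekly"),
          (0.10, "Monthly"), (0.08, "Monthly"), (0.05, "Monthly"), (0.03, "Monthly")]
target = "Weekly"

# With two states, the model predicts the target only above 50% probability,
# so here it never predicts "Weekly" at all.
predicted_as_target = sum(1 for probability, _ in scored if probability > 0.5)

# The lift chart ignores the predicted state and only ranks by probability.
ranked = sorted(scored, key=lambda c: c[0], reverse=True)
top_half = ranked[: len(ranked) // 2]
captured = sum(1 for _, actual in top_half if actual == target)
total = sum(1 for _, actual in scored if actual == target)

print(predicted_as_target, captured, total)
```

The model never predicts Weekly, yet the top half of the ranking contains every weekly moviegoer, so the lift chart would still show strong lift.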
This section is excerpted from Chapter 3, 'Using SQL Server 2005 data mining,' of the book Data Mining with SQL Server 2005.