Using the Mining Accuracy Chart

Once data models are processed, you need to be able to explore, query and compare them. Learn how to use Analysis Services to understand your data mining efforts.

Once you have created and processed your models, you need to be able to explore, query, and compare them so that you can understand and apply the information they provide.

Understanding the Model Viewers

Each algorithm provided with Analysis Services for data mining has its own associated viewers. Each algorithm chapter describes in detail how you can use its viewer to interpret the models you create. However, the viewers have some common functionality that is better described outside the context of a specific algorithm.

The data mining Viewer pane provides a drop-down control that allows you to select which model you want to view. When you select a model, it is loaded into an algorithm-specific viewer. All of the viewers provided allow you to view multiple aspects of your models, which are indicated by tabs on the top of the viewer.

The actual views come in two basic types: diagrams and tables. Each diagram view has the basic zoom and size-to-fit buttons on its embedded toolbar. You can copy a bitmap of the entire diagram, or of just the displayed portion, through the toolbar or the pop-up menu that appears when you click the right mouse button. Additionally, some special mouse handling is available in all diagram views. Rolling the mouse wheel zooms the diagram in and out, and pressing the mouse wheel like a button brings up a mini-navigator, as shown in Figure 3.19, which lets you quickly move to any portion of your view.

Tabular views support a variety of features. Every tabular view supports Copy functionality that copies the table contents in HTML format so that they can be pasted into Word, Excel, FrontPage, or any other application that supports HTML. Headers in many views contain informative tooltips and can be clicked to sort the view by the information in that column. Columns can be resized by dragging the edges between column headers. Some views also support the rearranging of columns by dragging and dropping column headers. You can also view the rowset form of any model by setting the Viewer control to Microsoft Mining Content Viewer. This is the raw data form of the model content. The Viewer control is also used if you install custom visualizations provided by third parties for the algorithms.

Figure 3.19 The mini-navigator window in the Dependency Network view

TIP: If you don't like the colors the viewers use to display graphs, you can always change them. Selecting the Options item from the Tools menu brings up the Visual Studio Options Dialog. Drilling down in the tree control to Business Intelligence Designers/Analysis Services Designers/Data Mining Viewers provides a panel where you can customize the color of pretty much any aspect of any data mining chart.

Changing the color will not affect currently open visualizations. Close the mining viewer and reopen it, or switch to a different model, to notice the change.

Many viewers show statistics about the currently selected item in the Mining Legend. The Mining Legend is a dockable window that automatically appears when a viewer requiring it is displayed.

Using the Mining Accuracy Chart

The Mining Accuracy Chart pane provides tools to help gauge the quality and accuracy of the models you create. The accuracy chart performs predictions against your model and compares the result to data for which you already know the answer. The profit chart performs the same task, but allows you to specify some cost and revenue information to determine the exact point of maximum return. The classification matrix (also known as a confusion matrix) shows you exactly how many times the algorithm predicts results correctly, and what it predicts when it is wrong. In practice, it is better to hold some data aside for testing when you train your models. Testing with the same data that you used to train your models may make the models seem to perform better than they actually do.

To use the accuracy chart, you need to select source tables from your DSV or other data sources and bind them to your mining structure. If the columns from the tables have the same name, this step is done automatically upon table selection. Once you have selected the case and nested tables and performed the binding, you can optionally filter the cases — this can be done when you have a specific column that indicates if a case is for training or testing, or simply to verify how the model performs for certain populations: for example, does the model perform differently for customers over 40? Last, you choose which target you are testing and, optionally, the value you are testing for. By default, the accuracy chart selects the same column and value for each model in the structure. However, you can also test different columns at the same time. This is useful, for instance, if you have different discretizations in different models; you might want to see how well predicting Age with five buckets compares to doing the same with seven buckets.

The type of chart you receive depends on whether the target you chose is continuous or discrete, and whether or not you chose a target value to predict. A discrete target with a specified target value is the most common case, so we will explain that first. When you select a discrete target and specify a target value, you receive a standard lift chart. A standard lift chart always contains a single line for each model you have selected, plus two additional lines: an ideal line and a random line. The coordinates at each point along a line indicate what percentage of the target audience you would capture if you used that model against the specified percentage of the audience.

For example, in Figure 3.20 the top line shows that an ideal model would capture 100% of the target using 36% of the data. This simply implies that 36% of the data indicates the desired target; there is no magic here. The bottom line is the random line, which is always a 45-degree line across the chart. It indicates that if you were to randomly guess the result for each case, you would capture 50% of the target using 50% of the data; again, no magic here. The other lines on the chart represent your mining models. Hopefully, all your models will be above the random line. When a model's line hovers around the random line, there wasn't sufficient information in the training data to learn patterns about the target. In Figure 3.20, both models are about equal, and we can get about 90% of our target using only 50% of our data. In other words, if we had $5,000 for a mailing campaign, each mailing cost $10, and we had 1,000 customers on our list, we could afford to mail to 500 customers, or 50% of the list. Using the model, we would reach 90% of all the possible respondents, whereas we would reach only 50% if we sent the mailings randomly. Note that this does not mean that 90% of the people we send to will respond. We have already determined that only 36% of the population matches the target. Using the model, we will get 90% of that 36%, or 32.4% of the total population, to respond. Randomly guessing would net us only 18% of the total.
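The points on a lift line and the mailing arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Analysis Services computes the chart: the function name lift_curve and the ten-case toy data are hypothetical, and each case is assumed to carry the model's probability that it hits the target.

```python
def lift_curve(probabilities, is_target):
    """Return (share of population, share of target captured) points."""
    # Rank cases from most to least likely to hit the target.
    ranked = sorted(zip(probabilities, is_target),
                    key=lambda pair: pair[0], reverse=True)
    total_targets = sum(1 for _, hit in ranked if hit)
    points = [(0.0, 0.0)]
    captured = 0
    for count, (_, hit) in enumerate(ranked, start=1):
        if hit:
            captured += 1
        points.append((count / len(ranked), captured / total_targets))
    return points


# Hypothetical toy data: ten cases, four of which match the target.
probabilities = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_target = [True, True, False, True, False, True, False, False, False, False]
curve = lift_curve(probabilities, is_target)
print(curve[5])  # point on the line after using the top 50% of cases

# The mailing arithmetic from the text.
budget, cost_per_mailing, customers = 5000, 10, 1000
target_rate = 0.36                      # share of the list matching the target
mailed = budget // cost_per_mailing     # mailings the budget allows
mailed_share = mailed / customers       # share of the list we can reach
model_responders = round(customers * target_rate * 0.90)
random_responders = round(customers * target_rate * mailed_share)
print(mailed, model_responders, random_responders)
```

With the numbers from the text, the budget covers 500 mailings; the model reaches 324 of the 360 possible respondents, while random mailing reaches 180.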

Figure 3.20 Standard lift chart

By changing the chart type to Profit Chart, you get a much better idea about the quality of the model. This chart prompts you to enter the initial cost, cost per item, and revenue per successful return, and plots the profits you will receive using the models you've created. This can help you decide which model to use and how many people to send mail to. For example, from the chart in Figure 3.21, you can tell that if you only had enough money to mail to less than 25% of the population, you should go with the Movie Bayes model. If you had enough to go up to about 37%, the Movie Trees model would be your best bet. Most importantly, it tells you that regardless of how much money you can spend, you will maximize your profits by sending mail to about 50% of the population, using the Movie Bayes model. Additionally, it tells you how to determine the people to send to. Clicking the chart displays a vertical line at that point, and the Mining Legend shows statistics for each model's line there. In this case, the model says that sending a mailing to everyone with a propensity to buy of 10.51% or better using the Movie Bayes model will maximize your profit.
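As a rough sketch of the calculation behind a profit line, the snippet below ranks cases by predicted probability and computes the profit after each additional mailing from the three inputs the chart asks for. The function name profit_curve, the toy data, and the cost figures are assumptions made for illustration; the exact formula the product uses is not given in the text.

```python
def profit_curve(probabilities, is_target, initial_cost,
                 cost_per_item, revenue_per_return):
    """Return (number mailed, probability threshold, profit) rows."""
    # Mail to the most likely respondents first.
    ranked = sorted(zip(probabilities, is_target),
                    key=lambda pair: pair[0], reverse=True)
    rows = []
    responders = 0
    for mailed, (probability, hit) in enumerate(ranked, start=1):
        if hit:
            responders += 1
        profit = (responders * revenue_per_return
                  - initial_cost - mailed * cost_per_item)
        rows.append((mailed, probability, profit))
    return rows


# Hypothetical toy data and costs.
probabilities = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_target = [True, True, False, True, False, True, False, False, False, False]
rows = profit_curve(probabilities, is_target,
                    initial_cost=5, cost_per_item=2, revenue_per_return=10)

# The most profitable row gives the cut-off: mail everyone whose
# probability is at or above this threshold.
best = max(rows, key=lambda row: row[2])
print(best)
```

In this toy run, profit peaks after six mailings, so the cut-off is a probability of 0.4, which plays the same role as the 10.51% propensity threshold read from the Mining Legend.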

Another type of accuracy chart is produced when you select a discrete target variable and you do not specify which value of the target you are looking for. In this case, you get a modified lift chart that looks a bit like an upside-down standard chart. This chart shows the overall performance of the model across all possible target states. In this version, a line coordinate indicates how many guesses you would have gotten correct had you used that model. The ideal line here is at a 45-degree angle, indicating that if you had used 50% of the data, you would have been correct about 50% of the population, or if you had used 100% of the data, you would have been correct all the time. The random guess line is based on the most likely state discovered in the training set. For example, if you were predicting gender and 57% of your training data was female, you could presume that you would get the best results by guessing female for every case. The random guess line will end at the percentage of the testing set that was equal to the most likely state in the training set. That is, for our gender example, if the testing set had the same distribution as our training set, the line would end at 57%, but if only 30% of the testing set was female, the line would end at 30%.
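The end point of the random guess line in the gender example can be checked with a short calculation. This is a minimal sketch, assuming hypothetical training and testing sets of 100 cases each; the function name random_guess_endpoint is made up for illustration.

```python
from collections import Counter


def random_guess_endpoint(training_states, testing_states):
    """Guess the most common training state for every testing case."""
    most_likely_state, _ = Counter(training_states).most_common(1)[0]
    correct = sum(1 for state in testing_states if state == most_likely_state)
    return most_likely_state, correct / len(testing_states)


# Hypothetical data: 57% of the training cases are female.
training = ["Female"] * 57 + ["Male"] * 43

# Testing set with the same distribution as the training set.
same_state, same_share = random_guess_endpoint(training, training)

# Testing set in which only 30% of the cases are female.
testing = ["Female"] * 30 + ["Male"] * 70
skewed_state, skewed_share = random_guess_endpoint(training, testing)

print(same_state, same_share)
print(skewed_state, skewed_share)
```

The guess stays "Female" in both runs because it comes from the training set; only the share of testing cases that it gets right changes.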

The Classification Matrix tab shows you how many times a model made a correct prediction and what answers were given when the answers were wrong. This can be important in cases where there is a cost associated with each wrong decision. For example, if you were to predict which class of member card to assign to a customer based on his or her creditworthiness, it would be less costly to misclassify someone who should have received a bronze card as a normal card than it would be to issue that person a platinum card. This view simply shows you a matrix per model, illustrating counts of each pairwise combination of actual and predicted values.
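To make the idea concrete, here is a small sketch that counts each pairwise combination of actual and predicted values, then weighs the mistakes with a cost table. The card data and the cost figures are invented for illustration; the tab itself shows only the counts.

```python
from collections import Counter


def classification_matrix(actual, predicted):
    """Count each (actual value, predicted value) pair."""
    return Counter(zip(actual, predicted))


# Hypothetical test cases for a member card model.
actual = ["Bronze", "Bronze", "Normal", "Platinum", "Normal", "Bronze"]
predicted = ["Bronze", "Normal", "Normal", "Platinum", "Bronze", "Platinum"]
matrix = classification_matrix(actual, predicted)

# Assumed cost of each kind of mistake; correct predictions cost nothing.
mistake_cost = {
    ("Bronze", "Normal"): 1,     # mild: one class too generous
    ("Bronze", "Platinum"): 10,  # severe: far too generous
    ("Normal", "Bronze"): 2,
}
total_cost = sum(count * mistake_cost.get(pair, 0)
                 for pair, count in matrix.items())

for (actual_value, predicted_value), count in sorted(matrix.items()):
    print(actual_value, predicted_value, count)
print("total cost:", total_cost)
```

Two models with the same number of wrong answers can have very different total costs, which is why the breakdown by pair matters more than a single accuracy figure.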

The last type of accuracy chart is strictly for continuous values. This chart is a scatter plot, comparing actual values versus predicted values for each case. In a perfect model, every point would fall exactly on the 45-degree line, indicating that the predicted values exactly matched the actuals. On any other model, the closer the points fall to the 45-degree line, the better.

Figure 3.22 shows a scatter accuracy plot. You can see that this model performed well for most cases, with only one point that was significantly off. The scatter accuracy plot is automatically displayed instead of the lift chart when a continuous target is selected.
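One way to read such a plot numerically is to measure how far each point sits from the 45-degree line, that is, the difference between the predicted and the actual value. The sketch below uses made-up values; the viewer itself only draws the points.

```python
# Hypothetical (actual, predicted) pairs for a continuous target.
points = [(10, 11), (20, 19), (30, 31), (40, 60), (50, 50)]

# Distance from the 45-degree line, measured vertically:
# zero means the prediction matched the actual value exactly.
errors = [(actual, predicted, abs(predicted - actual))
          for actual, predicted in points]

mean_error = sum(error for _, _, error in errors) / len(errors)
worst = max(errors, key=lambda row: row[2])

print("mean absolute error:", mean_error)
print("point farthest from the line:", worst)
```

Here four of the five points lie within one unit of the line, and the single outlier stands out in the same way as the stray point in a scatter accuracy plot.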

Figure 3.21 Profit chart with legend

Figure 3.22 Scatter accuracy plot

Creating a Lift Chart on MovieClick

Here, we will create a lift chart targeting those customers who go to the theater weekly.

1. Switch to the Mining Accuracy Chart pane by clicking the Mining Accuracy Chart icon.

2. Click Select case table on the Select Input Tables window in the Column Mapping pane.

3. Select the Homeowners table in the dialog box that appears, and click OK.

Note: In practice, you should select a table that has data held out from training. The source table is being used only to exemplify use of the control.

4. In the lower part of the pane, choose the column Theater Freq in the Predictable Column Name column.

5. In the Predict Value column, choose Weekly.

6. Click the Lift Chart tab on the top of the pane to switch to the Chart view. At this point a query is sent to the server and a chart similar to that in Figure 3.20 is displayed.

Note: You may find cases where a model provides significant lift, yet rarely or never classifies your specified target correctly. This is because the standard lift chart doesn't depend on whether the model's predictions are correct. The lift chart sorts the cases by the probability that each one hits the target, from highest to lowest. If the maximum probability for the target in the model is 25%, then the model may never predict the target. The plot of the lift chart consists of the number of targeted cases that were captured by that ordering. Since the result of the lift and profit charts is simply a probability threshold indicating where you should stop considering customers, it doesn't matter whether the final prediction was correct.
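The effect described in this note can be reproduced with a handful of invented cases. In the sketch below, the probability of Weekly never exceeds 25%, so the most likely state is never Weekly, yet ranking the cases by that probability still captures the weekly theatergoers early. The state names other than Weekly and all of the probabilities are assumptions made for illustration.

```python
STATES = ("Weekly", "Monthly", "Rarely")

# Hypothetical cases: (P(Weekly), P(Monthly), P(Rarely), actual state).
cases = [
    (0.25, 0.40, 0.35, "Weekly"),
    (0.22, 0.38, 0.40, "Weekly"),
    (0.20, 0.45, 0.35, "Monthly"),
    (0.10, 0.50, 0.40, "Weekly"),
    (0.08, 0.30, 0.62, "Rarely"),
    (0.05, 0.60, 0.35, "Monthly"),
    (0.04, 0.26, 0.70, "Rarely"),
    (0.02, 0.48, 0.50, "Rarely"),
]


def predicted_state(case):
    """Return the state with the highest probability for this case."""
    probabilities = case[:3]
    return STATES[probabilities.index(max(probabilities))]


# The model never predicts Weekly, because Weekly is never the top state.
predictions = [predicted_state(case) for case in cases]
weekly_predictions = predictions.count("Weekly")

# The lift chart ignores the predicted state and ranks by P(Weekly).
ranked = sorted(cases, key=lambda case: case[0], reverse=True)
top_half = ranked[: len(ranked) // 2]
total_weekly = sum(1 for case in cases if case[3] == "Weekly")
captured = sum(1 for case in top_half if case[3] == "Weekly")
captured_share = captured / total_weekly

print(weekly_predictions, captured_share)
```

With this data the model makes no Weekly predictions at all, but the top half of the ranking contains every weekly customer, which is exactly the kind of lift the chart reports.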

This excerpt is from Chapter 3, 'Using SQL Server 2005 data mining,' of the book Data Mining with SQL Server 2005.
