Named calculations are additional virtual columns on the tables in your DSV. This allows you to mine derived information in your data without having to change your source data. Anamed calculation consists of name, a SQL expression containing the calculation, and an optional description. The calculation can be any valid SQL expression. If you are not a SQL expert, here are some types of expressions that are useful in many data mining projects.
Figure 3.5 Completed MovieClick data source view
Standard SQL supports +, -, * , /, and % (modulo). For example you could create a Profit named calculation with the following formula:
[Sales Price] – [Item Cost]
Mathematical functions are useful, especially when data in a column does not follow a uniform distribution. The SQL functions ABS, LOG, SIGN, and SQRT are particularly useful. Here are two examples:
- To flatten out an exponentially increasing variable:
- To create a flag for over/under budget quarters:
SIGN([Actual Expenses] – [Budgeted Expenses])
Often, the hypothesis you want to test depends on a variable that is a combination of two of the variables you already have. For example, it may not be interesting that a person is married or has children, but the combination of the two may provide valuable information. A composite expression for this situation could look like this:
[Marital Status] + ' ' + [Has Children]
CASE expressions are an extremely flexible way to create meaningful variables for data mining. The CASE expression allows you to assign results based on the evaluation of one or more conditions. Useful applications of CASE are to change value labels, manually discretize columns, reduce the number of valid states, and convert an attribute from a nested table to the case level.
- To change value labels:
CASE [Category] WHEN 1 THEN 'Food' WHEN 2 THEN 'Beverage' WHEN 3 THEN 'Goods' END CASE
- Manually discretize a column:
CASE WHEN [Age] < 20 THEN 'Under 20' WHEN [Age] <= 30 THEN 'Between 20 and 30' WHEN [Age] <= 40 THEN 'Between 30 and 40' ELSE 'Over 40' END
- To reduce the number of valid states:
CASE [Marital Status] WHEN 'Married' THEN [Marital Status] WHEN 'Never Married' THEN [Marital Status] ELSE 'Other' END
- To convert an attribute from a nested table to a case table:
CASE WHEN EXISTS (SELECT [Movie] FROM [Movies] WHERE [Movie]='Star Wars' AND [Movies].[CustomerID]=[Customers].[CustomerID]) THEN 'True' ELSE 'False' END
This would be done, for instance, when you wanted to convert a nested attribute to a case-level attribute. Note that if you still want to use the nested table in the model, you will have to use a named query to filter the attribute from the nested table, as described in the next section.
Creating a Named Calculation on the Customers Table
To create a named calculation to discretize and reduce the number of states in the Num Bedrooms column:
1. Right-click the Customers table and select Create a named calculation.
2. Enter the calculation name Bedrooms and optionally enter a description.
3. Enter the following expression:
CASE WHEN [Num Bedrooms] = 1 THEN 'One' WHEN [Num Bedrooms] <= 3 THEN 'Two or Three' WHEN [Num Bedrooms] >= 4 THEN 'Four or more' ELSE 'None' END
Upon closing the dialog box, the DSV Designer will validate your expression and return any applicable errors. Once you have successfully created your calculation you can see the results by right-clicking the table and selecting Explore Data.
Click here to return to the complete list of book excerpts from Chapter 3, 'Using SQL Server 2005 data mining,' from the book Data Mining with SQL Server 2005.