Example of Model Comparison
This section provides an example of using the Model Comparison platform. The example uses demographic data to build a model for median home price. A regression model and a bootstrap forest model are compared.
Begin by selecting Help > Sample Data Library and opening Boston Housing.jmp.
Create a Validation Column
1. Create a column called validation.
2. On the Column Info window, select Random from the Initialize Data list.
3. Select the Random Indicator radio button.
4. Click OK.
The rows assigned a 0 are the training set. The rows assigned a 1 are the validation set.
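If you prefer to script this step, the following JSL sketch creates a comparable column. It assumes a fixed 75/25 training/validation split made with Random Uniform; the Random Indicator initializer may use different proportions, so the assignments are not identical.

// JSL sketch: open the sample table and add a random validation column.
// Assumes a 75/25 split; 0 = training, 1 = validation.
dt = Open( "$SAMPLE_DATA/Boston Housing.jmp" );
dt << New Column( "validation",
	Numeric,
	"Nominal",
	Formula( If( Random Uniform() < 0.75, 0, 1 ) )
);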
Create the Regression Model and Save the Prediction Formula to a Column
1. Select Analyze > Fit Model.
2. Select mvalue and click Y.
3. Select the other columns (except validation) and click Add.
4. Select Stepwise in the Personality list.
5. Select validation and click Validation.
6. Click the Run button.
7. Select P-value Threshold from the Stopping Rule list.
8. Click the Go button.
9. Click the Run Model button.
The Fit Group report appears, a portion of which is shown in Figure 10.2.
10. Save the prediction formula to a column by selecting Save Columns > Prediction Formula on the Response red triangle menu.
Figure 10.2 Fit Model Report
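If you are scripting, the following JSL sketch launches the same stepwise fit. The column names are those in Boston Housing.jmp; the Validation role and the stepwise messages inside Run are assumptions based on the launch dialog, so verify them against a script saved from your own report.

// JSL sketch: stepwise regression for mvalue (message names assumed
// to mirror the launch dialog and control panel).
fm = dt << Fit Model(
	Y( :mvalue ),
	Effects( :crim, :zn, :indus, :chas, :nox, :rooms, :age,
		:distance, :radial, :tax, :pt, :b, :lstat ),
	Validation( :validation ),
	Personality( "Stepwise" ),
	Run(
		Stopping Rule( "P-value Threshold" ),
		Go
	)
);
// Click Run Model in the resulting control panel, then save the
// prediction formula (Save Columns > Prediction Formula).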
Create the Bootstrap Forest Model and Save the Prediction Formula to a Column
1. Select Analyze > Predictive Modeling > Partition.
2. Select mvalue and click Y, Response.
3. Select the other columns (except validation) and click X, Factor.
4. Select validation and click Validation.
5. Select Bootstrap Forest in the Method list.
6. Click OK.
7. Select the Early Stopping check box.
8. Select the Multiple Fits over number of terms check box.
9. Click OK.
The Bootstrap Forest report appears, a portion of which is shown in Figure 10.3.
10. Save the prediction formula to a column by selecting Save Columns > Save Prediction Formula on the Bootstrap Forest red triangle menu.
Figure 10.3 Bootstrap Forest Model
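A corresponding JSL sketch for the bootstrap forest follows. It launches the Bootstrap Forest platform directly and leaves the Early Stopping and Multiple Fits settings to the dialog, because their JSL option names vary by JMP version; the commented save message is likewise an assumption.

// JSL sketch: bootstrap forest for mvalue (JMP Pro).
bf = dt << Bootstrap Forest(
	Y( :mvalue ),
	X( :crim, :zn, :indus, :chas, :nox, :rooms, :age,
		:distance, :radial, :tax, :pt, :b, :lstat ),
	Validation( :validation )
);
// bf << Save Prediction Formula;  // message name assumed from the menu item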
Compare the Models
1. Select Analyze > Predictive Modeling > Model Comparison.
2. Select the two prediction formula columns and click Y, Predictors.
3. Select validation and click Group.
4. Click OK.
The Model Comparison report appears (Figure 10.4).
Note: Your results will differ from those shown here because of the random assignment of rows to the training and validation sets.
Figure 10.4 Model Comparison Report
The rows in the training set were used to build the models, so the RSquare statistics for Validation=0 might be artificially inflated. In this case, the statistics are not representative of the models’ future predictive ability. This is especially true for the bootstrap forest model.
Compare the models using the statistics for Validation=1. In this case, the bootstrap forest model predicts better than the regression model.
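The comparison itself can also be scripted. In the following sketch, the two column names are assumptions based on the default names that Fit Model and Bootstrap Forest give their saved formulas; substitute the names that actually appear in your data table.

// JSL sketch: compare the saved prediction columns by validation group.
// Column names assumed; check your data table.
mc = dt << Model Comparison(
	Y( :Name( "Pred Formula mvalue" ), :Name( "mvalue Predictor" ) ),
	Group( :validation )
);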
Related Information 
The Model Specification chapter in the Fitting Linear Models book
Launch the Model Comparison Platform
To launch the Model Comparison platform, select Analyze > Predictive Modeling > Model Comparison.
Figure 10.5 The Model Comparison Launch Window
Y, Predictors
The columns that contain the predictions for the models that you want to compare. They can be either formula columns or just data columns. Prediction formula columns created by JMP platforms have either the Predicting or Response Probability column property. If you specify a column that does not contain one of these properties, the platform prompts you to specify which column is being predicted by the specified Y column.
For a categorical response with k levels, most model fitting platforms save k columns to the data table, each predicting the probability for a level. All k columns need to be specified as Y, Predictors. For platforms that do not save k columns of probabilities, the column containing the predicted response level can be specified as a Y, Predictors column.
If you do not specify any Y, Predictors columns, JMP uses the prediction formula columns in the data table that have either the Predicting or Response Probability column property.
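If you built a prediction column by hand, you can attach the Predicting column property yourself so that the platform links the column to its response. The one-line JSL sketch below uses a hypothetical column named My Prediction, and the property value shown is an assumption; inspect a column saved by a JMP platform to confirm the exact format.

// JSL sketch: tag a hand-built column (hypothetical name) as predicting mvalue.
// Property payload format assumed; verify against a platform-saved column.
Column( dt, "My Prediction" ) << Set Property( "Predicting", {:mvalue} );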
Group
The column that separates the data into groups, which are fit separately.
The other role buttons are common among JMP platforms. See the Get Started chapter in the Using JMP book for details.
The Model Comparison Report
Figure 10.6 shows an example of the initial Model Comparison report for a continuous response.
Figure 10.6 Initial Model Comparison Report
The Predictors report shows all responses and all models being compared for each response. The fitting platform that created the predictor column is also listed.
The Measures of Fit report shows measures of fit for each model. The columns are different for continuous and categorical responses.
Measures of Fit for Continuous Responses
RSquare
The r-squared statistic. If the data table contains no missing values, the r-squared statistics in the Model Comparison report match those in the original model reports. If there are missing values, the r-squared statistics differ.
RASE
The square root of the mean squared prediction error. This is computed as follows: square and sum the prediction errors (the differences between the actual and predicted responses) to obtain the SSE, and let n denote the number of observations. Then:
$\mathrm{RASE} = \sqrt{SSE/n}$
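For example, with n = 4 and prediction errors of 1, −2, 0, and 3, the SSE is 1 + 4 + 0 + 9 = 14, so RASE = √(14/4) ≈ 1.87.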
AAE
The average absolute error.
Freq
The column that contains frequency counts for each row.
Measures of Fit for Categorical Responses
Entropy RSquare
One minus the ratio of the negative log-likelihoods from the fitted model and the constant probability model. It ranges from 0 to 1.
Generalized RSquare
A measure that can be applied to general regression models. It is based on the likelihood function L and is scaled to have a maximum value of 1. The value is 1 for a perfect model, and 0 for a model no better than a constant model. The Generalized RSquare measure simplifies to the traditional RSquare for continuous normal responses in the standard least squares setting. Generalized RSquare is also known as the Nagelkerke or Cragg and Uhler R2, which is a normalized version of Cox and Snell’s pseudo R2. See Nagelkerke (1991).
Mean -Log p
The average of -log(p), where p is the fitted probability associated with the event that occurred.
RMSE
The root mean square error, adjusted for degrees of freedom. For categorical responses, the differences are between 1 and p (the fitted probability for the response level that actually occurred).
Mean Abs Dev
The average of the absolute values of the differences between the response and the predicted response. For categorical responses, the differences are between 1 and p (the fitted probability for the response level that actually occurred).
Misclassification Rate
The rate for which the response category with the highest fitted probability is not the observed category.
N
The number of observations.
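As a worked example of several of these measures, consider a row whose observed response level has fitted probability p = 0.8. That row contributes −log(0.8) ≈ 0.223 to Mean -Log p and |1 − 0.8| = 0.2 to Mean Abs Dev, and it counts toward the Misclassification Rate only if some other level has a fitted probability greater than 0.8.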
Related Information 
“Training and Validation Measures of Fit” in the “Neural Networks” chapter provides more information about measures of fit for categorical responses.
Model Comparison Platform Options
Some options in the Model Comparison red triangle menu depend on your data.
Continuous and Categorical Responses
Model Averaging
Makes a new column containing the arithmetic mean of the predicted values (for continuous responses) or the predicted probabilities (for categorical responses).
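For example, if two models predict 24.1 and 26.3 for a given row of a continuous response, the model averaging column contains (24.1 + 26.3)/2 = 25.2 for that row.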
Continuous Responses
Plot Actual by Predicted
Shows a scatterplot of the actual versus the predicted values. The plots for the different models are overlaid.
Plot Residual by Row
Shows a plot of the residuals by row number. The plots for the different models are overlaid.
Profiler
Shows a profiler for each response based on prediction formula columns in your data. The profilers have a row for each model being compared.
Categorical Responses
ROC Curve
Shows ROC curves for each level of the response variable. The curves for the different models are overlaid.
AUC Comparison
Provides a comparison of the area under the ROC curve (AUC) from each model. The area under the curve is an indicator of goodness of fit, with 1 indicating a perfect fit.
The report includes the following information:
standard errors and confidence intervals for each AUC
standard errors, confidence intervals, and hypothesis tests for the difference between each pair of AUCs
an overall hypothesis test for testing whether all AUCs are equal
Lift Curve
Shows lift curves for each level of the response variable. The curves for the different models are overlaid.
Cum Gains Curve
Shows cumulative gains curves for each level of the response variable. A cumulative gains curve is a plot of the proportion of a response level that is identified by the model against the proportion of all responses. A cumulative gains curve for a perfect model would reach 1.0 at the overall proportion of the response level. The curves for the different models are overlaid.
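For example, if a response level occurs in 20% of all rows, a perfect model’s cumulative gains curve reaches 1.0 at 0.20 on the horizontal axis, while a model with no predictive ability follows the diagonal.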
Confusion Matrix
Shows confusion matrices for each model. A confusion matrix is a two-way classification of actual and predicted responses. Count and rate confusion matrices are shown. Separate confusion matrices are produced for each level of the Group variable.
If the response has a Profit Matrix column property, then Actual by Decision Count and Actual by Decision Rate matrices are shown to the right of the confusion matrices. For details about these matrices, see “Additional Examples of Partitioning” in the “Partition Models” chapter.
Profiler
Shows a profiler for each response based on prediction formula columns in your data. The profilers have a row for each model being compared.
Additional Example of Model Comparison
This example uses automobile data to build a model to predict the size of the purchased car. A logistic regression model and a decision tree model are compared.
Begin by selecting Help > Sample Data Library and opening Car Physical Data.jmp.
Create the Logistic Regression Model
1. Select Analyze > Fit Model.
2. Select Type and click Y.
3. Select the following columns and click Add: Country, Weight, Turning Cycle, Displacement, and Horsepower.
4. Click Run.
The Nominal Logistic Fit report appears.
5. Save the prediction formulas to columns by selecting Save Probability Formula from the Nominal Logistic red triangle menu.
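A JSL sketch of these steps follows. The column names match Car Physical Data.jmp; the bare Run argument and the Save Probability Formula message are assumed to mirror the launch button and the red triangle menu item.

// JSL sketch: nominal logistic model for Type.
cars = Open( "$SAMPLE_DATA/Car Physical Data.jmp" );
nl = cars << Fit Model(
	Y( :Type ),
	Effects( :Country, :Weight, :Turning Cycle, :Displacement, :Horsepower ),
	Personality( "Nominal Logistic" ),
	Run
);
nl << Save Probability Formula;  // message name assumed from the menu item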
Create the Decision Tree Model and Save the Prediction Formula to a Column
1. Select Analyze > Predictive Modeling > Partition.
2. Select Type and click Y, Response.
3. Select the Country, Weight, Turning Cycle, Displacement, and Horsepower columns and click X, Factor.
4. Make sure that Decision Tree is selected in the Method list.
5. Click OK.
The Partition report appears.
6. Click Split 10 times.
7. Save the prediction formulas to columns by selecting Save Columns > Save Prediction Formula from the Partition red triangle menu.
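A matching JSL sketch for the decision tree follows; the Split Best and Save Prediction Formula messages are assumptions based on the report’s Split button and menu item.

// JSL sketch: decision tree for Type with 10 splits.
pt = cars << Partition(
	Y( :Type ),
	X( :Country, :Weight, :Turning Cycle, :Displacement, :Horsepower ),
	Method( "Decision Tree" )
);
For( i = 1, i <= 10, i++, pt << Split Best );  // the 10 splits from step 6
pt << Save Prediction Formula;  // message name assumed from the menu item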
Compare the Models
1. Select Analyze > Predictive Modeling > Model Comparison.
2. Select all columns that begin with Prob and click Y, Predictors.
3. Click OK.
The Model Comparison report appears (Figure 10.7).
Figure 10.7 Initial Model Comparison Report
The report shows that the Partition model has slightly higher values for Entropy RSquare and Generalized RSquare and a slightly lower value for Misclassification Rate.
4. Select ROC Curve from the Model Comparison red triangle menu.
ROC curves appear for each level of Type, one of which is shown in Figure 10.8.
Figure 10.8 ROC Curve for Medium
Examining all the ROC curves, you see that the two models are similar in their predictive ability.
5. Select AUC Comparison from the Model Comparison red triangle menu.
AUC Comparison reports appear for each level of Type, one of which is shown in Figure 10.9.
Figure 10.9 AUC Comparison for Medium
The report shows results of a hypothesis test for the difference between the AUC values (the areas under the ROC curves). Examining the results, you see that there is no statistically significant difference between the values for any level of Type.
You conclude that there is no large difference between the predictive abilities of the two models for the following reasons:
The RSquare values and the ROC curves are similar.
There is no statistically significant difference between AUC values.