Bootstrap Forest Platform Overview
The Bootstrap Forest platform predicts a response value by averaging the predicted response values across many decision trees. Each tree is grown on a bootstrap sample of the training data. A bootstrap sample is a random sample of observations, drawn with replacement. In addition, the predictors are sampled at each split in the decision tree. The decision tree is fit using the recursive partitioning methodology described in the “Partition Models” chapter.
The fitting process for the training set proceeds as follows:
1. For each tree, select a bootstrap sample of observations.
2. Fit the individual decision tree, using recursive partitioning, as follows:
Select a random set of predictors for each split.
Continue splitting until a stopping rule that is specified in the Bootstrap Forest Specification window is met.
3. Repeat step 1 and step 2 until the number of trees specified in the Bootstrap Forest Specification window is reached or until Early Stopping occurs.
For an individual tree, the bootstrap sample of observations that is used to fit the tree is drawn with replacement. You can specify the proportion of observations to be sampled. If you specify that 100% of the observations are to be sampled, because they are drawn with replacement, the expected proportion of unused observations is 1/e, or approximately 36.8%. For each individual tree, these unused observations are called the out-of-bag observations. The observations used in fitting the tree are called in-bag observations. For a continuous response, the Bootstrap Forest platform provides measures for the error rate for out-of-bag observations, called out-of-bag error.
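The 1/e figure follows from the probability that a given observation is never drawn in n draws with replacement, (1 - 1/n)^n, which approaches 1/e as n grows. A quick illustrative check in Python (not JMP code):

```python
import math

n = 10_000                       # number of rows, and number of draws
p_unused = (1 - 1 / n) ** n      # chance a given row is never drawn
print(round(p_unused, 4), round(1 / math.e, 4))  # 0.3679 0.3679
```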
For a continuous response, the predicted value for an observation is the average of its predicted values over the collection of individual trees. For a categorical response, the predicted probability for an observation is the average of its predicted probabilities over the collection of individual trees. The observation is classified into the level for which its predicted probability is the highest.
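The fitting and prediction steps above can be sketched in Python. This is an illustrative toy, not JMP code: each "tree" is a single-split regression stump on one predictor, so the per-split predictor sampling is omitted, and all cut values and data are made up.

```python
import random
import statistics

def fit_stump(xs, ys):
    """One-split regression 'tree': pick the cut that minimizes SSE."""
    best = None
    for cut in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        sse = (sum((y - statistics.mean(left)) ** 2 for y in left)
               + sum((y - statistics.mean(right)) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, cut, statistics.mean(left), statistics.mean(right))
    if best is None:                       # degenerate sample: constant model
        m = statistics.mean(ys)
        return lambda x: m
    _, cut, left_mean, right_mean = best
    return lambda x: left_mean if x <= cut else right_mean

def bootstrap_forest(xs, ys, n_trees=50, sample_rate=1.0, seed=123):
    rng = random.Random(seed)
    n = round(sample_rate * len(xs))
    trees = []
    for _ in range(n_trees):
        # bootstrap sample: draw rows with replacement
        idx = [rng.randrange(len(xs)) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    # continuous response: the prediction is the average over all trees
    return lambda x: statistics.mean(tree(x) for tree in trees)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.9, 5.1]
predict = bootstrap_forest(xs, ys)
print(predict(2) < 2.5, predict(7) > 3.5)  # True True
```

For a categorical response, the same averaging step would be applied to the per-tree predicted probabilities rather than to predicted values.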
For more information about bootstrap forests, see Hastie et al. (2009).
Example of Bootstrap Forest with a Categorical Response
In this example, you construct a bootstrap forest model to predict whether a customer is a bad credit risk. But you are aware that your data set contains missing values, so you also explore the degree to which values are missing.
Bootstrap Forest Model
1. Select Help > Sample Data Library and open Equity.jmp.
2. Select Analyze > Predictive Modeling > Bootstrap Forest.
3. Select BAD and click Y, Response.
4. Select LOAN through DEBTINC and click X, Factor.
5. Select Validation and click Validation.
6. Click OK.
7. Next to Maximum Splits per Tree, enter 30.
8. Select Multiple Fits over Number of Terms and enter 5 next to Max Number of Terms.
9. (Optional) Select Suppress Multithreading and enter 123 next to Random Seed.
Because the bootstrap forest method involves random sampling, these actions ensure that you will obtain the exact results shown below.
10. Click OK.
Figure 6.2 Overall Statistics Report
Because the Multiple Fits over Number of Terms option was specified, models were created using 3, 4, and 5 as the number of predictors in each split. The Model Validation-Set Summaries report shows that the model whose Validation set has the highest Entropy RSquare is the five-term model. This is also the model with the smallest misclassification rate. This model is determined to be the best model, and the results in the Overall report are for this model.
The Overall report shows that the misclassification rates for the Validation and Test sets are about 11.3% and 9.9%, respectively. The confusion matrices suggest that the largest source of misclassification is the classification of bad risk customers as good risks.
The results for the Test set give you an indication of how well your model extends to independent observations. The Validation set was used in selecting the Bootstrap Forest model. For this reason, the results for the Validation set give a biased indication of how the model generalizes to independent data.
You are interested in determining which predictors contributed the most to your model.
11. Click the red triangle next to Bootstrap Forest for BAD and select Column Contributions.
Figure 6.3 Column Contributions Report
The Column Contributions report suggests that the strongest predictor of a customer’s credit risk is DEBTINC, the debt to income ratio. The next highest contributors to the model are DELINQ, the number of delinquent credit lines, and VALUE, the assessed value of the customer’s property.
Missing Values
Next, you explore the extent to which predictor values are missing.
1. Select Analyze > Screening > Explore Missing Values.
2. Select BAD through DEBTINC and click Y, Columns.
3. Click OK in the Alert that appears.
The columns REASON and JOB are not added to the Y, Columns list because they have a Character data type. You can see how many values are missing for these two columns using Distribution (not illustrated in this example).
4. Click OK.
Figure 6.4 Missing Values Report
The DEBTINC column contains 1267 missing values, which amounts to about 21% of the observations. Most other columns involved in the Bootstrap Forest analysis also contain missing values. The Informative Missing option in the launch window ensures that the missing values are treated in a way that acknowledges any information that they carry. For details, see “Informative Missing” in the “Partition Models” chapter.
Example of Bootstrap Forest with a Continuous Response
In this example, you construct a bootstrap forest model to predict the percent body fat for male subjects.
1. Select Help > Sample Data Library and open Body Fat.jmp.
2. Select Analyze > Predictive Modeling > Bootstrap Forest.
3. Select Percent body fat and click Y, Response.
4. Select Age (years) through Wrist circumference (cm) and click X, Factor.
5. Select Validation and click Validation.
6. Click OK.
7. (Optional) Select Suppress Multithreading and enter 123 next to Random Seed.
Because the bootstrap forest method involves random sampling, these actions ensure that you will obtain the exact results shown below.
8. Click OK.
Figure 6.5 Overall Statistics
The Overall Statistics report shows that the Validation RSquare is 0.673.
You are interested in obtaining a model-independent indication of the most important predictors.
9. Click the red triangle next to Bootstrap Forest for Percent body fat and select Column Contributions.
Figure 6.6 Column Contributions
The Column Contributions report suggests that Abdomen circumference (cm), Weight (lbs), and Chest circumference (cm) are the strongest predictors for Percent body fat.
Launch the Bootstrap Forest Platform
Launch the Bootstrap Forest platform by selecting Analyze > Predictive Modeling > Bootstrap Forest.
Launch Window
Figure 6.7 Bootstrap Forest Launch Window
The Bootstrap Forest platform launch provides the following options:
Y, Response
The response variable or variables that you want to analyze.
X, Factor
The predictor variables.
Weight
A column whose numeric values assign a weight to each row in the analysis.
Freq
A column whose numeric values assign a frequency to each row in the analysis.
Validation
A numeric column that contains at most three distinct values. See “Validation” in the “Partition Models” chapter.
By
A column or columns whose levels define separate analyses. For each level of the specified column, the corresponding rows are analyzed using the other variables that you have specified. The results are presented in separate reports. If more than one By variable is assigned, a separate report is produced for each possible combination of the levels of the By variables.
Method
Enables you to select the partition method (Decision Tree, Bootstrap Forest, Boosted Tree, K Nearest Neighbors, or Naive Bayes). All methods other than Decision Tree are available only in JMP Pro.
Validation Portion
The portion of the data to be used as the validation set. See “Validation” in the “Partition Models” chapter.
Informative Missing
If selected, enables missing value categorization for categorical predictors and informative treatment of missing values for continuous predictors. See “Informative Missing” in the “Partition Models” chapter.
Ordinal Restricts Order
If selected, restricts consideration of splits to those that preserve the ordering.
Specification Window
After you select OK in the launch window, the Bootstrap Forest Specification window appears.
Figure 6.8 Bootstrap Forest Specification Window
Specification Panel
Number of Rows
The number of rows in the data table.
Number of Terms
The number of columns that are specified as predictors.
Forest Panel
Number of Trees in the Forest
Number of trees to grow and then average.
Number of Terms Sampled per Split
Number of predictors to consider as splitting candidates at each split. For each split, a new random sample of predictors is taken as the candidate set.
Bootstrap Sample Rate
Proportion of observations to sample (with replacement) for growing each tree. A new random sample is generated for each tree.
Minimum Splits Per Tree
Minimum number of splits for each tree.
Maximum Splits Per Tree
Maximum number of splits for each tree.
Minimum Size Split
Minimum number of observations needed on a candidate split.
Early Stopping
(Available only if validation is used.) If selected, the process stops growing additional trees if adding more trees does not improve the validation statistic. The validation statistic is the validation set’s Entropy RSquare value for a categorical response and its RSquare value for a continuous response. If not selected, the process continues until the specified number of trees is reached.
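The documentation does not spell out the exact stopping test, but the general idea can be sketched as a patience-style check on the validation statistic as trees are added. The values and the `patience` parameter below are illustrative assumptions, not JMP settings:

```python
# illustrative validation-RSquare values as trees are added
history = [0.50, 0.58, 0.62, 0.63, 0.63, 0.62, 0.61]

patience = 2                      # assumed tolerance, not a JMP setting
best, best_i = float("-inf"), 0
for i, r2 in enumerate(history):
    if r2 > best:
        best, best_i = r2, i      # adding trees still helps
    elif i - best_i >= patience:  # no improvement for `patience` checks
        break
print(best_i + 1, best)  # 4 0.63
```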
Multiple Fits Panel
Multiple Fits over Number of Terms
If selected, creates a bootstrap forest for several values of number of terms sampled per split. The model for which results are displayed is the model whose Validation Set’s Entropy RSquare value (for a categorical response) or RSquare (for a continuous response) is the largest.
The lower bound is the Number of Terms Sampled per Split specification. The upper bound is specified by the following option:
Max Number of Terms
The maximum number of terms to consider for a split.
Use Tuning Table Design
Opens a window where you can select a data table, called a tuning design table, that contains values for the Forest panel tuning parameters. A tuning design table has a column for each option that you want to specify and one or more rows, each representing a single Bootstrap Forest model design. If an option is not specified in the tuning design table, the default value is used.
For each row in the table, JMP creates a Bootstrap Forest model using the tuning parameters specified. If more than one model is specified in the tuning design table, the Model Validation-Set Summaries report lists the RSquare value for each model. The Bootstrap Forest report shows the fit statistics for the model with the largest RSquare value.
You can create a tuning design table using the Design of Experiments facilities. A bootstrap forest tuning design table can contain the following case-insensitive columns in any order:
Number Trees
Number Terms
Portion Bootstrap
Minimum Splits per Tree
Maximum Splits per Tree
Minimum Size Split
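The tuning-table loop amounts to fitting one model per row and keeping the fit with the largest validation RSquare. A schematic Python version, where `fit_and_score` is a hypothetical stand-in for the actual model fit:

```python
# each row of the tuning design table is one Bootstrap Forest design
tuning_table = [
    {"Number Trees": 50,  "Number Terms": 3, "Portion Bootstrap": 1.0},
    {"Number Trees": 100, "Number Terms": 5, "Portion Bootstrap": 0.8},
    {"Number Trees": 100, "Number Terms": 4, "Portion Bootstrap": 1.0},
]

def fit_and_score(design):
    """Stand-in: fit a forest with these settings, return validation RSquare."""
    # hypothetical scores; a real implementation would fit and validate here
    return {50: 0.61, 100: 0.67}[design["Number Trees"]]

scores = [fit_and_score(row) for row in tuning_table]
best_row = tuning_table[scores.index(max(scores))]
print(best_row["Number Terms"], max(scores))  # 5 0.67
```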
Reproducibility Panel
Suppress Multithreading
If selected, all calculations are performed on a single thread.
Random Seed
Specify a nonzero numeric random seed to reproduce the results for future launches of the platform. By default, the Random Seed is set to zero, which does not produce reproducible results. When you save the analysis to a script, the random seed that you enter is saved to the script.
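The role of a fixed nonzero seed can be illustrated with Python's random module (illustrative only; JMP's random number generator differs):

```python
import random

def bootstrap_draws(seed, n=5):
    """Replay the same sequence of bootstrap indices for a given seed."""
    rng = random.Random(seed)
    return [rng.randrange(10) for _ in range(n)]

# a fixed seed makes the random sampling, and hence the forest, repeatable
print(bootstrap_draws(123) == bootstrap_draws(123))  # True
```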
The Bootstrap Forest Report
After you click OK in the Bootstrap Forest Specification window, the Bootstrap Forest report appears.
Figure 6.9 Bootstrap Forest Report for a Categorical Response
Figure 6.10 Bootstrap Forest Report for a Continuous Response
The following reports are provided, depending on whether the response is categorical or continuous:
Model Validation-Set Summaries
(Available when you select the Multiple Fits over Number of Terms option in Bootstrap Forest Specification window.) Provides fit statistics for all the models fit. See Figure 6.9 and “Multiple Fits Panel”.
Specifications
Shows the settings used in fitting the model.
Overall Statistics
Provides fit statistics for the training set, and for the validation and test sets if they are specified. The specific form of the report depends on the modeling type of the response.
Suppose that multiple models are fit using the Multiple Fits over Number of Terms option in the Bootstrap Forest Specification window. Then the model for which results are displayed in the Overall Statistics and Cumulative Validation reports is the model for which the validation set’s Entropy RSquare value (for a categorical response) or RSquare (for a continuous response) is the largest.
Categorical Response
Measures Report
Gives the following statistics for the training set, and for the validation and test sets if they are specified.
Note: For Entropy RSquare and Generalized RSquare, values closer to 1 indicate a better fit. For Mean -Log p, RMSE, Mean Abs Dev, and Misclassification Rate, smaller values indicate a better fit.
Entropy RSquare
One minus the ratio of the negative log-likelihoods from the fitted model and the constant probability model. It ranges from 0 to 1.
Generalized RSquare
A measure that can be applied to general regression models. It is based on the likelihood function L and is scaled to have a maximum value of 1. The value is 1 for a perfect model, and 0 for a model no better than a constant model. The Generalized RSquare measure simplifies to the traditional RSquare for continuous normal responses in the standard least squares setting. Generalized RSquare is also known as the Nagelkerke or Craig and Uhler R2, which is a normalized version of Cox and Snell’s pseudo R2.
Mean -Log p
The average of negative log(p), where p is the fitted probability associated with the event that occurred.
RMSE
The root mean square error, adjusted for degrees of freedom. The differences are between 1 and p, the fitted probability for the response level that actually occurred.
Mean Abs Dev
The average of the absolute values of the differences between the response and the predicted response. The differences are between 1 and p, the fitted probability for the response level that actually occurred.
Misclassification Rate
The rate for which the response category with the highest fitted probability is not the observed category.
N
The number of observations.
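For a binary response, Entropy RSquare and the Misclassification Rate can be computed directly from the fitted probabilities. A small worked example with hypothetical observed outcomes and fitted probabilities:

```python
import math

observed = [1, 0, 1, 1, 0, 1]                 # 1 = event occurred
fitted   = [0.9, 0.2, 0.8, 0.4, 0.3, 0.9]    # fitted P(event) per row

def neg_log_lik(probs):
    # -sum of log(p) for the outcome that actually occurred
    return -sum(math.log(p if y == 1 else 1 - p)
                for y, p in zip(observed, probs))

p_const    = sum(observed) / len(observed)    # constant probability model
entropy_r2 = 1 - neg_log_lik(fitted) / neg_log_lik([p_const] * len(observed))

predicted  = [1 if p >= 0.5 else 0 for p in fitted]
misclass   = sum(y != c for y, c in zip(observed, predicted)) / len(observed)

print(round(entropy_r2, 3), round(misclass, 3))  # 0.495 0.167
```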
Confusion Matrix
(Available only for categorical responses.) Shows classification statistics for the training set, and for the validation and test sets if they are specified.
Decision Matrix
(Available only for categorical responses and if the response has a Profit Matrix column property or if you specify costs using the Specify Profit Matrix option.) Gives Decision Count and Decision Rate matrices for the training set, and for the validation and test sets if they are specified. See “Additional Examples of Partitioning” in the “Partition Models” chapter.
Continuous Response
Individual Trees Report
Gives RMSE values, which are averaged over all trees, for In Bag and Out of Bag observations. Training set observations that are used to construct a tree are called in-bag observations. Training observations that are not used to construct a tree are called out-of-bag (OOB) observations.
For each tree, the Out of Bag RMSE is computed as the square root of the sum of squared errors divided by the number of OOB observations. The squared Out of Bag RMSE for each tree is given in the Per-Tree Summaries report as OOB SSE/N.
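The relationship between OOB SSE, OOB N, and the per-tree Out of Bag RMSE is simply RMSE = sqrt(SSE/N). A toy check with hypothetical residuals:

```python
import math

oob_residuals = [0.5, -1.0, 0.25, 2.0]        # actual - predicted, OOB rows
oob_sse  = sum(e * e for e in oob_residuals)  # reported as OOB SSE
oob_n    = len(oob_residuals)                 # reported as OOB N
oob_rmse = math.sqrt(oob_sse / oob_n)         # squared value = OOB SSE/N
print(round(oob_sse, 4), round(oob_rmse, 4))  # 5.3125 1.1524
```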
RSquare and RMSE Report
Gives RSquare, root mean square error, and the number of observations for the training set, and for the validation and test sets if they are defined.
Cumulative Validation
(Available only if validation is used.) Shows a plot of the fit statistics for the Validation set versus the number of trees.
For a continuous response, the single fit statistic is RSquare. For a categorical response, the fit statistics are listed below and are described in “Measures Report”.
RSquare (Entropy RSquare)
Avg - Log p (Mean - Log p)
RMS Error (RMSE)
Avg Abs Error (Mean Abs Dev)
MR (Misclassification Rate)
The Cumulative Details report below the Cumulative Validation plot gives the values used in the plot.
Per-Tree Summaries
The Per-Tree Summaries report involves the concepts of in-bag and out-of-bag observations. For an individual tree, the bootstrap sample of observations used in fitting the tree is drawn with replacement. Even if you specify that 100% of the observations are to be sampled, because they are drawn with replacement, the expected proportion of unused observations is 1/e. For each individual tree, the unused observations are called the out-of-bag observations. The observations used in fitting the tree are called in-bag observations.
The Per-Tree Summaries report shows the following summary statistics for each tree:
Splits
Number of splits in the decision tree.
Rank
Rank of the tree’s OOB Loss in ascending order. The tree with the smallest OOB loss has Rank 1.
OOB Loss
A measure of the total predictive inaccuracy of the tree when applied to the Out Of Bag rows. Lower values indicate a higher predictive accuracy.
OOB Loss/N
The OOB Loss divided by the number of OOB rows, OOB N.
RSquare
(Available only for continuous responses.) The RSquare value for the tree.
IB SSE
(Available only for continuous responses.) Sum of squared errors for the In Bag rows.
IB SSE/N
(Available only for continuous responses.) Sum of squared errors for the In Bag rows divided by the number of In Bag observations. The number of In Bag observations is equal to the number of observations in the training set multiplied by the bootstrap sampling rate that you specify in the Bootstrap Forest Specification window.
OOB N
(Available only for continuous responses.) Number of Out Of Bag rows.
OOB SSE
(Available only for continuous responses.) Sum of squared errors when the tree is applied to the Out Of Bag rows.
OOB SSE/N
(Available only for continuous responses.) The OOB SSE divided by the number of OOB rows, OOB N.
Bootstrap Forest Platform Options
The Bootstrap Forest report red triangle menu has the following options:
Plot Actual by Predicted
(Available only for continuous responses.) Provides a plot of actual versus predicted values.
Column Contributions
Displays a report that shows each input column’s contribution to the fit. The report also shows:
The total number of splits defined by a column.
The total G2 (for a categorical response) or SS, sum of squares (for a continuous response), attributed to the column.
A bar chart of G2 or SS.
The proportion of G2 or SS attributed to the column.
Show Trees
Provides various options for displaying trees in the Tree Views report. The report gives a picture of each individual tree that is grown in the forest. For a description of the Prob column shown by the Show names categories estimates option, see “Predicted Probabilities in Decision Tree and Bootstrap Forest” in the “Partition Models” chapter.
ROC Curve
(Available only for categorical responses.) See “ROC Curve” in the “Partition Models” chapter.
Lift Curve
(Available only for categorical responses.) See “Lift Curve” in the “Partition Models” chapter.
Save Columns
Contains options for saving model and tree results, and creating SAS code.
Save Predicteds
Saves the predicted values from the model to the data table.
Save Prediction Formula
Saves the prediction formula to a column in the data table. The formula consists of nested conditional clauses that describe the tree structure. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property.
Save Tolerant Prediction Formula
(The Save Prediction Formula option should be used instead of this option. Use this option only when Save Prediction Formula is not available.) Saves a formula that predicts even when there are missing values and when Informative Missing has not been selected. The prediction formula tolerates missing values by randomly allocating response values for missing predictors to a split. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property. If you have selected Informative Missing, you can save the Tolerant Prediction Formula by holding the Shift key as you click on the report’s red triangle.
Save Residuals
(Available only for continuous responses.) Saves the residuals to the data table.
Save Cumulative Details
(Available only if validation is used.) Creates a data table containing the fit statistics for each tree.
Publish Prediction Formula
Creates a prediction formula and saves it as a formula column script in the Formula Depot platform. If a Formula Depot report is not open, this option creates a Formula Depot report. See the “Formula Depot” chapter.
Publish Tolerant Prediction Formula
(The Publish Prediction Formula option should be used instead of this option. Use this option only when Publish Prediction Formula is not available.) Creates a tolerant prediction formula and saves it as a formula column script in the Formula Depot platform. If a Formula Depot report is not open, this option creates a Formula Depot report. See the “Formula Depot” chapter. If you have selected Informative Missing, you can use this option by holding the Shift key as you click on the report’s red triangle.
Make SAS DATA Step
Creates SAS code for scoring a new data set.
Specify Profit Matrix
(Available only for categorical responses.) Enables you to specify profit or costs associated with correct or incorrect classification decisions. See “Show Fit Details” in the “Partition Models” chapter.
Profiler
Shows a Prediction Profiler. For more information, see the Profiler chapter in the Profilers book.
See the JMP Reports chapter in the Using JMP book for more information about the following options:
Local Data Filter
Shows or hides the local data filter that enables you to filter the data used in a specific report.
Redo
Contains options that enable you to repeat or relaunch the analysis. In platforms that support the feature, the Automatic Recalc option immediately reflects the changes that you make to the data table in the corresponding report window.
Save Script
Contains options that enable you to save a script that reproduces the report to several destinations.
Save By-Group Script
Contains options that enable you to save a script that reproduces the platform report for all levels of a By variable to several destinations. Available only when a By variable is specified in the launch window.
 