Boosted Tree Platform Overview
The Boosted Tree platform produces an additive decision tree model that is based on many smaller decision trees that are constructed in layers. The tree in each layer consists of a small number of splits, typically five or fewer. Each layer is fit using the recursive fitting methodology described in the “Partition Models” chapter. The only difference is that fitting stops at a specified number of splits. For a given tree, the predicted value for an observation in a leaf is the mean of all observations in that leaf.
The fitting process proceeds as follows:
1. Fit an initial layer.
2. Compute residuals. These are obtained by subtracting the predicted mean for the observations within a leaf from their actual values.
3. Fit a layer to the residuals.
4. Construct the additive tree. For a given observation, sum its predicted values over the layers.
5. Repeat steps 2 through 4 until the specified number of layers is reached or, if validation is used, until fitting an additional layer no longer improves the validation statistic.
The final prediction is the sum of the predictions for an observation over all the layers.
By fitting successive layers on residuals from previous layers, each layer can improve the fit.
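To make this concrete, the following Python sketch mirrors steps 1 through 5 for a continuous response. It uses scikit-learn's DecisionTreeRegressor as a stand-in for each layer and includes the learning rate described later in “Boosting Panel”; it is an illustration of the idea, not JMP's implementation, and the parameter values are arbitrary.

# Illustrative sketch of additive boosting on residuals (not JMP's code).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_tree(X, y, n_layers=50, splits_per_tree=3, learning_rate=0.1):
    layers = []
    prediction = np.zeros(len(y))                # running sum over layers
    for _ in range(n_layers):
        residuals = y - prediction               # step 2: actual minus predicted
        tree = DecisionTreeRegressor(max_leaf_nodes=splits_per_tree + 1)
        tree.fit(X, residuals)                   # steps 1 and 3: fit a small tree
        prediction += learning_rate * tree.predict(X)   # step 4: additive update
        layers.append(tree)                      # step 5: repeat until done
    return layers

def predict_boosted_tree(layers, X, learning_rate=0.1):
    # The final prediction sums the (scaled) layer predictions over all layers.
    return sum(learning_rate * tree.predict(X) for tree in layers)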
For categorical responses, only those with two response levels are supported. For a categorical response, the residuals fit at each layer are offsets of linear logits. The final prediction is a logistic transformation of the sum of the linear logits over all the layers.
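As a small illustration of the final step for a two-level response (the per-layer offsets below are made-up numbers, not output from JMP):

import numpy as np

layer_logits = np.array([0.40, -0.12, 0.25, 0.08])  # hypothetical per-layer logit offsets
logit_sum = layer_logits.sum()                      # sum of the linear logits over layers
prob = 1.0 / (1.0 + np.exp(-logit_sum))             # logistic transformation -> P(event)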
For more information about boosted trees, see Hastie et al. (2009).
Example of Boosted Tree with a Categorical Response
In this example, you construct a boosted tree model to predict which printing jobs are affected by a defect called banding.
1. Select Help > Sample Data and open Bands Data.jmp.
2. Select Analyze > Predictive Modeling > Boosted Tree.
3. Select Banding? and click Y, Response.
4. Select the Predictors column group and click X, Factor.
5. Enter 0.2 for Validation Portion.
6. Click OK.
The Boosted Tree Specification window appears.
7. (Optional) In the Reproducibility panel, select Suppress Multithreading and enter 123 for Random Seed.
Because the boosted tree fit involves a random component, these actions ensure that you obtain the exact results shown below.
8. Click OK.
Figure 7.2 Overall Statistics for Nominal Response
Because the response, Banding?, is categorical, the Boosted Tree analysis provides a Misclassification Rate in the Measures report and a Confusion Matrix report. The Misclassification Rate for the validation set is 0.2222, or about 22%.
9. Click the red triangle next to Boosted Tree for Banding? and select Show Trees > Show names categories estimates.
A Tree Views report appears, with outlines for the layers. You can examine the layers to see the trees that are fit and the predicted values.
Figure 7.3 Layer 1 of the Boosted Tree
10. Click the red triangle next to Boosted Tree for Banding? and select Save Columns > Save Prediction Formula.
Columns called Prob(Banding?==noband), Prob(Banding?==band), and Most Likely Banding? are added to the data table. Examine the Prob(Banding?==noband) column to see how model predictions are calculated from the layers.
Example of Boosted Tree with a Continuous Response
In this example, you construct a boosted tree model to predict the percent body fat given a combination of nominal and continuous factors.
1. Select Help > Sample Data and open the Body Fat.jmp sample data table.
2. Select Analyze > Predictive Modeling > Boosted Tree.
3. Select Percent body fat and click Y, Response.
4. Select Age (years) through Wrist circumference (cm) and click X, Factor.
5. Select Validation and click Validation.
6. Click OK.
7. Click OK.
Figure 7.4 Overall Statistics for Continuous Response
The Overall Statistics report provides the R-square and RMSE for the boosted tree model. The R-square for the validation set is 0.603. The RMSE for the validation set is about 5.48.
You are interested in obtaining a model-independent indication of the important predictors for Percent body fat.
8. Click the red triangle next to Boosted Tree for Percent body fat and select Profiler.
9. Click the red triangle next to Prediction Profiler and select Assess Variable Importance > Independent Uniform Inputs.
Note: Because Assess Variable Importance uses randomization, your results will not exactly match those in Figure 7.5.
Figure 7.5 Summary Report for Variable Importance
The Summary Report shows that Abdomen circumference (cm) is the most important predictor of Percent body fat.
Launch the Boosted Tree Platform
Launch the Boosted Tree platform by selecting Analyze > Predictive Modeling > Boosted Tree.
Launch Window
Figure 7.6 Boosted Tree Launch Window Using Body Fat.jmp
The Boosted Tree platform launch window has the following options:
Y, Response
The response variable or variables that you want to analyze.
X, Factor
The predictor variables.
Weight
A column whose numeric values assign a weight to each row in the analysis.
Freq
A column whose numeric values assign a frequency to each row in the analysis.
Validation
A numeric column that contains at most three distinct values. See “Validation” in the “Partition Models” chapter.
By
A column or columns whose levels define separate analyses. For each level of the specified column, the corresponding rows are analyzed using the other variables that you have specified. The results are presented in separate reports. If more than one By variable is assigned, a separate report is produced for each possible combination of the levels of the By variables.
Method
Enables you to select the partition method (Decision Tree, Bootstrap Forest, Boosted Tree, K Nearest Neighbors, or Naive Bayes). Methods other than Decision Tree are available only in JMP Pro.
Validation Portion
The portion of the data to be used as the validation set. See “Validation” in the “Partition Models” chapter.
Informative Missing
If selected, enables missing value categorization for categorical predictors and informative treatment of missing values for continuous predictors. See “Informative Missing” in the “Partition Models” chapter.
Ordinal Restricts Order
If selected, restricts consideration of splits to those that preserve the ordering.
Specification Window
After you click OK in the launch window, the Boosted Tree Specification window appears.
Figure 7.7 Boosted Tree Specification Window
Boosting Panel
Number of Layers
Maximum number of layers to include in the final tree.
Splits per Tree
Number of splits for each layer.
Learning Rate
A number r such that 0 < r ≤ 1. Learning rates close to 1 result in faster convergence on a final tree but also have a higher tendency to overfit the data; use learning rates closer to 1 when a small Number of Layers is specified. A small learning rate, typically between 0.01 and 0.1, slows the convergence of the model, which preserves opportunities for later layers to use different splits than the earlier layers. (See the update formula after this list.)
Overfit Penalty
(Available only for categorical responses.) A biasing parameter that helps protect against fitting probabilities equal to zero. See “Overfit Penalty”.
Minimum Size Split
Minimum number of observations needed on a candidate split.
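Written as a formula, using the standard description of boosting updates (see Hastie et al. 2009) rather than a transcription of JMP's internals, the model after layer m adds a fraction r of the new layer's prediction T_m(x) to the previous fit:

F_m(x) = F_{m-1}(x) + r · T_m(x), where 0 < r ≤ 1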
Multiple Fits Panel
Multiple Fits over Splits and Learning Rate
If selected, creates a boosted tree for every combination of Splits per Tree (in integer increments) and Learning Rate (in 0.1 increments).
The lower bounds for the combinations are specified by the Splits per Tree and Learning Rate options. The upper bounds for the combinations are specified by the following options:
Max Splits per Tree
Upper bound for Splits per Tree.
Max Learning Rate
Upper bound for Learning Rate.
Use Tuning Design Table
Opens a window where you can select a data table containing values for some tuning parameters, called a tuning design table. A tuning design table has a column for each option that you want to specify and one or more rows, each representing a single Boosted Tree model design. If an option is not specified in the tuning design table, the default value is used.
For each row in the table, JMP creates a Boosted Tree model using the tuning parameters specified. If more than one model is specified in the tuning design table, the Model Validation-Set Summaries report lists the R-Square value for each model. The Boosted Tree report shows the fit statistics for the model with the largest R-Square value.
You can create a tuning design table using the Design of Experiments facilities. A boosted tree tuning design table can contain the following case-insensitive columns in any order (a sketch of such a table follows the list):
Number of Layers
Splits per Tree
Learning Rate
Minimum Size Split
Row Sampling Rate
Column Sampling Rate
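The structure of such a table is a simple grid: one column per tuning parameter and one row per candidate model. The following Python sketch builds an example grid (the parameter values are arbitrary illustrations, not recommendations):

from itertools import product
import pandas as pd

# One row per candidate Boosted Tree model; column names match the
# tuning parameters listed above (values are illustrative only).
grid = list(product([50, 100],      # Number of Layers
                    [3, 5],         # Splits per Tree
                    [0.05, 0.10]))  # Learning Rate
tuning_table = pd.DataFrame(grid,
    columns=["Number of Layers", "Splits per Tree", "Learning Rate"])
print(tuning_table)                 # 8 rows, one per combination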
Stochastic Boosting Panel
Row Sampling Rate
Proportion of training rows to sample for each layer.
Note: When the response is categorical, the training rows are sampled using stratified random sampling.
Column Sampling Rate
Proportion of predictor columns to sample for each layer.
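A minimal Python sketch of per-layer sampling (plain random sampling; the stratified sampling that JMP uses for categorical responses is not shown):

import numpy as np

rng = np.random.default_rng(123)
n_rows, n_cols = 400, 20           # hypothetical training-set dimensions
row_rate, col_rate = 0.8, 0.5      # sampling rates (illustrative values)

# For each layer, sample a proportion of rows and of predictor columns
# without replacement; the layer's tree is then fit to that subsample.
row_idx = rng.choice(n_rows, size=int(row_rate * n_rows), replace=False)
col_idx = rng.choice(n_cols, size=int(col_rate * n_cols), replace=False)
# For example: tree.fit(X[np.ix_(row_idx, col_idx)], y[row_idx])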
Reproducibility Panel
Suppress Multithreading
If selected, all calculations are performed on a single thread.
Random Seed
Specify a nonzero numeric random seed to reproduce the results for future launches of the platform. By default, the Random Seed is set to zero, which does not produce reproducible results. When you save the analysis to a script, the random seed that you enter is saved to the script.
Early Stopping
Early Stopping
If selected, the boosting process stops fitting additional layers when adding more layers does not improve the validation statistic. If not selected, the boosting process continues until the specified number of layers is reached. This option appears only if validation is used.
The Boosted Tree Report
After you click OK in the Boosted Tree Specification window, the Boosted Tree report opens.
Figure 7.8 Boosted Tree Report for a Continuous Response
Figure 7.9 Boosted Tree Report for a Categorical Response
The following reports are provided, depending on whether the response is categorical or continuous:
Model Validation-Set Summaries
Shows fit statistics for all the models fit if you selected the Multiple Fits over Splits and Learning Rate option in the Specification window. See Figure 7.8 and “Multiple Fits Panel”.
Specifications
Shows the settings used in fitting the model.
Overall Statistics
Shows fit statistics for the training set, and for the validation and test sets if they are specified.
Suppose that you fit multiple models using the Multiple Fits over Splits and Learning Rate option in the Boosted Tree Specification window. Then the model for which results are displayed in the Overall Statistics and Cumulative Validation reports is the one for which the validation set’s Entropy R-Square value (for a categorical response) or R-Square value (for a continuous response) is the largest.
Measures Report
(Available only for categorical responses.) Gives the following statistics for the training set, and for the validation and test sets if they are specified. A worked sketch of these calculations follows the list.
Note: For Entropy R-Square and Generalized R-Square, values closer to 1 indicate a better fit. For Mean -Log p, RMSE, Mean Abs Dev, and Misclassification Rate, smaller values indicate a better fit.
Entropy RSquare
One minus the ratio of the negative log-likelihoods from the fitted model and the constant probability model. Entropy R-Square ranges from 0 to 1.
Generalized RSquare
A measure that can be applied to general regression models. It is based on the likelihood function L and is scaled to have a maximum value of 1. The value is 1 for a perfect model and 0 for a model no better than a constant model. The Generalized R-Square measure simplifies to the traditional R-Square for continuous normal responses in the standard least squares setting. Generalized R-Square is also known as the Nagelkerke or Cragg and Uhler R², which is a normalized version of Cox and Snell’s pseudo R².
Mean -Log p
The average of negative log(p), where p is the fitted probability associated with the event that occurred.
RMSE
The root mean square error, adjusted for degrees of freedom. The differences are between 1 and p, the fitted probability for the response level that actually occurred.
Mean Abs Dev
The average of the absolute values of the differences between the response and the predicted response. The differences are between 1 and p, the fitted probability for the response level that actually occurred.
Misclassification Rate
The rate for which the response category with the highest fitted probability is not the observed category.
N
The number of observations.
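As a worked sketch, the following Python code computes these measures for a two-level response from hypothetical fitted probabilities. The RMSE shown here omits the degrees-of-freedom adjustment that JMP applies.

import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])                 # observed events (hypothetical)
p = np.array([0.8, 0.3, 0.6, 0.9, 0.4, 0.7])     # fitted P(event) for each row

p_obs = np.where(y == 1, p, 1 - p)               # probability of the level that occurred
neg_loglik = -np.log(p_obs).sum()                # fitted model
p0 = y.mean()                                    # constant probability model
neg_loglik0 = -np.log(np.where(y == 1, p0, 1 - p0)).sum()

entropy_rsq = 1 - neg_loglik / neg_loglik0       # Entropy RSquare
mean_neg_log_p = neg_loglik / len(y)             # Mean -Log p
rmse = np.sqrt(np.mean((1 - p_obs) ** 2))        # RMSE (no df adjustment here)
mean_abs_dev = np.mean(np.abs(1 - p_obs))        # Mean Abs Dev
misclass_rate = np.mean((p >= 0.5) != (y == 1))  # Misclassification Rate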
Confusion Matrix
(Available only for categorical responses.) Shows classification statistics for the training set, and for the validation and test sets if they are specified.
Decision Matrix
(Available only for categorical responses and if the response has a Profit Matrix column property or if you specify costs using the Specify Profit Matrix option.) Gives Decision Count and Decision Rate matrices for the training set, and for the validation and test sets if they are specified. See “Additional Examples of Partitioning” in the “Partition Models” chapter.
Cumulative Validation
(Available only if validation is used.) Shows a plot of the fit statistics for the Validation set versus the number of layers.
For a continuous response, the single fit statistic is R-Square. For a categorical response, the fit statistics are listed below and are described in “Measures Report”.
R-Square (Entropy R-Square)
Avg -Log p (Mean -Log p)
RMS Error (RMSE)
Avg Abs Error (Mean Abs Dev)
MR (Misclassification Rate)
The Cumulative Details report below the Cumulative Validation plot gives the values used in the plot.
Boosted Tree Platform Options
The Boosted Tree report red-triangle menu has the following options:
Show Trees
Provides options for displaying trees in the Tree Views report. The report gives a picture of the tree that is fit at each layer of the boosting process.
Plot Actual by Predicted
(Available only for continuous responses.) Provides a plot of actual versus predicted values.
Column Contributions
Displays a report showing each input column’s contribution to the fit. The report also shows:
The total number of splits defined by a column.
The total G² (for a categorical response) or SS, the sum of squares (for a continuous response), attributed to the column.
A bar chart of G² or SS.
The proportion of G² or SS attributed to the column.
ROC Curve
(Available only for categorical responses.) See “ROC Curve” in the “Partition Models” chapter.
Lift Curve
(Available only for categorical responses.) See “Lift Curve” in the “Partition Models” chapter.
Save Columns
Contains options for saving model and tree results, and creating SAS code.
Save Predicteds
Saves the predicted values from the model to the data table.
Save Prediction Formula
Saves the prediction formula to a column in the data table. The formula consists of nested conditional clauses that describe the tree structure. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property.
Save Tolerant Prediction Formula
(The Save Prediction Formula option should be used instead of this option. Use this option only when Save Prediction Formula is not available.) Saves a formula that predicts even when there are missing values and when Informative Missing has not been selected. The prediction formula tolerates missing values by randomly allocating response values for missing predictors to a split. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property. If you have selected Informative Missing, you can save the Tolerant Prediction Formula by holding the Shift key as you click on the report’s red triangle.
Save Residuals
(Available only for continuous responses.) Saves the residuals to the data table.
Save Offset Estimates
(Available only for categorical responses.) Saves the sums of the linear components. These are the logits of the fitted probabilities.
Save Tree Details
Creates a data table containing split details and estimates for each layer.
Save Cumulative Details
(Available only if validation is used.) Creates a data table containing the fit statistics for each layer.
Publish Prediction Formula
Creates a prediction formula and saves it as a formula column script in the Formula Depot platform. If a Formula Depot report is not open, this option creates a Formula Depot report. See the “Formula Depot” chapter.
Publish Tolerant Prediction Formula
(The Publish Prediction Formula option should be used instead of this option. Use this option only when Publish Prediction Formula is not available.) Creates a tolerant prediction formula and saves it as a formula column script in the Formula Depot platform. If a Formula Depot report is not open, this option creates a Formula Depot report. See the “Formula Depot” chapter. If you have selected Informative Missing, you can use this option by holding the Shift key as you click on the report’s red triangle.
Make SAS DATA Step
Creates SAS code for scoring a new data set.
Specify Profit Matrix
(Available only for categorical responses.) Enables you to specify profit or costs associated with correct or incorrect classification decisions. See “Show Fit Details” in the “Partition Models” chapter.
Profiler
Shows a Prediction Profiler. For more information, see the Profiler chapter in the Profilers book.
See the JMP Reports chapter in the Using JMP book for more information about the following options:
Local Data Filter
Shows or hides the local data filter that enables you to filter the data used in a specific report.
Redo
Contains options that enable you to repeat or relaunch the analysis. In platforms that support the feature, the Automatic Recalc option immediately reflects the changes that you make to the data table in the corresponding report window.
Save Script
Contains options that enable you to save a script that reproduces the report to several destinations.
Save By-Group Script
Contains options that enable you to save a script that reproduces the platform report for all levels of a By variable to several destinations. Available only when a By variable is specified in the launch window.
Statistical Details for the Boosted Tree Platform
This section describes details specific to the Boosted Tree Platform. For details about recursive decision trees, see “Statistical Details” in the “Partition Models” chapter.
Overfit Penalty
When the response is categorical, a parametric penalty is imposed. For each layer, the estimates minimize the negative log-likelihood plus the penalty value multiplied by the sum of squares of the estimates for each observation. This penalty encourages each new layer not to overfit the training data.
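Written as a formula, as a direct transcription of the description above (with λ denoting the overfit penalty and p̂_i the estimated probability for observation i), each layer’s estimates minimize:

-log L + λ · Σ_i p̂_i²

where log L is the layer’s log-likelihood.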