Partition Platform Overview
The Partition platform recursively partitions data according to a relationship between the predictors and response values, creating a decision tree. Variations of partitioning go by many names and brand names: decision trees, CART™, CHAID™, C4.5, C5, and others. The technique is often considered a data mining technique for the following reasons:
it is useful for exploring relationships without having a good prior model
it handles large problems easily
the results are interpretable
A classic application of partitioning is to create a diagnostic heuristic for a disease. Given symptoms and outcomes for a number of subjects, partitioning can be used to generate a hierarchy of questions to help diagnose new patients.
Predictors can be either continuous or categorical (nominal or ordinal). If a predictor is continuous, then the splits are created by a cutting value. The sample is divided into values below and above this cutting value. If a predictor is categorical, then the sample is divided into two groups of levels.
The response can also be either continuous or categorical (nominal or ordinal). If the response is continuous, then the platform fits the means of the response values. If the response is categorical, then the fitted value is a probability for the levels of the response. In either case, the split is chosen to maximize the difference in the responses between the two nodes of the split.
For more information about split criteria, see “Statistical Details”.
For more information about recursive partitioning, see Hawkins, D. M., and Kass, G. V. (1982) and Kass, G. V. (1980).
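The split-selection idea for a continuous response can be sketched in a few lines of Python. This is an illustration of the general technique, not JMP’s implementation: every candidate cut point on a continuous predictor is scored by how much it reduces the sum of squared errors, and the cut with the largest reduction wins.

```python
# Hypothetical sketch of best-split selection for a continuous predictor
# and continuous response. JMP scores splits with LogWorth; plain sum of
# squares (SS) reduction is used here to keep the sketch short.

def ss(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (cut_point, ss_reduction) for the best binary split on x."""
    total = ss(y)
    pairs = sorted(zip(x, y))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                      # cannot cut between equal x values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        reduction = total - ss(left) - ss(right)
        if reduction > best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, reduction)
    return best

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
cut, gain = best_split(x, y)   # the sample divides cleanly at x = 6.5
```

For a categorical predictor, the same scoring is applied to divisions of the levels into two groups rather than to cut points.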
Example of the Partition Platform
In this example, you use the Partition platform to construct a decision tree that predicts the one-year disease progression (low or high) of patients with diabetes.
1. Select Help > Sample Data Library and open Diabetes.jmp.
2. Select Analyze > Predictive Modeling > Partition.
3. Select Y Binary and click Y, Response.
4. Select Age through Glucose and click X, Factor.
5. Enter 0.33 for the Validation Portion.
Note: In JMP Pro, you can use a validation column instead. Select the Validation column, click Validation, and set the Validation Portion to 0.
6. Click OK.
7. On the platform report window, click Go to perform automatic splitting.
Note: Because you are using a random Validation Portion, your results differ from those in Figure 5.2.
Figure 5.2 Partition Report for Diabetes
Automatic splitting resulted in four splits. The final RSquare for the Validation set is 0.154. The decision tree shows the four splits and the counts of observations in each split.
8. Click the red triangle next to Partition for Y Binary and select Column Contributions.
Figure 5.3 Column Contributions Report
The Column Contributions report shows that LTG and BMI are the only predictors in the decision tree model. Each column is used in two splits. Your results can differ. When the Validation Portion is used, the validation set is selected at random from the data table. If you redo your analysis, a new random validation set is selected and your results can differ from your first run.
9. Click the red triangle next to Partition for Y Binary and select Save Columns > Save Prediction Formula.
In the Diabetes.jmp data table, columns called Prob(Y Binary==Low), Prob(Y Binary==High), and Most Likely Y Binary are added. To see how these response probabilities are calculated, in the Columns panel, double-click the Formula icon next to each column.
Launch the Partition Platform
Launch the Partition platform by selecting Analyze > Predictive Modeling > Partition.
Figure 5.4 Partition Launch Window
Y, Response
The response variable or variables that you want to analyze.
X, Factor
The predictor variables.
Weight
A column whose numeric values assign a weight to each row in the analysis.
Freq
A column whose numeric values assign a frequency to each row in the analysis.
Validation
A numeric column that contains at most three distinct values. See “Validation”.
By
A column or columns whose levels define separate analyses. For each level of the column, the corresponding rows are analyzed using the other variables that you specify. The results appear in separate reports. If more than one By variable is assigned, a separate report is produced for each possible combination of the levels of the By variables.
Method
Enables you to select the partition method (Bootstrap Forest, Boosted Tree, K Nearest Neighbors, or Naive Bayes).
Validation Portion
The portion of the data to be used as the validation set. See “Validation”.
Informative Missing
If selected, enables missing value categorization for categorical predictors and informative treatment of missing values for continuous predictors. See “Informative Missing”.
Ordinal Restricts Order
If selected, restricts consideration of splits to those that preserve the ordering.
The Partition Report
The initial Partition report shows a partition plot, control buttons, a summary panel, and a decision tree. The partition plot and decision tree are initialized without any splits. The report details differ for categorical and continuous responses.
Control Buttons
Use the control buttons to interact with the decision tree.
Split
Creates a partition of the data using the optimal split. To specify multiple splits, hold the Shift key as you click Split.
Prune
Removes the most recent split.
Go
(Available when you are using validation.) Automatically adds splits to the decision tree until the validation statistic is optimized. See “Validation”. Without validation, you simply decide the number of splits to use in the partition model.
Color Points
For categorical responses, colors observations according to response level. These colors are added to the data table.
Report for Categorical Responses
The sample data table Diabetes.jmp was used to create a report for the categorical response Y Binary.
Figure 5.5 Partition Report for a Categorical Response
Partition Plot
Each point in the Partition Plot represents an observation in the data table. If validation is used, the plot is only for the training data. The initial partition plot does not show splits.
Notice the following:
The left vertical axis is the proportion of each response outcome.
The right vertical axis shows the order in which the response levels are plotted.
Horizontal lines divide each split by the response variable. The initial horizontal line shows the overall proportion of the first plotted response in the data set.
Splits are shown below the x-axis with a text description and a vertical line that splits the observations in the plot. The vertical lines extend into the plot and indicate the boundaries for each node. The most recent split appears directly below the horizontal axis and on top of existing splits. The plot is updated with each split or prune of the decision tree.
Summary Report
Figure 5.6 Summary Report for a Categorical Response
The Summary Report provides fit statistics for the training data and validation and test data (if used). The fit statistics in the Summary Panel update as you add splits or prune the decision tree.
RSquare
The current value of R2.
N
Number of observations.
Number of Splits
Current number of splits in the decision tree.
Node Reports
Each node in the tree has a report and a red triangle menu with additional options. Terminal nodes also have a Candidates report.
Figure 5.7 Terminal Node Report for a Categorical Response
Count
Number of training observations that are characterized by the node.
G2
A fit statistic used for categorical responses (instead of the sum of squares used for continuous responses). Lower values indicate a better fit. See “Statistical Details”.
Candidates
For each column, the Candidates report provides details about the optimal split for that column. The optimal split over all terms is marked with an asterisk.
Term
Shows the candidate columns.
Candidate G2
Likelihood ratio chi-square for the best split. Splitting on the predictor with the largest G2 maximizes the reduction in the model G2.
LogWorth
The LogWorth statistic, defined as -log10(p-value). The optimal split is the one that maximizes the LogWorth. See “Statistical Details” for additional details.
Cut Point
The value of the predictor that determines the split. For a categorical term, the levels in the left-most split are listed.
The optimal split is noted by an asterisk. However, there are cases where the Candidate G2 is higher for one variable, but the LogWorth is higher for a different variable. In such cases, > and < point in the best direction for each variable. The asterisk corresponds to the condition where they agree. See “Statistical Details” for details.
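The two candidate statistics can be sketched for a single binary split. The table below holds the counts of each response level on each side of a hypothetical split; the p-value here is the unadjusted chi-square tail probability with 1 degree of freedom, whereas JMP’s reported LogWorth uses an adjusted p-value (see “Statistical Details”).

```python
# Illustrative computation of the categorical split statistics: G2 (the
# likelihood-ratio chi-square) for a 2x2 split table, and
# LogWorth = -log10(p-value). Counts are invented for illustration.
import math

def g2(table):
    """Likelihood-ratio chi-square for a 2x2 table [[a, b], [c, d]]."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            if table[i][j] > 0:
                expected = row[i] * col[j] / n
                stat += 2 * table[i][j] * math.log(table[i][j] / expected)
    return stat

def logworth(g2_stat):
    """-log10 of the chi-square(1 df) tail probability, via erfc."""
    p = math.erfc(math.sqrt(g2_stat / 2))
    return -math.log10(p)

# Counts of (Low, High) responses on each side of a candidate split:
table = [[40, 10], [15, 35]]
stat = g2(table)        # large G2 -> strong separation of response levels
```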
Report for Continuous Responses
The sample data table Diabetes.jmp was used to create a report for the continuous response Y.
Figure 5.8 Partition Report for a Continuous Response
Partition Plot
The partition plot is initialized without any splits. Each point represents an observation in the data table. If validation is used, the plot is only for the training data.
Notice the following:
The vertical axis represents the response value of the observations.
Horizontal lines show the mean response value for each node of the decision tree. The initial horizontal line is at the overall mean of the response.
Vertical axis divisions represent splits in the decision tree. A text description of the most recent split appears below the horizontal axis. Observations are reorganized into their respective nodes as splits are created or removed.
Tip: To see tooltips for narrow partitions, place your cursor over the labels on the horizontal axis of the partition plot.
Summary Report
Figure 5.9 Summary Report for a Continuous Response
The Summary Report provides fit statistics for the training data and validation and test data (if used). The fit statistics in the Summary Panel update as you add splits or prune the decision tree.
RSquare
The current value of R2.
RMSE
The root mean square error.
N
The number of observations.
Number of Splits
The current number of splits in the decision tree.
AICc
The corrected Akaike’s Information Criterion. For more details, see the Statistical Details appendix in the Fitting Linear Models book.
Node Reports
Each node in the tree has a report and a red triangle menu with additional options. Terminal nodes also have a Candidates report.
Figure 5.10 Terminal Node Report for a Continuous Response
Count
The number of observations (rows) in the branch.
Mean
The average response for all observations in that branch.
Std Dev
The standard deviation of the response for all observations in that branch.
Candidates
For each column, the Candidates report provides details about the optimal split for that column. The optimal split over all columns is marked with an asterisk.
Term
Shows the candidate columns.
Candidate SS
Sum of squares for the best split.
LogWorth
The LogWorth statistic, defined as -log10(p-value). The optimal split is the one that maximizes the LogWorth. See “Statistical Details” for additional details.
Cut Point
The value of the predictor that determines the split. For a categorical term, the levels in the left-most split are listed.
The optimal split is noted by an asterisk. However, there are cases where the Candidate SS is higher for one variable, but the LogWorth is higher for a different variable. In such cases, > and < point in the best direction for each variable. The asterisk corresponds to the condition where they agree. See “Statistical Details” for details.
Partition Platform Options
The Partition red triangle menu options give you the ability to customize reports according to your needs. The available options are determined by the type of data that you use for your analysis.
Display Options
Contains options that show or hide report elements.
Show Points
Shows the points. For categorical responses, this option shows the points or colored panels.
Show Tree
Shows the large tree of partitions.
Show Graph
Shows the partition graph.
Show Split Bar
(Categorical responses only) Shows the colored bars that indicate the split proportions in each leaf.
Show Split Stats
Shows the split statistics. For more information about the categorical split statistic G2, see “Statistical Details”.
Show Split Prob
(Categorical responses only) Shows the Rate and Prob statistics in the node reports.
JMP automatically shows the Rate and Prob statistics when you select Show Split Count. For more information about Rate and Prob, see “Statistical Details”.
Show Split Count
(Categorical responses only) Shows frequency counts in the node reports. When you select this option, JMP automatically selects Show Split Prob. If you deselect Show Split Prob, the counts do not appear.
Show Split Candidates
Shows the Candidates report.
Sort Split Candidates
Sorts the Candidates reports by the split statistic or the LogWorth, whichever is appropriate.
Split Best
Splits the tree at the optimal split point. This is equivalent to clicking the Split button.
Prune Worst
Removes the terminal split that has the least discrimination ability. This is equivalent to clicking the Prune button.
Minimum Size Split
Define the minimum size split allowed by entering a number or a fractional portion of the total sample size. To specify a number, enter a value greater than or equal to 1. To specify a fraction of the sample size, enter a value less than 1. The default value is the maximum of 5 and the floor of the number of rows divided by 10,000.
Lock Columns
Interactively lock columns so that they are not considered for splitting. You can turn the display off or back on without affecting the individual locks.
Plot Actual by Predicted
(Continuous responses only) Shows a plot of actual values by predicted values. See “Actual by Predicted Plot”.
Small Tree View
Shows a small version of the partition tree to the right of the partition plot.
Tree 3D
Shows a 3-D plot of the tree structure. To access this option, hold down the Shift key and click the red triangle menu.
Leaf Report
Shows the mean and count or rates for the bottom-level leaves of the report.
Column Contributions
Shows a report indicating each input column’s contribution to the fit. The report also shows how many times it defined a split and the total G2 or Sum of Squares attributed to that column.
Split History
Shows a plot of RSquare versus the number of splits. If you use excluded row validation, holdback validation, or a validation column, separate curves are drawn for training and validation RSquare values. The RSquare curve is blue for the training set and red for the validation set. If you select K Fold Crossvalidation, the RSquare curve for all of the data is blue, and the curve for the crossvalidation RSquare is green.
K Fold Crossvalidation
Shows a Crossvalidation report that gives fit statistics for both the training and folded sets. For more information about validation, see “K-Fold Crossvalidation”.
ROC Curve
(Categorical responses only) Receiver Operating Characteristic (ROC) curves display the efficiency of a model’s fitted probabilities to sort the response levels. See “ROC Curve”.
Lift Curve
(Categorical responses only) Lift curves display the predictive ability of a partition model. See “Lift Curve”.
Show Fit Details
(Appears only for categorical responses.) The Fit Details report shows several measures of fit and provides a Confusion Matrix report. See “Show Fit Details”.
Save Columns
Contains options for saving model and tree results, and creating SAS code.
Save Residuals
Saves the residual values from the model to the data table.
Save Predicteds
Saves the predicted values from the model to the data table.
Save Leaf Numbers
Saves the leaf numbers of the tree to a column in the data table.
Save Leaf Labels
Saves leaf labels of the tree to the data table. The labels document each branch that the row would trace along the tree. Each branch is separated by “&”. An example label might be: “size(Small,Medium)&size(Small)”. However, JMP does not include redundant information in the form of category labels that are repeated. A category label for a leaf might refer to an inclusive list of categories in a higher tree node. A caret (“^”) appears where the tree node with redundant labels occurs. Therefore, “size(Small,Medium)&size(Small)” is presented as “^&size(Small)”.
Save Prediction Formula
Saves the prediction formula to a column in the data table. The formula consists of nested conditional clauses that describe the tree structure. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property.
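The nested conditional structure of a saved prediction formula can be illustrated with a hypothetical two-split tree on LTG and BMI, loosely modeled on the Diabetes example. The cut points and leaf probabilities below are invented for illustration, not fitted values.

```python
# A saved prediction formula is nested conditional clauses that mirror the
# tree. This hypothetical tree returns Prob(Y Binary == High); all numbers
# are made up for illustration.

def prob_high(ltg, bmi):
    if ltg < 4.65:
        if bmi < 29:
            return 0.10     # leaf: low LTG, low BMI
        return 0.35         # leaf: low LTG, high BMI
    if bmi < 27:
        return 0.55         # leaf: high LTG, low BMI
    return 0.85             # leaf: high LTG, high BMI

p = prob_high(ltg=5.0, bmi=30.0)   # falls in the high-LTG, high-BMI leaf
```

Every observation that reaches the same leaf receives the same fitted probability, which is why a tree with n terminal nodes produces at most n distinct predicted values.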
Save Tolerant Prediction Formula
Saves a formula that predicts even when there are missing values and when Informative Missing has not been checked. The prediction formula tolerates missing values by randomly allocating response values for missing predictors to a split. If the response is continuous, the column contains a Predicting property. If the response is categorical, the column contains a Response Probability property. If you have checked Informative Missing, you can save the Tolerant Prediction Formula by holding the Shift key as you click the report’s red triangle.
Save Leaf Number Formula
Saves a column containing a formula in the data table that computes the leaf number.
Save Leaf Label Formula
Saves a column containing a formula in the data table that computes the leaf label.
Make SAS DATA Step
Creates SAS code for scoring a new data set.
Publish Prediction Formula
Creates a prediction formula and saves it as a formula column script in the Formula Depot platform. If a Formula Depot report is not open, this option creates a Formula Depot report. See the “Formula Depot” chapter.
Publish Tolerant Prediction Formula
Creates a tolerant prediction formula and saves it as a formula column script in the Formula Depot platform. If a Formula Depot report is not open, this option creates a Formula Depot report. See the “Formula Depot” chapter. If you have checked Informative Missing, you can use this option by holding the Shift key as you click on the report’s red triangle.
Specify Profit Matrix
(Available only for categorical responses.) Enables you to specify profits or costs associated with correct or incorrect classification decisions. For a nominal response, you can specify the profit matrix entries using a probability threshold. See “Show Fit Details”.
Profiler
Shows an interactive profiler report. Changes in the factor values are reflected in the estimated classification probabilities. See the Profiler chapter in the Profilers book.
Color Points
(Categorical responses only) Colors points based on their response level. This is equivalent to clicking the Color Points button. See “Report for Categorical Responses”.
See the JMP Reports chapter in the Using JMP book for more information about the following options:
Redo
Contains options that enable you to repeat or relaunch the analysis. In platforms that support the feature, the Automatic Recalc option immediately reflects the changes that you make to the data table in the corresponding report window.
Save Script
Contains options that enable you to save a script that reproduces the report to several destinations.
Save By-Group Script
Contains options that enable you to save a script that reproduces the platform report for all levels of a By variable to several destinations. Available only when a By variable is specified in the launch window.
Show Fit Details
Figure 5.11 Fit Details for Categorical Response (Y Binary from Diabetes.jmp)
Entropy RSquare
Compares the log-likelihoods from the fitted model and the constant probability model. Values closer to 1 indicate a better fit.
Generalized RSquare
A measure that can be applied to general regression models. It is based on the likelihood function L and is scaled to have a maximum value of 1. The Generalized RSquare measure simplifies to the traditional RSquare for continuous normal responses in the standard least squares setting. Generalized RSquare is also known as the Nagelkerke or Craig and Uhler R2, which is a normalized version of Cox and Snell’s pseudo R2. See Nagelkerke (1991). Values closer to 1 indicate a better fit.
Mean -Log p
The average of -log(p), where p is the fitted probability associated with the event that occurred. Smaller values indicate a better fit.
RMSE
The root mean square error, where the differences are between the response and p (the fitted probability for the event that actually occurred). Smaller values indicate a better fit.
Mean Abs Dev
The average of the absolute values of the differences between the response and p (the fitted probability for the event that actually occurred). Smaller values indicate a better fit.
Misclassification Rate
The rate for which the response category with the highest fitted probability is not the observed category. Smaller values indicate a better fit.
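Several of these measures can be computed directly from the fitted probability of the event that actually occurred. The sketch below assumes a binary response and a list p_event where p_event[i] is the fitted probability of row i’s observed level; Entropy RSquare is omitted because it additionally requires the constant-model log-likelihood.

```python
# Sketch of selected Fit Details measures for a binary response.
import math

def fit_details(p_event):
    """p_event: fitted probability of each row's observed response level."""
    n = len(p_event)
    mean_neg_log_p = sum(-math.log(p) for p in p_event) / n
    rmse = math.sqrt(sum((1 - p) ** 2 for p in p_event) / n)
    mean_abs_dev = sum(1 - p for p in p_event) / n
    # Binary case: the observed level loses whenever its probability < 0.5.
    misclassification = sum(1 for p in p_event if p < 0.5) / n
    return mean_neg_log_p, rmse, mean_abs_dev, misclassification

stats = fit_details([0.9, 0.8, 0.6, 0.3])   # one of four rows misclassified
```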
The Confusion Matrix report shows matrices for the training set and for the validation and test sets (if defined). The Confusion Matrix is a two-way classification of actual and predicted responses.
If the response has a Profit Matrix column property, or if you specify costs using the Specify Profit Matrix option, then a Decision Matrix report appears. See “Decision Matrix Report”.
Specify Profit Matrix
A profit matrix can be used with categorical responses. A profit matrix is used to assign costs to undesirable outcomes and profits to desirable outcomes.
Figure 5.12 Specify Profit Matrix Window
You can assign profit and cost values to each combination of actual and predicted response categories. To specify the costs of classifying into an alternative category, enter values in the Undecided column. To save your assignments to the response column as a property, check Save to column as property. Leaving this option unchecked applies the Profit Matrix only to the current Partition report.
Probability Threshold Specification for Profit Matrix
When the response is binary, instead of entering weights into the profit matrix, you can specify a probability threshold in the Profit Matrix window. For details about how values are calculated for the profit matrix, see The Column Info Window chapter in the Using JMP book.
Target
The level whose probability is modeled.
Probability Threshold
A threshold for the probability of the target level. If the probability that an observation falls into the target level exceeds the probability threshold, the observation is classified into that level.
When you define costs using the Specify Profit Matrix option and then select Show Fit Details, a Decision Matrix report appears. See “Decision Matrix Report”.
When you specify a profit matrix and save the model prediction formula, the formula columns saved to the data table include the following:
Profit for <level>: For each level of the response, a column gives the expected profit for classifying each observation into that level.
Most Profitable Prediction for <column name>: For each observation, gives the level of the response with the highest expected profit.
Expected Profit for <column name>: For each observation, gives the expected profit for the classification defined by the Most Profitable Prediction column.
Actual Profit for <column name>: For each observation, gives the actual profit for classifying that observation into the level specified by the Most Profitable Prediction column.
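The expected-profit columns combine the fitted probabilities with the profit matrix: for each possible decision, the expected profit is the probability-weighted sum of the payoffs over the actual levels. The matrix values below are invented for illustration.

```python
# Sketch of the saved profit columns. profit[actual][decision] is the
# payoff for deciding `decision` when the true level is `actual`; the
# values and levels here are hypothetical.

profit = {"Low":  {"Low": 1.0,  "High": -1.0},
          "High": {"Low": -5.0, "High": 2.0}}

def expected_profits(probs):
    """probs: fitted probabilities by level, e.g. {'Low': 0.3, 'High': 0.7}."""
    return {decision: sum(probs[actual] * profit[actual][decision]
                          for actual in probs)
            for decision in ("Low", "High")}

ep = expected_profits({"Low": 0.3, "High": 0.7})
best = max(ep, key=ep.get)          # the Most Profitable Prediction
```

Here the large penalty for deciding Low when the truth is High pushes the most profitable prediction toward High even though the probabilities alone would already favor it.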
Decision Matrix Report
Figure 5.13 Fit Details Report with Decision Matrix Report
Note: This report is available only if the response has a Profit Matrix column property or if you specify costs using the Specify Profit Matrix option. The report is part of the Fit Details report.
When a profit matrix is defined, the partition algorithm uses the values in the matrix to calculate the profit for each decision. When you select Show Fit Details, a Decision Matrix report appears.
In the Decision Matrix report, the decision counts reflect the most profitable prediction decisions based on the weighting in the profit matrix. The report gives Decision Count and Decision Rate matrices for the training set and for validation and test sets (if defined). For reference, the profit matrix is also shown.
Note: If you change the weights in your Profit Matrix using the Specify Profit Matrix option, the Decision Matrix report automatically updates to reflect your changes.
Decision Count Matrix
Shows a two-way classification with actual responses in rows and classification counts in columns.
Specified Profit Matrix
Gives the weights that define the Profit Matrix.
Decision Rate Matrix
Shows rate values corresponding to the proportion of a given row’s observations that are classified into each category. If all observations are correctly classified, the rates on the diagonal are all equal to one.
Tip: You can obtain decision rate matrices for a response using the default profit matrix, with values of 1 and -1. Select Specify Profit Matrix from the red triangle menu, make no changes to the default values, and click OK.
The matrices are arranged in two rows:
The Decision Count matrices are in the first row.
The Specified Profit Matrix is to the right in the first row.
The Decision Rate matrices are in the second row.
Informative Missing
The Informative Missing option enables informative treatment of missing values on the predictors. The model that is fit is deterministic. The Informative Missing option appears in the launch window and is selected by default. When Informative Missing is selected, missing values are handled as follows:
Rows containing missing values for a categorical predictor are entered into the analysis as a separate level of the variable.
Rows containing missing values for a continuous predictor are assigned to a split as follows: The values of the continuous predictor are sorted. Missing rows are first considered to be on the low end of the sorted values. All splits are constructed. The missing rows are then considered to be on the high end of the sorted values. Again, all splits are constructed. The optimal split is determined using the LogWorth criterion. For further splits on the given predictor, the algorithm commits the missing rows to high or low values, as determined by the first split induced by that predictor.
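The two-pass idea for a continuous predictor can be sketched as follows. This is an illustration of the approach, not JMP’s implementation: JMP scores splits with the LogWorth criterion, while plain SS reduction is used here to keep the sketch short.

```python
# Sketch of informative missing for a continuous predictor: try the
# missing rows on the low side of every cut, then on the high side, and
# keep whichever assignment yields the better split.

def ss(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split_with_missing(x, y):
    """x may contain None; returns (cut, missing_side, ss_reduction)."""
    present = sorted((xi, yi) for xi, yi in zip(x, y) if xi is not None)
    missing_y = [yi for xi, yi in zip(x, y) if xi is None]
    total, best = ss(y), (None, None, 0.0)
    for side in ("low", "high"):
        for i in range(1, len(present)):
            if present[i][0] == present[i - 1][0]:
                continue
            left = [v for _, v in present[:i]]
            right = [v for _, v in present[i:]]
            (left if side == "low" else right).extend(missing_y)
            reduction = total - ss(left) - ss(right)
            if reduction > best[2]:
                cut = (present[i - 1][0] + present[i][0]) / 2
                best = (cut, side, reduction)
    return best

# The missing row's response (19.0) resembles the high-x group, so the
# missing rows are committed to the high side of the cut.
x = [1, 2, 3, None, 10, 11, 12]
y = [5.0, 6.0, 5.5, 19.0, 20.0, 21.0, 19.5]
cut, side, gain = best_split_with_missing(x, y)
```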
If the Informative Missing option is not selected, the missing values are handled as follows:
When a predictor with missing values is used as a splitting variable, each row with a missing value on that predictor is randomly assigned to one of the two sides of the split.
The first time a predictor with missing values is used as a splitting variable, an Imputes column is added to the Summary Report showing the number of imputations. As additional imputations are made, the Imputes column updates. See Figure 5.14, where five imputations were performed.
Note: The number of Imputes can be greater than the number of rows that contain missing values. The imputation occurs at each split. A row with missing values can be randomly assigned multiple times. Each time a row is randomly assigned it increments the imputation count.
Figure 5.14 Impute Message in Summary Report
Actual by Predicted Plot
For continuous responses, the Actual by Predicted plot is the typical plot of the actual response versus the predicted response. When you fit a Decision Tree, all observations in a leaf have the same predicted value. If there are n leaves, then the Actual by Predicted plot shows at most n distinct predicted values. The actual values form a scatter of points around each leaf mean on n vertical lines.
The diagonal line is the Y = X line. For a perfect fit, all the points would be on this diagonal. When validation is used, plots are shown for both the training and the validation sets. See Figure 5.15.
Figure 5.15 Actual by Predicted Plots for a Continuous Response
ROC Curve
The ROC Curve option is available only for categorical responses. Receiver Operating Characteristic (ROC) curves display the efficiency of a model’s fitted probabilities in sorting the response levels. An introduction to ROC curves is found in the Logistic Analysis chapter in the Basic Analysis book.
The predicted response for each observation in a partition model is a value between 0 and 1. To use the predicted response to classify observations as positive or negative, a cut point is used. For example, if the cut point is 0.5, an observation with a predicted response at or above 0.5 would be classified as positive, and an observation below 0.5 as negative. There are trade-offs in classification as the cut point is varied.
To generate a ROC curve, each predicted response level is considered as a possible cut point and the following values are computed for each possible cut point:
The sensitivity is the true positive rate: the proportion of positive observations with a predicted response greater than the cut point.
The specificity is the true negative rate: the proportion of negative observations with a predicted response less than the cut point.
The ROC curve plots sensitivity against 1 - specificity. A partition model with n splits has n+1 predicted values. The ROC curve for the partition model has n+1 line segments.
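The construction above can be sketched directly. The example below uses invented (probability, is_positive) pairs; a strict treatment would advance the curve over groups of observations tied on the same predicted value, which is simplified here.

```python
# Sketch of ROC curve construction from fitted probabilities. Each point
# is (1 - specificity, sensitivity) at one candidate cut point.

def roc_points(scored):
    """scored: (prob, is_positive) pairs. Returns the ROC polyline."""
    pos = sum(1 for _, y in scored if y)
    neg = len(scored) - pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    # Lowering the cut point admits observations in descending prob order.
    for p, y in sorted(scored, reverse=True):
        tp, fp = tp + y, fp + (1 - y)
        points.append((fp / neg, tp / pos))
    return points

scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 1), (0.3, 0), (0.2, 0)]
pts = roc_points(scored)        # starts at (0, 0), ends at (1, 1)
```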
If your response has more than two levels, the Partition report contains a separate ROC curve for each response level versus the other levels. Each curve is the representation of a level as the positive response level. If there are only two levels, one curve is the reflection of the other.
Figure 5.16 ROC Curves for a Three Level Response
If the model perfectly rank-orders the response values, then the sorted data contains all of the positive values first, followed by all of the other values. In this situation, the curve moves all the way to the top before it moves at all to the right. If the model does not predict well, the curve follows the diagonal line from the bottom left to top right of the plot.
In practice, the ROC curve lies above the diagonal. The area under the curve is the indicator of the goodness of fit for the model. A value of 1 indicates a perfect fit and a value near 0.5 indicates that the model cannot discriminate among groups.
When your response has more than two levels, the ROC curve plot enables you to see which response categories have the largest area under the curve.
Lift Curve
The Lift Curve option provides another plot to display the predictive ability of a partition model. The lift curve plots the lift versus the portion of the observations. Each predicted response level defines a portion of the observations that are greater than or equal to that predicted response. The lift value is the ratio of the proportion of positive responses in that portion to the overall proportion of positive responses.
Figure 5.17 Lift Curve
Figure 5.18 Lift Table for Lift Curve
Figure 5.18 provides a table of values to demonstrate the calculation of Lift and Portion used for the High lift curve shown in Figure 5.17. A partition model with five splits was built to predict the response, Y Binary. Y Binary has two levels: Low and High. The lift curve is based on 309 observations. There are 83 High responses for an overall rate of 0.27.
Prob High: The five predicted values from the partition model for the High response level.
N > Prob High: The number of observations that have a predicted value equal to or greater than the value in Prob High.
Portion: N > Prob High divided by 309.
N High in Portion: The number of High responses in the portion.
Portion High: N High in Portion divided by N > Prob High.
Lift: Portion High divided by 0.27.
Lift measures how many High responses fall in each portion as compared to the expected number of High responses for that portion. For the first 6% of the data set, the lift is 3.72. Using the model to select the 6% of the observations with the highest predicted values results in 3.72 times as many High responses as selecting that 6% at random.
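The Portion, Portion High, and Lift calculations described above can be sketched as follows. This is an illustrative Python sketch with made-up data, not JMP code and not the table in the figure:

```python
# Hedged sketch of the lift calculation: for each distinct predicted
# probability, take the portion of observations at or above that value
# and compare its positive rate to the overall positive rate.

def lift_table(probs, actual, positive):
    overall = sum(1 for y in actual if y == positive) / len(actual)
    rows = []
    for cut in sorted(set(probs), reverse=True):
        in_portion = [y for p, y in zip(probs, actual) if p >= cut]
        portion = len(in_portion) / len(actual)            # Portion
        portion_pos = (sum(1 for y in in_portion if y == positive)
                       / len(in_portion))                  # Portion High
        rows.append((cut, portion, portion_pos / overall)) # Lift
    return rows

# Tiny made-up example: 10 observations, 4 positives (overall rate 0.4).
probs  = [0.9, 0.9, 0.6, 0.6, 0.6, 0.2, 0.2, 0.2, 0.2, 0.2]
actual = ["High", "High", "High", "Low", "Low",
          "Low", "High", "Low", "Low", "Low"]
for cut, portion, lift in lift_table(probs, actual, "High"):
    print(cut, portion, round(lift, 2))
```

Selecting the top 20% of observations here (both of which are High) yields a lift of 1.0 / 0.4 = 2.5; the lift for the whole data set is always 1.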
Node Options
This section describes the options on the red triangle menu for each node.
Split Best
Finds and executes the best split at or below this node.
Split Here
Splits at the selected node on the best column to split by.
Split Specific
Lets you specify where a split takes place. This is useful in showing what the criterion is as a function of the cut point, as well as in determining custom cut points. When specifying a splitting column, you can choose the following options for how the split is performed:
Optimal Value
Splits at the optimal value of the selected variable.
Specified Value
Enables you to specify the level where the split takes place.
Output Split Table
Produces a data table showing all possible splits and their associated split value.
Prune Below
Eliminates the splits below the selected node.
Prune Worst
Finds and removes the worst split below the selected node.
Select Rows
Selects the data table rows corresponding to this leaf. You can extend the selection by holding down the Shift key and choosing this command from another node.
Show Details
Produces a data table that shows the split criterion for a selected variable. The data table, composed of split intervals and their associated criterion values, has an attached script that produces a graph for the criterion.
Lock
Prevents a node or its subnodes from being chosen for a split. When checked, a lock icon appears in the node title.
Validation
If you build a tree with enough splits, partitioning can overfit data. When this happens, the model predicts the data used to build the model very well, but predicts future observations poorly. Validation is the process of using part of a data set to estimate model parameters, and using the other part to assess the predictive ability of the model.
The training set is the part that is used to estimate model parameters.
The validation set is the part that assesses or validates the predictive ability of the model.
The test set is a final, independent assessment of the model’s predictive ability. The test set is available only when using a validation column. See “Launch the Partition Platform”.
When a validation method is used, the Go button appears. The Go button provides for repeated splitting without having to repeatedly click the Split button. When you click the Go button, splitting occurs until the validation R-Square is better than what the next 10 splits would obtain. This rule can result in complex trees that are not very interpretable, but have good predictive power.
Using the Go button turns on the Split History command. If using the Go button results in a tree with more than 40 nodes, the Show Tree command is turned off.
The training, validation, and test sets are created by subsetting the original data into parts. Select one of the following methods to subset a data set:
Excluded Rows
Uses row states to subset the data. Rows that are unexcluded are used as the training set, and excluded rows are used as the validation set.
For more information about using row states and how to exclude rows, see the Enter and Edit Data chapter in the Using JMP book.
Holdback
Randomly divides the original data into the training and validation data sets. The Validation Portion on the platform launch window is used to specify the proportion of the original data to use as the validation data set (holdback). See “Launch the Partition Platform” for details about the Validation Portion.
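The idea behind a random holdback can be sketched as follows. This is an illustration under a validation portion of 0.3, not JMP's internal sampling:

```python
# Minimal sketch of a random holdback split: shuffle the row indices and
# reserve the requested portion as the validation set.
import random

def holdback_split(n_rows, portion, seed=1):
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    n_valid = round(n_rows * portion)
    return rows[n_valid:], rows[:n_valid]   # training, validation

train, valid = holdback_split(100, 0.3)
print(len(train), len(valid))  # 70 30
```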
KFold Crossvalidation
Randomly divides the original data into K subsets. In turn, each of the K sets is used to validate the model fit on the rest of the data, fitting a total of K models. The final model is selected based on the cross validation RSquare, where a stopping rule is imposed to avoid overfitting the model. This method is useful for small data sets, because it makes efficient use of limited amounts of data. See “K-Fold Crossvalidation”.
Validation Column
Uses a column’s values to divide the data into subsets. A validation column must contain at most three numeric values. The column is assigned using the Validation role on the Partition launch window. See “Launch the Partition Platform”.
The column’s values determine how the data is split:
If the validation column has two levels, the smaller value defines the training set and the larger value defines the validation set.
If the validation column has three levels, the values, in order of increasing size, define the training, validation, and test sets.
If you click the Validation button with no columns selected in the Select Columns list, you can add a validation column to your data table. For more information about the Make Validation Column utility, see “Make Validation Column Utility”.
K-Fold Crossvalidation
In K-Fold cross validation, the entire set of observations is partitioned into K subsets, called folds. Each fold is treated as a holdback sample with the remaining observations as a training set.
Unconstrained optimization of the crossvalidation RSquare value tends to overfit models. To address this tendency, the KFold crossvalidation stopping rule terminates stepping when improvement in the crossvalidation RSquare is minimal. Specifically, the stopping rule selects a model for which none of the next ten models have a crossvalidation RSquare showing an improvement of more than 0.005 units.
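The stopping rule described above can be sketched as follows. This is an illustrative Python sketch with made-up RSquare values, not JMP code:

```python
# Hedged sketch of the stopping rule: select the first model (split
# count) such that none of the next ten models improves the
# crossvalidation RSquare by more than 0.005.

def select_model(cv_rsquares, lookahead=10, tol=0.005):
    for i, r in enumerate(cv_rsquares):
        window = cv_rsquares[i + 1 : i + 1 + lookahead]
        if all(r2 - r <= tol for r2 in window):
            return i  # index of the selected model
    return len(cv_rsquares) - 1

# Made-up crossvalidation RSquare values by number of splits.
rsq = [0.10, 0.25, 0.33, 0.36, 0.362, 0.361, 0.363, 0.360,
       0.358, 0.362, 0.364, 0.361, 0.360, 0.363, 0.362]
print(select_model(rsq))  # 3
```

After the fourth model (index 3), no later model in the ten-model window gains more than 0.005 units of RSquare, so splitting stops there.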
When you select the K Fold Crossvalidation option, a Crossvalidation report appears. The results in this report update as you split the decision tree. Or, if you click Go, the outline shows the results for the final model.
Crossvalidation Report
The Crossvalidation report shows the following:
k-fold
Number of folds.
-2LogLike or SSE
Gives twice the negative log-likelihood (-2LogLikelihood) values when the response is categorical. Gives sum of squared errors (SSE) when the response is continuous. The first row gives results averaged over the folds. The second row gives results for the single model fit to all observations.
RSquare
The first row gives the RSquare value averaged over the folds. The second row gives the RSquare value for the single model fit to all observations.
Additional Examples of Partitioning
The following examples illustrate a continuous response, missing data in the predictors, and the use of the profit matrix.
Example of a Continuous Response
In this example, you use the Partition platform to construct a decision tree that predicts the one-year disease progression measured on a quantitative scale for patients with diabetes.
1. Select Help > Sample Data Library and open Diabetes.jmp.
2. Select Analyze > Predictive Modeling > Partition.
3. Select Y and click Y, Response.
4. Select Age through Glucose and click X, Factor.
5. Select a validation procedure based on your JMP installation:
For JMP Pro, select Validation and click Validation.
For JMP, enter 0.3 as the Validation Proportion.
The completed launch window for JMP users is shown in Figure 5.19.
Note: Results using the validation proportion can differ from those shown here due to the random selection of validation rows.
Figure 5.19 Completed Launch Window with Validation Portion = 0.3
6. Click OK.
7. On the platform report window, click Split once to perform a split.
Figure 5.20 Report after First Split with Decision Tree Hidden
The original 309 values in the training data set are now split into two parts:
The left leaf, corresponding to LTG < 4.6444, has 165 observations.
The right leaf, corresponding to LTG >= 4.6444, has 144 observations.
For both the right and left leaves, the next split would be on BMI. The Candidate SS for BMI on the right leaf is higher than the Candidate SS for BMI on the left leaf. Thus, the next split is on the right leaf.
8. Click Go to use automatic splitting.
Figure 5.21 Report after Automatic Splitting with Validation
The solution found has four splits. The Split History plot shows that there is no further improvement in the validation data set after four splits. The RSquare value of 0.39 on the validation data does not support this model as a strong predictor of disease progression. The scatter across partitions in the partition plot further indicates that this model does not separate the Y values well.
Example of Informative Missing
In this example, you construct a decision tree model to predict if a customer is a credit risk. Since your data set contains missing values, you also explore the effectiveness of the Informative Missing option.
Launch the Partition Platform
1. Select Help > Sample Data Library and open Equity.jmp.
2. Select Analyze > Predictive Modeling > Partition.
3. Select BAD and click Y, Response.
4. Select LOAN through DEBTINC and click X, Factor.
5. Click OK.
Create the Decision Tree and ROC Curve with Informative Missing
1. Hold down the Shift key and click Split.
2. Enter 5 for the number of splits and click OK.
3. Click the red triangle next to Partition for BAD and select ROC Curve.
4. Click the red triangle next to Partition for BAD and select Save Columns > Save Prediction Formula.
The columns Prob(BAD==Good Risk) and Prob(BAD==Bad Risk) contain the prediction formulas from the model fit with the Informative Missing option. These formulas classify the credit risk of future loan applicants. You are interested in how this model performs in comparison to a model that does not use informative missing.
Create the Decision Tree and ROC Curve without Informative Missing
1. Click the red triangle next to Partition for BAD and select Redo > Relaunch Analysis.
2. Deselect Informative Missing and click OK.
3. Hold down the Shift key and click Split, enter 5 for the number of splits, and click OK.
4. Click the red triangle next to Partition for BAD and select Save Columns > Save Prediction Formula.
The columns Prob(BAD==Good Risk) 2 and Prob(BAD==Bad Risk) 2 contain the formulas from the model that does not use the Informative Missing option.
Compare the ROC Curves
Visually compare the ROC curves from the two models. The model at left is with Informative Missing, and the model at right is without Informative Missing.
Figure 5.22 ROC Curves for Models with (Left) and without (Right) Informative Missing
The area under the curve (AUC) for the model with informative missing (0.8695) is higher than the AUC for the model without informative missing (0.7283). Because there are only two levels for the response, the ROC curves for each model are reflections of one another and the AUCs are equal.
Note: Your AUC can differ from that shown for the model without informative missing. When informative missing is not used, the assignment of missing rows to sides of a split is random. Rerunning the analysis can result in slight differences in results.
Use the Model Comparison Platform
Next, use the Model Comparison platform to compare the two sets of prediction formulas that you saved.
1. Select Analyze > Predictive Modeling > Model Comparison.
2. Select Prob(BAD==Good Risk), Prob(BAD==Bad Risk), Prob(BAD==Good Risk) 2, and Prob(BAD==Bad Risk) 2 and click Y, Predictors.
The first pair of formula columns contain the formulas from the model with informative missing. The second pair of formula columns contain the formulas from the model without informative missing.
3. Click OK.
Figure 5.23 Measures of Fit from Model Comparison
The Measures of Fit report shows that the first model, which was fit with informative missing, performs better than the second model, which was not. The first model has higher RSquare values as well as a lower RMSE value and a lower Misclassification Rate. These findings align with the ROC curve comparison.
Note: Again, your results can differ due to the random differences when Informative Missing is not used.
Example of Profit Matrix and Decision Matrix Report
For this example, consider a study of patients who have liver cancer. Based on various measurements and markers, you want to classify patients according to their disease severity (high or low). There are two errors that one can make in classification of patients: classifying a subject who has high severity into the low group, or classifying a patient with low severity into the high group. Clinically, the misclassification of a high patient as low is a costly error, as that patient might not receive the aggressive treatment needed. Classifying a patient with low severity into the high severity group is a less costly error. That patient might receive the more aggressive treatment than needed, but this is not a major concern.
In the following example, you define a profit matrix in the context of a liver cancer study and obtain a Decision Matrix report. The Decision Matrix report helps you assess your classification rates relative to the costs in your profit matrix.
1. Select Help > Sample Data Library and open Liver Cancer.jmp.
2. Select Analyze > Predictive Modeling > Partition.
3. Select Severity and click Y, Response.
4. Select BMI through Jaundice and click X, Factor.
5. Select a validation procedure based on your JMP installation:
For JMP Pro, select Validation and click Validation.
For JMP, enter 0.3 as the Validation Proportion.
Note: Results using the validation proportion can differ from those shown here, due to the random selection of validation rows.
Figure 5.24 Completed Launch Window with Validation Portion = 0.3
6. Click OK.
7. Hold down the Shift key and click Split.
8. Enter 10 for the number of splits and click OK.
Check that the Number of Splits is 10 in the panel beneath the plot.
9. Click the red triangle next to Partition for Severity and select Specify Profit Matrix.
10. Change the entries as follows:
Enter 1 in the High, High box.
Enter -5 in the High, Low box.
Enter -3 in the Low, High box.
Enter 1 in the Low, Low box.
Figure 5.25 Completed Profit Matrix
Tip: You can save this profit matrix as a column property for use in later analyses. Select the check box “Save to column as property” at the bottom of the profit matrix window.
Note the following:
Each value of 1 reflects your profit when you make a correct decision.
The -3 value indicates that if you classify a Low severity patient as High severity, your loss is 3 times as much as the profit of a correct decision.
The -5 value indicates that if you classify a High severity patient as Low severity, your loss is 5 times as much as the profit of a correct decision.
11. Click OK.
12. Click the red triangle next to Partition for Severity and select Show Fit Details.
Figure 5.26 Confusion Matrix and Decision Matrix Reports
The Confusion Matrix and Decision Matrix reports follow the list of Measures in the Fit Details report. Notice that the Confusion Matrix report and the confusion matrices in the Decision Matrix report show different counts. This is because the weighting in the profit matrix results in different decisions than do the predicted probabilities without weighting.
The Confusion Matrix for the validation set shows classifications based on predicted probabilities alone. Based on these, 11 High severity patients would be classified as Low severity and 5 Low severity patients would be classified as High severity.
The Decision Matrix report incorporates the profit matrix weights. Using those weights, only 6 High severity patients are classified as Low severity. However, this comes at the expense of misclassifying 6 Low severity patients into the High severity group (1 additional patient).
13. Click the red triangle next to Partition for Severity and select Save Columns > Save Prediction Formula.
Eight columns are added to the data table.
Tip: To quickly return to the data table, click the View Associated Data icon in the bottom right corner of the report window.
The first three columns involve only the predicted probabilities. The confusion matrix counts are based on the Most Likely Severity column, which classifies a patient into the level with the highest predicted probability. These probabilities are given in the Prob(Severity == High) and Prob(Severity == Low) columns.
The last five columns involve the profit matrix weighting. The column called Most Profitable Prediction for Severity contains the decision based on the profit matrix. The decision for a patient is the level that results in the largest profit. The profits are given in the Profit for High and Profit for Low columns.
Statistical Details
This section provides quantitative details and additional information.
Responses and Factors
The response can be either continuous or categorical (nominal or ordinal):
If the response is categorical, then the platform fits the probabilities estimated for the response levels, minimizing the residual log-likelihood chi-square (twice the entropy).
If the response is continuous, then the platform fits means, minimizing the sum of squared errors.
The factors can be either continuous or categorical (nominal or ordinal):
If the factor is continuous, then the partition is done according to a splitting “cut” value for the factor.
If the factor is categorical, then it divides the X categories into two groups of levels and considers all possible groupings into two levels.
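The enumeration of candidate two-group splits for a categorical factor can be sketched as follows. This is illustrative Python, not JMP code:

```python
# Illustrative sketch: enumerate all ways to divide the levels of a
# categorical factor into two nonempty groups (the candidate splits).
from itertools import combinations

def candidate_splits(levels):
    # Fix the first level on the left side so that each split is counted
    # once; there are 2**(k-1) - 1 distinct splits for k levels.
    splits = []
    rest = levels[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = [levels[0], *combo]
            right = [l for l in levels if l not in left]
            if right:
                splits.append((left, right))
    return splits

print(len(candidate_splits(["A", "B", "C"])))  # 3
for left, right in candidate_splits(["A", "B", "C"]):
    print(left, right)
```

For three levels there are 2^2 - 1 = 3 candidate splits; the count doubles (plus one) with each additional level, which is why the multiplicity adjustment to the p-value matters.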
Splitting Criterion
Node splitting is based on the LogWorth statistic, which is reported in the Candidate reports for nodes. LogWorth is calculated as follows:
LogWorth = -log10(adjusted p-value)
where the adjusted p-value is calculated in a complex manner that takes into account the number of different ways splits can occur. The adjusted p-value is fairer than the unadjusted p-value, which favors Xs with many levels, and than the Bonferroni p-value, which favors Xs with few levels. Details about the method are discussed in Sall (2002).
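As a small numeric illustration of the LogWorth transform itself (the multiplicity adjustment that produces the p-value is not shown here):

```python
# LogWorth = -log10(p-value): smaller p-values map to larger LogWorth,
# so better splits get higher scores on an unbounded positive scale.
import math

def logworth(p_value):
    return -math.log10(p_value)

print(logworth(0.01))  # 2.0
print(logworth(1e-6))  # 6.0
```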
For continuous responses, the Sum of Squares (SS) is reported in node reports. This is the change in the error sum-of-squares due to the split.
The SS for a candidate split that has been chosen is:
SStest = SSparent - (SSright + SSleft)
where the SS in a node is s²(n - 1).
Also reported for continuous responses is the Difference statistic. This is the difference between the predicted values for the two child nodes of a parent node.
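The candidate SS calculation can be illustrated with a small sketch. The data here are hypothetical, and this is not JMP code:

```python
# Hedged sketch of the candidate SS for a continuous response: the SS in
# a node is s^2 * (n - 1), i.e. the sum of squared deviations from the
# node mean, and the split's value is the resulting drop in error SS.

def node_ss(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)  # s^2 * (n - 1)

parent = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
left, right = parent[:3], parent[3:]           # a candidate cut at 6.5
ss_test = node_ss(parent) - (node_ss(left) + node_ss(right))
print(ss_test)  # 121.5
```

The Difference statistic for this split would be the gap between the two child means, 11 - 2 = 9.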
For categorical responses, the G² (likelihood-ratio chi-square) statistic appears in the report. This is twice the natural-log entropy, or twice the change in entropy. Entropy is Σ -log(p) over the observations, where p is the probability attributed to the response that occurred.
The G² for a candidate split that has been chosen is:
G²test = G²parent - (G²left + G²right).
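The G² calculation can be illustrated similarly. This is a hypothetical sketch with made-up counts, not JMP code:

```python
# Hedged sketch of G^2 for a categorical response: twice the sum of
# -ln(p) over observations, where p is the fitted probability of the
# response level that occurred (count divided by node total).
import math

def g2(counts):
    n = sum(counts)
    return 2 * sum(-c * math.log(c / n) for c in counts if c > 0)

parent = [50, 50]                  # a perfectly mixed parent node
left, right = [45, 5], [5, 45]     # a candidate split into purer children
g2_test = g2(parent) - (g2(left) + g2(right))
print(round(g2_test, 2))  # 73.61
```

A split that produces purer child nodes lowers the children's combined G², so the change G²test is large for good splits.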
Partition actually uses two rates: one for training, which is the usual ratio of count to total, and another that is slightly biased away from zero. Because the second rate never attributes a probability of zero, logs of probabilities can be calculated on validation or excluded sets of data, as used in the Entropy RSquare.
Predicted Probabilities in Decision Tree and Bootstrap Forest
The predicted probabilities for the Decision Tree and Bootstrap Forest methods are calculated as described below by the Prob statistic.
For categorical responses in Decision Tree, the Show Split Prob command shows the following statistics:
Rate
The proportion of observations at the node for each response level.
Prob
The predicted probability for that node of the tree. The method for calculating Prob for the ith response level at a given node is as follows:
Probi = (ni + priori) / Σj (nj + priorj)
where the summation is across all response levels j; ni is the number of observations at the node for the ith response level; and priori is the prior probability for the ith response level, calculated as:
priori = λpi+ (1-λ)Pi
where pi is the priori from the parent node, Pi is the Probi from the parent node, and λ is a weighting factor currently set at 0.9.
The estimate, Prob, is the same that would be obtained for a Bayesian estimate of a multinomial probability parameter with a conjugate Dirichlet prior.
The method for calculating Prob assures that the predicted probabilities are always nonzero.
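The Prob calculation can be sketched as follows. This is illustrative Python, not JMP code; taking the root-node priors and probabilities from the overall response rates is an assumption made for the example:

```python
# Hedged sketch of the Prob calculation, with the weighting factor
# lambda fixed at 0.9 as stated above.
LAMBDA = 0.9

def child_priors(parent_priors, parent_probs):
    # prior_i = lambda * p_i + (1 - lambda) * P_i
    return [LAMBDA * p + (1 - LAMBDA) * P
            for p, P in zip(parent_priors, parent_probs)]

def node_probs(counts, priors):
    # Prob_i = (n_i + prior_i) / sum_j (n_j + prior_j); always nonzero.
    total = sum(n + pr for n, pr in zip(counts, priors))
    return [(n + pr) / total for n, pr in zip(counts, priors)]

# Assumed parent-node priors and fitted probabilities, two-level response.
priors = child_priors([0.5, 0.5], [0.6, 0.4])
probs = node_probs([30, 0], priors)  # a pure child node: 30 vs. 0
print(probs)  # the second probability is small but never zero
```

Even though the second level has zero observations in the node, its Prob stays positive because the prior is added to each count, which is what allows log probabilities to be computed on validation data.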
 