Chapter 13: Bootstrap Forests and Boosted Trees

Introduction

Bootstrap Forests

Understand Bagged Trees

Perform a Bootstrap Forest

Perform a Bootstrap Forest for Regression Trees

Boosted Trees

Understand Boosting

Perform Boosting

Perform a Boosted Tree for Regression Trees

Use Validation and Training Samples

Exercises

Introduction

Decision Trees, discussed in Chapter 10, are easy to understand and explain, can handle qualitative variables without the need for dummy variables, and (as long as the tree isn't too large) are easy to interpret. Despite all these advantages, trees suffer from one grievous problem: they are unstable.

In this context, unstable means that a small change in the input can cause a large change in the output. For example, if one variable is changed even a little, and if the variable is important, then it can cause a split high up in the tree to change and, in so doing, cause changes all the way down the tree. Trees can be very sensitive not just to changes in variables, but also to the inclusion or exclusion of variables.

Fortunately, there is a remedy for this unfortunate state of affairs. As shown in Figure 13.1, this chapter discusses two techniques, Bootstrap Forests and Boosted Trees, that overcome this instability and often produce better models.

Figure 13.1: A Framework for Multivariate Analysis

image

Bootstrap Forests

The first step in constructing a remedy involves a statistical method known as “the bootstrap.” The idea behind the bootstrap is to take a single sample and turn it into several “bootstrap samples,” each of which has the same number of observations as the original sample. In particular, a bootstrap sample is produced by random sampling with replacement from the original sample. Each bootstrap sample is then used to build a tree, and the trees’ results are averaged to obtain a prediction or classification for each observation. This averaging stabilizes the results. Thus, the bootstrap remedies the great deficiency of trees.
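To make the resampling step concrete, here is a minimal sketch in Python (not the software used in this book); the tiny data frame is only a stand-in for a real data set.

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# A tiny stand-in data set with 10 observations.
original = pd.DataFrame({"x": np.arange(10), "y": rng.normal(size=10)})
n = len(original)

# Draw 5 bootstrap samples, each with the same number of rows as the original,
# by sampling row indices with replacement.
bootstrap_samples = []
for b in range(5):
    rows = rng.integers(0, n, size=n)  # indices 0..n-1, drawn with replacement
    bootstrap_samples.append(original.iloc[rows].reset_index(drop=True))

# Because sampling is with replacement, some rows appear more than once and,
# on average, roughly a third of the original rows are left out of each sample.
print(bootstrap_samples[0]["x"].value_counts().head())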

This chapter does not dwell on the intricacies of the bootstrap method. (If interested, see “The Bootstrap,” an article by Shalizi (2010) in American Scientist.) Suffice it to say that bootstrap methods are very powerful: in general, they do no worse than traditional methods that analyze only the original sample, and very often (as in the present case) they do much better.

It seems obvious, then, that you should take your original sample, turn it into several bootstrap samples, and construct a tree for each bootstrap sample. You could then combine the results of these several trees. In the case of classification, you could grow each tree so that it classifies every observation, keeping in mind that the trees will not all classify a given observation the same way.

Bootstrap forests, also called random forests in the literature, are probably the most powerful method presented in this book. On any particular problem, some other method might perform better, but in general bootstrap forests will outperform the other methods. Beware, though, of this great power. On some data sets, a bootstrap forest can fit the data perfectly or almost perfectly, yet such a model will not predict perfectly or almost perfectly on new data. This is the phenomenon of “overfitting” the data, which is discussed in detail in Chapter 14. For now, the important point is that there is no reason to try to fit the data as well as possible; just try to fit it well enough. You might use other algorithms as benchmarks, and then see whether bootstrap forests can do better.

Understand Bagged Trees

Suppose you grew 101 bootstrap trees. Then you would have 101 classifications (“votes”) for the first observation. If 63 of the votes were “yes” and 38 were “no,” then you would classify the first observation as a “yes.” Similarly, you could obtain classifications for all the other observations. This method is called “bagged trees,” where “bag” is shorthand for “bootstrap aggregation”: bootstrap the many trees and then aggregate the individual answers from all the trees. A similar approach can obtain predictions for each observation in the case of regression trees. This method uses the same data to build the trees and to compute the classification error.
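A minimal sketch of the voting step in Python, with made-up per-tree classifications standing in for the output of real trees:

import numpy as np

rng = np.random.default_rng(seed=2)

# Pretend 101 trees each classified 6 observations as 1 ("yes") or 0 ("no").
n_trees, n_obs = 101, 6
votes = rng.integers(0, 2, size=(n_trees, n_obs))  # stand-in per-tree classifications

# Bagged classification: an observation is "yes" if more than half of the trees say "yes".
yes_votes = votes.sum(axis=0)
bagged_class = np.where(yes_votes > n_trees / 2, "yes", "no")
print(yes_votes)
print(bagged_class)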

An alternative method of obtaining predictions from bootstrapped trees is the use of “in-bag” and “out-of-bag” observations. Some observations, say two-thirds, are used to build the tree (these are the “in-bag” observations) and then the remaining one-third out-of-bag observations are dropped down the tree to see how they are classified. The predictions are compared to the truth for the out-of-bag observations, and the error rate is calculated on the out-of-bag observations. The reasons for using out-of-bag observations will be discussed more fully in Chapter 14. Suffice it to say that using the same observations to build the tree and then also to compute the error rate results in an overly optimistic error rate that can be misleading.
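If you want to see the in-bag/out-of-bag idea in code, scikit-learn's BaggingClassifier (whose default base learner is a decision tree) can report an out-of-bag accuracy. This is only a rough analogue of JMP's procedure, and the bundled data set below stands in for the Titanic data.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the Titanic data

# Each tree is fit on a bootstrap sample; oob_score=True asks for the accuracy
# computed only on each tree's out-of-bag observations.
bag = BaggingClassifier(n_estimators=101, oob_score=True, random_state=1).fit(X, y)

print("in-bag (training) accuracy:", bag.score(X, y))
print("out-of-bag accuracy:", bag.oob_score_)

As the text warns, the in-bag accuracy printed here will be overly optimistic relative to the out-of-bag accuracy.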

There is a problem with bagged trees: the trees are all quite similar, so their structures are highly correlated. You could get better answers if the trees were not so correlated, if each tree were more of an independent solution to the classification problem at hand. The way to achieve this was discovered by Breiman (2001). Breiman’s insight was not to use all the independent variables for making each split. Instead, at each split, only a random subset of the independent variables is considered.

To see the advantage of this insight, consider a node that needs to be split. Suppose variable X1 would split this node into two child nodes, each containing about the same number of observations and each only moderately homogeneous. Suppose instead that variable X2 would split the node into two child nodes, one of which is small but relatively pure, while the other is much larger and only moderately homogeneous. If X1 and X2 have to compete against each other at this spot and X1 wins, then you would never uncover the small, homogeneous node. On the other hand, if X1 is excluded and X2 is included, so that X2 does not have to compete against X1, then the small, homogeneous pocket will be uncovered. A large number of trees is created in this manner, producing a forest of bootstrap trees. Then, after each tree has classified all the observations, voting is conducted to obtain a classification for each observation. A similar approach is used for regression trees.
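Breiman's idea is what scikit-learn's RandomForestClassifier implements; its max_features argument plays roughly the role of JMP's Number of Terms Sampled per Split. A minimal sketch, again with a bundled data set standing in for the Titanic data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the Titanic data

# Each split considers only a random subset of the predictors ("sqrt" means
# roughly the square root of the number of predictors), which de-correlates the trees.
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # predictors sampled per split
    oob_score=True,       # report accuracy on out-of-bag observations
    random_state=1,
).fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)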

Perform a Bootstrap Forest

To demonstrate Bootstrap Forests, use the Titanic data set, TitanicPassengers.jmp, the variables of which are described below in Table 13.1. It has 1,309 observations.

Table 13.1: Variables in the TitanicPassengers.jmp Data Set

Variable Description
Passenger Class * 1 = first, 2 = second, 3 = third
Survived * No, Yes
Name Passenger name
Sex * Male, female
Age * Age in years
Siblings and Spouses * Number of Siblings and Spouses aboard
Parents and Children * Number of Parents and Children aboard
Ticket # Ticket number
Fare * Fare in British pounds
Cabin Cabin number (known only for a few passengers)
Port * Q = Queenstown, C = Cherbourg, S = Southampton
Lifeboat 16 lifeboats 1–16 and four inflatables A–D
Body Body identification number for deceased
Home/Destination Home or destination of traveler

You want to predict who will survive:

1.   Open the TitanicPassengers.jmp data set.

2.   In the course of due diligence, you will engage in exploratory data analysis before beginning any modeling. This exploratory data analysis will reveal that Body correlates perfectly with not surviving (Survived), as selecting Analyze ▶ Tabulate (or Fit Y by X) for these two variables will show. Also, Lifeboat correlates very highly with surviving (Survived), because very few of the people who got into a lifeboat failed to survive. So use only the variables marked with an asterisk in Table 13.1.

3.   Select Analyze ▶ Modeling ▶ Partition.

4.   Select Survived as Y, response. The other variables with asterisks in Table 13.1 are X, Factor.

5.   For Method, choose Bootstrap Forest. Validation Portion is zero by default. Validation will be discussed in Chapter 14. For now, leave this at zero.

6.   Click OK.

Understand the Options in the Dialog Box

Some of the options presented in the Bootstrap Forest dialog box, shown in Figure 13.2, are as follows:

   Number of trees in the forest is self-explanatory. There is no theoretical guidance on what this number should be, but empirical evidence suggests that there is little benefit to a very large forest. The default is 100; try also 300 and 500. Setting the number of trees in the thousands will probably not be helpful.

   Number of terms sampled per split is the number of variables to consider at each split. If the original number of predictors is p, use √p rounded down for classification and p/3 rounded down for regression (Hastie et al. 2009, p. 592); a short worked example of this rule appears after this list. These are only rough recommendations: after trying √p, also try 2√p and √p/2, as well as other values if necessary.

   Bootstrap sample rate is the proportion of the data set to resample with replacement. Just leave this at the default 1 so that the bootstrap samples have the same number of observations as the original data set.

◦   Minimum Splits Per Tree and Maximum Splits Per Tree are self-explanatory.

◦   Minimum Size Split is the minimum number of observations in a node that is a candidate for splitting. For classification problems, the minimum node size should be one; for regression problems, it should be five, as recommended by Hastie et al. (2009, p. 592).

◦   Do not check the box Multiple Fits over number of terms. The associated Max Number of Terms is only used when the box is checked. The interested reader is referred to the user guide for further details.
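Here is the worked example of the terms-per-split rule mentioned above, as a small Python sketch. The counts are taken from this chapter: seven starred Titanic predictors (all starred variables except Survived) for classification, and a hypothetical 13-predictor regression problem for comparison.

import math

def terms_per_split(p, problem="classification"):
    # Rule-of-thumb number of predictors to sample at each split.
    m = math.sqrt(p) if problem == "classification" else p / 3
    return max(1, math.floor(m))

# Titanic classification with 7 predictors: floor(sqrt(7)) = 2,
# the value tried in the relaunch later in this section.
print(terms_per_split(7, "classification"))

# A regression problem with, say, 13 predictors: floor(13 / 3) = 4.
print(terms_per_split(13, "regression"))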

Figure 13.2: The Bootstrap Forest Dialog Box

image

For now, do not change any of the options and just click OK.

The output of the Bootstrap Forest should look like Figure 13.3.

Figure 13.3: Bootstrap Forest Output for the Titanic Passengers Data Set

image

Select Options and Relaunch

Your results will be slightly different because this algorithm uses a random number generator to select the bootstrap samples. The sample size is 1,309. The two off-diagonal values in the Confusion Matrix, 261 (lower left) and 24 (upper right), are the classification errors (discussed further in Chapter 14). Added together, 261 + 24 = 285, they form the numerator of the misclassification rate reported in Figure 13.3: 285/1,309 = 21.77%. Now complete the following steps:

1.   Click the red triangle next to Bootstrap Forest for Survived.

2.   Select Script ▶ Relaunch Analysis.

3.   The Partition dialog box appears. Click OK.

4.   Now you are back in the Bootstrap Forest dialog box, as in Figure 13.2. This time, double the Number of Terms Sampled Per Split to 2.

5.   Click OK.

The Bootstrap Forest output should look similar to Figure 13.4.

Figure 13.4: Bootstrap Forest Output with the Number of Terms Sampled per Split Set to 2

image

Examine the Improved Results

Notice the dramatic improvement: the error rate is now 16.5%. You could run the model again, this time increasing the Number of Terms Sampled Per Split to 3 and the Number of Trees to 500; these changes will produce another dramatic improvement. Notice also that, although there are many missing values in the data set, Bootstrap Forest uses the full 1,309 observations. Many other algorithms (for example, logistic regression) have to drop observations that have missing values.

An additional advantage of bootstrap forests is that, just as the basic Decision Trees in Chapter 10 produced Column Contributions to show the important variables, the Bootstrap Forest produces a similar ranking of variables. To get this list, click the red triangle next to Bootstrap Forest for Survived and select Column Contributions. This ranking can be especially useful in providing guidance for variable selection when later building logistic regression or neural network models.
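In scikit-learn, a rough analogue of Column Contributions is the feature_importances_ attribute of a fitted forest. A minimal sketch, with a bundled data set standing in for the Titanic data:

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()  # stand-in for the Titanic data
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(data.data, data.target)

# One importance score per predictor, analogous to JMP's Column Contributions.
importances = pd.Series(forest.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head())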

Perform a Bootstrap Forest for Regression Trees

Now briefly consider random forests for regression trees. Use the data set MassHousing.jmp, in which the target variable is mvalue (median value):

1.   Select Analyze ▶ Modeling ▶ Partition.

2.   Select mvalue for Y, Response and all the other variables as X, Factor.

3.   For Method, select Bootstrap Forest.

4.   Click OK.

5.   In the Bootstrap Forest dialog box, leave everything at default and click OK.

The Bootstrap Forest output should look similar to Figure 13.5.

Figure 13.5: Bootstrap Forest Output for the Mass Housing Data Set

image

Under Overall Statistics, look at the In-Bag and Out-of-Bag RMSE. Notice that the Out-of-Bag RMSE is much larger than the In-Bag RMSE. This is to be expected, because the algorithm fits on the In-Bag data and then applies the estimated model to data that were not used in the fitting to obtain the Out-of-Bag RMSE. You will learn much more about this topic in Chapter 14. What’s important for your purposes is that you obtained RSquare = 0.879 and RMSE = 3.365 for the full data set (remember that your results will be different because of the random number generator). These values compare quite favorably with the results from a linear regression: RSquare = 0.7406 and RMSE = 4.745. You can see that Bootstrap Forest regression can offer a substantial improvement over traditional linear regression.
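A minimal sketch of the same comparison in Python, using synthetic data as a stand-in for the Mass Housing data. Note that fitting and scoring on the same data flatters the forest; that issue is taken up under overfitting in Chapter 14.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data with a mild nonlinearity, standing in for MassHousing.jmp.
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=1)
y = y + 0.05 * X[:, 0] ** 2  # nonlinearity that a forest can exploit

for name, model in [("linear regression", LinearRegression()),
                    ("bootstrap forest", RandomForestRegressor(n_estimators=100, random_state=1))]:
    pred = model.fit(X, y).predict(X)
    rmse = np.sqrt(mean_squared_error(y, pred))
    print(name, "R2 =", round(r2_score(y, pred), 3), "RMSE =", round(rmse, 3))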

Boosted Trees

Boosting is a general approach to combining a sequence of models, in which each successive model changes slightly in response to the errors from the preceding model.

Understand Boosting

Boosting starts by estimating a model and obtaining its residuals. The observations with the biggest residuals (where the model did the worst job) are given additional weight, and the model is then re-estimated on this reweighted data set. In the case of classification, the misclassified observations are given more weight. After several models have been constructed, the estimates from these models are averaged to produce a prediction or classification for each observation. As was the case with Bootstrap Forests, this averaging implies that the predictions or classifications from the Boosted Tree model will not be unstable. When boosting, there is often no need to build elaborate models; simple models often suffice. In the case of trees, there is no need to grow the tree out completely; a tree with just a few splits will often do the trick. Indeed, simply fitting “stumps” (trees with only a single split and two leaves) at each stage often produces good results.

A boosted tree builds a large tree by fitting a sequence of smaller trees. At each stage, a smaller tree is grown on the scaled residuals from the prior stage, and the magnitude of the scaling is governed by a tuning parameter called the learning rate. The essence of boosting is that, on the current tree, it gives more weight to the observations that were misclassified on the prior tree.
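A minimal hand-rolled sketch of this idea for a regression target with squared-error loss (an illustration of the principle, not JMP's algorithm): each stage fits a stump to the current residuals and adds a scaled-down correction to the running prediction.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(seed=3)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
n_layers = 200
prediction = np.full_like(y, y.mean())  # stage 0: predict the mean

for _ in range(n_layers):
    residuals = y - prediction                         # where the current model does worst
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)     # add a scaled correction

print("RMSE after boosting:", np.sqrt(np.mean((y - prediction) ** 2)))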

Perform Boosting

Use boosted trees on the data set TitanicPassengers.jmp:

1.   Select Analyze ▶ Modeling ▶ Partition.

2.   For Method, select Boosted Tree.

3.   Use the same variables as you did with Bootstrap Forests. So select Survived as Y, response. The other variables with asterisks in Table 13.1 are X, Factor.

4.   Click OK.

The Boosted Tree dialog box will appear, as shown in Figure 13.6.

Figure 13.6: The Boosted Tree Dialog Box

image

Understand the Options in the Dialog Box

The options are as follows (a brief sketch relating them to comparable settings in other software appears after the list):

   Number of Layers is the number of stages in the final boosted tree, that is, the number of smaller trees to grow.

   Splits Per Tree is the number of splits for each stage (tree). If the number of splits is one, then “stumps” are being used.

   Learning Rate is a number between zero and one. A number close to one means faster learning, but at the risk of overfitting. Set this number close to one when the Number of Layers (trees) is small.

   Overfit Penalty helps protect against fitting probabilities equal to zero. It applies only to categorical targets.

   Minimum Split Size is the smallest number of observations to be in a node before it can be split.

   Multiple Fits over splits and learning rate will have JMP build a separate boosted tree for all combinations of splits and learning rate that the user chooses. Leave this box unchecked.
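As a rough analogue only (not JMP's implementation), scikit-learn's gradient boosting classifier exposes similarly named settings. The parameter correspondences below are approximate, and the bundled data set stands in for the Titanic data.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the Titanic data

boost = GradientBoostingClassifier(
    n_estimators=50,      # roughly "Number of Layers": how many small trees to grow
    max_depth=3,          # limits the size of each tree (max_depth=1 gives stumps)
    learning_rate=0.1,    # same role as JMP's Learning Rate
    min_samples_split=5,  # roughly "Minimum Split Size"
    random_state=1,
).fit(X, y)

print("training misclassification rate:", 1 - boost.score(X, y))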

Select Options and Relaunch

For now, leave everything at default and click OK. The Boosted Tree output is shown in Figure 13.7. It shows a misclassification rate of 18.3%.

Figure 13.7: Boosted Tree Output for the Titanic Passengers Data Set

image

Using the guidance given about the options, set the Learning Rate high, to 0.9:

1.   Click the red triangle for Boosted Tree.

2.   Select Script ▶ Relaunch Analysis. The Partition dialog box appears.

3.   Click OK. The Boosted Tree dialog box appears.

4.   Change the Learning Rate to 0.9.

5.   Click OK.

Examine the Improved Results

The Boosted Tree output will look like Figure 13.8, which has an error rate of 14.4%.

Figure 13.8: Boosted Tree Output with a Learning Rate of 0.9

image

This is a substantial improvement over the default model and better than the Bootstrap Forest models. You could run the model again, this time changing the Number of Layers to 250; because it just needs to be bigger than the default, you could instead have chosen 200 or 400. Change the Learning Rate to 0.4; because it just needs to be somewhere between 0.1 and 0.9, you could instead have chosen 0.3 or 0.6. Change the number of Splits Per Tree to 5 (again, there is nothing magic about this number). You should get an error rate of about 10.2%, which is much better than the previous results.
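This kind of hand tuning can be written as a small loop. Below is a hedged sketch in scikit-learn with a stand-in data set; the specific values tried are arbitrary. Without a validation portion, the printed error rates are training errors, so they will keep shrinking as the model grows, which is exactly the trap discussed in the next sections.

from itertools import product
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the Titanic data

# Try a few combinations of layers, learning rate, and tree size.
for n_layers, rate, depth in product([50, 250], [0.1, 0.4, 0.9], [1, 3]):
    model = GradientBoostingClassifier(
        n_estimators=n_layers, learning_rate=rate, max_depth=depth, random_state=1
    ).fit(X, y)
    print(n_layers, rate, depth, "training error:", round(1 - model.score(X, y), 3))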

Boosted Trees is a very powerful method that also works for regression trees, as you will see next.

Perform a Boosted Tree for Regression Trees

Again use the data set, MassHousing.jmp:

1.   Select Analyze ▶ Modeling ▶ Partition.

2.   For Method, select Boosted Tree.

3.   Select mvalue as the dependent variable (Y, Response) and all the other variables as independent variables (X, Factor).

4.   Click OK.

5.   Leave everything at default and click OK.

You should get the Boosted Tree output as in Figure 13.9.

Figure 13.9: Boosted Tree Output for the Mass Housing Data Set

image

Boosting does better than the Bootstrap Forest in Figure 13.5 (compare the RSquare and RMSE values), to say nothing of the linear regression.

Next, relaunch the analysis and change the Learning Rate to 0.9. This gives a substantial improvement over the default model, with an RSquare of 0.979. Finally, relaunch the analysis and change the Number of Layers to 250 and the Learning Rate to 0.5. This is nearly a perfect fit, with an RSquare of 0.996. This is not really surprising, because both Bootstrap Forests and Boosted Trees are so powerful and flexible that they often can fit a data set perfectly.

Use Validation and Training Samples

When using such powerful methods, you should not succumb to the temptation to make the RSquare as high as possible, because such models rarely predict well on new data. To gain some insight into this problem, you will consider one more example in this chapter, in which you will use a manually selected holdout sample.

You will divide the data into two samples, a “training” sample that consists of, for example, 75% of the data, and a “validation” sample that consists of the remaining 25%. You will then rerun your three boosted tree models on the Titanic Passengers data set on the training sample. JMP will automatically use the estimated models to make predictions on the validation sample.
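A minimal sketch of the same 75% / 25% idea in scikit-learn, with a bundled data set standing in for the Titanic data (the stratify argument plays roughly the role of the stratified random option discussed below): fit on the training portion, then evaluate on the held-out validation portion.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the Titanic data

# 75% training, 25% validation; stratify keeps the class proportions similar in both pieces.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)
print("training error:  ", round(1 - model.score(X_train, y_train), 3))
print("validation error:", round(1 - model.score(X_valid, y_valid), 3))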

Create a Dummy Variable

To effect this division into training and validation samples, you will need a dummy variable that randomly divides the data 75% / 25%:

1.   Open TitanicPassengers.jmp.

2.   Select Cols ▶ Modeling Utilities ▶ Make Validation Column. The Make Validation Column report dialog box will appear, as in Figure 13.10.

Three methods present themselves: Formula Random, Fix Random, and Stratified Random. The Formula Random option uses a random function to split the sample; the Fix Random option uses a random number generator to split the sample. The Stratified Random option produces a stratified sample; use it when you want equal representation of the values from a column in each of the training, validation, and testing sets.

3.   For now, click Formula Random.

Figure 13.10: The Make Validation Column Report Dialog Box

image

You will see that a new column called Validation has been added to the data table. You specified that the training set is to be 0.75 of the total rows, but this is really just a suggestion. The validation set will contain about 0.25 of the total rows.

Perform a Boosting at Default Settings

Run a Boosted Tree at default as before:

1.   Select Analyze ▶ Modeling ▶ Partition.

2.   As you did before, select Survived as Y, response, and the other variables with asterisks in Table 13.1 as X, Factor.

3.   Select the Validation column and then click Validation.

4.   For Method select Boosted Tree.

5.   Click OK.

6.   Click OK again for the options window. You are initially estimating this model with the defaults.

Examine Results and Relaunch

The results are presented in the top panel of Figure 13.11. There are 985 observations in the training sample and 324 in the validation sample. Because 0.75 is just a suggestion and because the random number generator is used, your results will not agree exactly with the output in Figure 13.11. The error rate in the training sample is 17.6%, and the error rate in the validation sample is 24.4%.

Figure 13.11: Boosted Trees Results for the Titanic Passengers Data Set with a Training and Validation Set

image

The validation error rate is somewhat higher than the training error rate, but the gap is modest. The important point is that the model does not badly overfit the data (overfitting shows up when performance on the training data is much better than performance on new data). Now relaunch:

1.   Click the red triangle for Boosted Tree.

2.   Select Script ▶ Relaunch Analysis.

3.   Click OK to get the Boosted Tree dialog box for the options, and change the Learning Rate to 0.9.

4.   Click OK.

Compare Results to Choose the Least Misleading Model

You should get results similar to Figure 13.12, where the training error rate is 15.1% and the validation error rate is 24.4%.

Figure 13.12: Boosted Trees Results with Learning Rate of 0.9

image

Now you see that the model does a noticeably better job of “predicting” on the training sample than on brand new data. This makes you think that perhaps you should prefer the default model, because it does not mislead you into thinking you have more accuracy than you really do.

See whether this pattern persists for the third model. Observe that the Number of Layers has decreased to 19, even though you specified it to be 50. This adjustment is performed automatically by JMP. As you did before, change the Learning Rate to 0.4. You should get results similar to those shown in Figure 13.13, where the training error rate is 16.7% and the validation error rate is 23.5%. JMP again has changed the Number of Layers, this time from the default 50 to 21.

Figure 13.13: Boosted Trees Results with a Learning Rate of 0.4

image

It seems that no matter how you tweak the model to achieve better “in-sample” performance (that is, performance on the training sample), you always get roughly the same error rate, about 24%, on the brand new data. So which of the three models should you choose? The one that misleads you least? The default model, because its training sample performance is close to its validation sample performance? This idea of using “in-sample” and “out-of-sample” predictions to select the best model will be fully explored in the next chapter.

Consider, for example, a 24-year-old male passenger who is predicted not to survive. Here is how that prediction is made; predictions are created in the following way for both Bootstrap Forests and Boosted Trees. Suppose 38 trees are grown. The data for the new case are dropped down each tree (just as predictions were made for a single Decision Tree), and each tree makes a prediction. Then a “vote” is taken across all the trees, with the majority determining the winner. If, of the 38 trees, 20 predict “No” and the remaining 18 predict “Yes,” then that observation is predicted not to survive.

Exercises

1.   Without using a Validation column, run a logistic regression on the Titanic data and compare to the results in this chapter.

2.   Can you improve on the results in Figure 1?

3.   How high can you get the RSquare in the Mass Housing example used in Figure 4?

4.   Without using a validation column, apply logistic regression, bootstrap forests, and boosted trees to the Churn data set.

5.   Use a validation sample on boosted regression trees with Mass Housing. How high can you get the RSquare on the validation sample? Compare this to your answer for question (3).

6.   Use a validation sample, and apply logistic regression, bootstrap forests, and boosted trees to the Churn data set. Compare this answer to your answer for question (4).
