Predicting complex skill learning with boosting

We will revisit our Skillcraft dataset in this section, this time in the context of another boosting technique known as stochastic gradient boosting. The main characteristic of this method is that in every boosting iteration, we compute the gradient of the loss function with respect to the predictions of the model built so far; this gradient captures the errors that the current model is making (for squared-error loss, it is essentially the vector of residuals).

This gradient is then used to guide the construction of the model that will be added in the next iteration, and the stochastic part of the name refers to the fact that each new model is fit on a random subsample of the training data. Stochastic gradient boosting is most commonly used with decision trees, and a good implementation in R can be found in the gbm package, which provides us with the gbm() function. For regression problems, we need to specify the distribution parameter to be gaussian. In addition, we can specify the number of trees we want to build (which is equivalent to the number of boosting iterations) via the n.trees parameter, as well as a shrinkage parameter that is used to control the algorithm's learning rate:

> boostedtree <- gbm(LeagueIndex ~ ., data = skillcraft_train, 
  distribution = "gaussian", n.trees = 10000, shrinkage = 0.1)
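
To make the idea behind the method more concrete, here is a minimal, hand-rolled sketch of gradient boosting with a squared-error loss. This is purely an illustration of the principle and not how gbm() is implemented internally; it assumes the rpart package and our skillcraft_train data frame are available, and the learning rate and number of iterations are arbitrary. Each new tree is fit to the residuals left by the current ensemble (which are proportional to the negative gradient of the squared-error loss), and its predictions are added with a small learning rate:

> library("rpart")
> learning_rate <- 0.1
> current_prediction <- rep(mean(skillcraft_train$LeagueIndex), 
                            nrow(skillcraft_train))
> boosting_data <- skillcraft_train
> for (i in 1:100) {
    # The residuals of the current ensemble play the role of the gradient
    boosting_data$residual <- skillcraft_train$LeagueIndex - current_prediction
    weak_learner <- rpart(residual ~ . - LeagueIndex, data = boosting_data)
    current_prediction <- current_prediction + 
      learning_rate * predict(weak_learner, boosting_data)
  }

The stochastic variant described above would additionally fit each tree on a random subsample of the training rows, which is what the bag.fraction parameter of gbm() controls.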

Note

To learn more about how stochastic gradient boosting works, a good source to consult is the paper titled Stochastic Gradient Boosting. This was written by Jerome H. Friedman and appears in the February 2002 issue of the journal Computational Statistics & Data Analysis.

To make predictions with this setup, we need to use the gbm.perf() function, whose job is to take the boosted model we built and pick out the optimal number of boosting iterations. We can then pass this number to the predict() function in order to make predictions on our test data. To measure the SSE on our test set, we will use the compute_SSE() function that we wrote in Chapter 6, Tree-based Methods:

> best.iter <- gbm.perf(boostedtree, method = "OOB")
> boostedtree_predictions <- predict(boostedtree, 
                                     skillcraft_test, best.iter)
> (boostedtree_SSE <- compute_SSE(boostedtree_predictions, 
                                  skillcraft_test$LeagueIndex))
[1] 555.2997

A bit of experimentation reveals that we can't get substantially better results than this by allowing the algorithm to iterate over more trees. Even so, we are already performing better with this method than with both the single and bagged tree models.
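
One more systematic way to carry out this kind of experimentation is to ask gbm() to run an internal cross-validation and then let gbm.perf() pick the number of trees from the cross-validation error curve. The snippet below is a sketch of this approach; cv.folds and the "cv" method are part of the gbm package, but the particular values shown here are only illustrative:

> boostedtree_cv <- gbm(LeagueIndex ~ ., data = skillcraft_train, 
                        distribution = "gaussian", n.trees = 10000, 
                        shrinkage = 0.1, cv.folds = 5)
> (best.iter_cv <- gbm.perf(boostedtree_cv, method = "cv"))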

Limitations of boosting

Boosting is a very powerful technique that continues to receive a lot of attention and research, but it is not without its limitations. Boosting relies on combining weak learners, so we can expect to get the most out of it when the models being combined are not already complex models themselves. We saw an example of this with neural networks: the more complex architecture with three hidden neurons is already a better learner to begin with than the simpler architecture with a single hidden neuron, and so it has less to gain from boosting.

Combining weak learners may be a way to reduce overfitting, but this is not always effective. By default, boosting uses all of its training data and progressively tries to correct mistakes that it makes without any penalizing or shrinkage criterion (although the individual models trained may themselves be regularized). Consequently, boosting can sometimes overfit.

Finally, a very important limitation is that many boosting algorithms use a symmetric loss function. Specifically, no distinction is made in classification between a false positive error and a false negative error: every type of error is treated the same when the observation weights are updated.

In practice, this might not be desirable, because one of the two errors may be more costly than the other. For example, on the website for our Major Atmospheric Gamma Imaging Cherenkov (MAGIC) telescope dataset, the authors state that a false positive (detecting gamma rays where there are none) is worse than a false negative (misclassifying gamma rays as background radiation). Cost-sensitive extensions of boosting algorithms have, however, been proposed to address this.

Random forests

The final ensemble model that we will discuss in this chapter is unique to tree-based models and is known as the random forest. In a nutshell, the idea behind random forests stems from an observation on bagging trees. Let's suppose that the actual relationship between the features and the target variable can be adequately described with a tree structure. It is quite likely that during bagging with moderately sized bootstrapped samples, we will keep picking the same features to split on high up in the tree.

For example, in our Skillcraft dataset, we expect to see APM as the feature that will be chosen at the top of most of the bagged trees. This is a form of tree correlation that essentially impedes our ability to derive the variance reduction benefits from bagging. Put differently, the different tree models that we build are not truly independent of each other because they will have many features and split points in common. Consequently, the averaging process at the end will be less successful in reducing the ensemble variance.

To counteract this effect, the random forest algorithm introduces an element of randomization in the tree construction process. Just as with bagging, random forests involve building a number of trees with bootstrapped samples and using the average of their predictions to form the ensemble prediction. When we construct individual trees, however, the random forest algorithm imposes a constraint.

At each node in the tree, we draw a random sample of size mtry from the total number of input features. Whereas in regular tree construction, we consider all the features at each node to determine which one to split on, with random forests, we only consider features from the sample we created for that node. We can often use a relatively small number for mtry.

The number of trees we build, in combination with the fact that each tree has several nodes, is often enough to ensure that the more important features are sampled a sufficient number of times. Various heuristics have been proposed for choosing appropriate values of mtry, such as one third or the square root of the total number of features available.
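
As a quick illustration, these two heuristics can be computed directly from our training data frame, which has 18 input features in addition to the LeagueIndex target; they also happen to be the default values that the randomForest package uses for regression and classification, respectively:

> p <- ncol(skillcraft_train) - 1   # number of input features
> floor(p / 3)                      # one-third heuristic (default for regression)
[1] 6
> floor(sqrt(p))                    # square-root heuristic (default for classification)
[1] 4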

This sampling step effectively forces the structures of the bagged trees to differ from one another and offers a number of benefits. Feature sampling allows us to consider input features that are successful in splitting the data for only a small range of the target variable. These locally relevant features are rarely chosen without the sampling constraint, because we usually prefer features that form good overall splits of the data at a given node in the tree. Nonetheless, we may want to include these features in our model if we don't want to overlook local variations in the output.

Similarly, sampling input features is useful when we have correlated input features. Regular tree construction tends to favor only one of the features from a correlated set while ignoring the rest, even though the splits produced by highly correlated features are not exactly the same. When we sample input features, we are less likely to have correlated features compete with each other, and so we end up using a wider range of features in our model.

In general, the randomized nature of the sampling process is designed to combat overfitting, because we can think of it as applying regularization to the impact of each input feature. Overfitting can still be a problem if we happen to have many more input features unrelated to the target variable than related ones, but this is a fairly rare scenario. Random forests also scale quite favorably with the number of input features, precisely because the sampling process does not require us to consider all the features when splitting at each node. In particular, this model is a good choice when the number of features exceeds the number of observations. Finally, the sampling process mitigates the cost of constructing a large number of trees, again because we only consider a subset of the input features when deciding how to split at each node.

The number of trees is another tuning parameter that we must decide on in a random forest model; it is very common to build anywhere between several hundred and a few thousand trees.

In R, we can use the randomForest package in order to train random forest models. The randomForest() function takes in a formula and a training data frame, as well as a number of other optional parameters. Of particular interest is the ntree parameter, which controls the number of trees that will be built for the ensemble, and the mtry parameter, which is the number of features sampled for use at each node for splitting. These parameters should be tuned by trying out different configurations, and we can use the tune() function from the e1071 package to do just that:

> library("randomForest")
> library("e1071")
> rf_ranges <- list(ntree = c(500, 1000, 1500, 2000), mtry = 3:8)
> rf_tune <- tune(randomForest, LeagueIndex ~ ., data = 
                  skillcraft_train, ranges = rf_ranges)
> rf_tune$best.parameters
   ntree mtry
14  1000    6
> rf_best <- rf_tune$best.model
> rf_best_predictions <- predict(rf_best, skillcraft_test)
> (rf_best_SSE <- compute_SSE(rf_best_predictions, 
                              skillcraft_test$LeagueIndex))
[1] 555.7611

The results show that on this dataset, the best parameter combination is to train 1,000 trees and use a value of 6 for mtry. This last value corresponds to one third of the number of input features, which is the typical heuristic for regression problems. The SSE value on our test set is almost identical to what we obtained using gradient boosting.

The importance of variables in random forests

We already discussed the fact that ensemble models do not, in general, have much explanatory power. For random forests, it turns out that we can still compute variable importance scores for the different input features by tallying the reductions in our error function across all the trees in the ensemble. In this way, we can obtain a plot analogous to the one we obtained for a single tree when we looked at this dataset in Chapter 6, Tree-based Methods.

To compute variable importance, we use the importance() function provided by the randomForest package and then plot the results.
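
What follows is a minimal sketch of these commands, assuming the rf_best model trained in the previous section is still in our workspace. varImpPlot() is the plotting helper that comes with the randomForest package, and because we did not set importance = TRUE when training, the measure reported for our regression forest is the node-purity-based IncNodePurity:

> rf_importance <- importance(rf_best)
> # Sort the features from most to least important before inspecting them
> rf_importance[order(rf_importance[, 1], decreasing = TRUE), , drop = FALSE]
> varImpPlot(rf_best)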

(Figure: variable importance scores for the features of the Skillcraft random forest model)

Looking at this plot, we can see that APM and ActionLatency are once again the most important features, but their order is reversed. We also see that TotalHours is now third in importance, significantly higher than what we saw before in a single tree.

We have explored the Skillcraft dataset using a number of different methods, but each time we treated it as a regression problem and measured our performance using the SSE. Our target variable is the league index, which tells us the gaming league in which a player competes. As such, it is actually an ordered factor.

As we've already seen before, models whose output is an ordered factor can be tricky to train as well as assess. For example, perhaps a more appropriate method of assessing our model would be to first round our model's numerical output so that we obtain a prediction on an actual player league. Then, we could assess the model using a weighted classification error rate that more heavily penalizes a predicted league index that is very far from the actual league index. We leave this as an exercise for the reader.
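
As a rough starting point for this exercise, and only as one possible interpretation rather than a definitive solution, we could round the numeric predictions from our random forest to the nearest league, clamp them to the range of leagues observed in the training data, and then penalize errors in proportion to how many leagues separate the predicted league from the actual one:

> league_range <- range(skillcraft_train$LeagueIndex)
> rounded_predictions <- pmin(pmax(round(rf_best_predictions), 
                              league_range[1]), league_range[2])
> # Average number of leagues by which the rounded predictions are off
> mean(abs(rounded_predictions - skillcraft_test$LeagueIndex))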

One of the issues we often face when we model this as a regression problem is that we have no way to force the model to predict across the full range of the original levels. In our particular dataset, for example, we might never predict the lowest or highest league. For some suggestions on alternative ways to model ordered factors with regression trees, there is an insightful paper published in 2000 by Kramer and others, titled Prediction of Ordinal Classes Using Regression Trees. This appears in the 34th issue of Fundamenta Informaticae, published by IOS Press.

Note

For random forests, the original reference is a 2001 paper by Leo Breiman, titled Random Forests, published in the journal Machine Learning. Besides this reference, a fantastic chapter with numerous examples appears in the book Statistical Learning from a Regression Perspective by Richard A. Berk, published by Springer.

XGBoost

In the same vein as AdaBoost, and with yet another twist on gradient boosting, is Extreme Gradient Boosting (XGBoost). XGBoost is a library of functions designed and optimized specifically for tree boosting algorithms. It is an advanced toolkit that yields impressive results, but it does take some time to understand.

XGBoost is based on the well-known gradient boosting framework, but it is more efficient. Specifically, XGBoost leverages system-level optimizations such as out-of-core computation, parallelization, cache optimization, and distributed computing to create a faster and more flexible tool for learning tree ensembles.

Note

Although boosting itself is a sequential process, XGBoost is built to make full use of all the cores of a machine: within each boosting iteration, the work of growing a tree, and in particular the search for the best split among the candidate features, is carried out in parallel over blocks of the data.

In addition, XGBoost exposes many parameters that can be customized and is extensible, which makes it much more flexible. Other advantages offered by XGBoost include built-in regularization, more effective tree pruning, and built-in cross-validation.
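
To give a flavor of what this looks like in R, the following is a minimal sketch of fitting the same Skillcraft regression with the xgboost package. This is only an illustration rather than a tuned model: it assumes that the skillcraft_train and skillcraft_test data frames contain only numeric columns, that the compute_SSE() function from Chapter 6 is available, and that a recent version of the package is installed (older versions name the regression objective "reg:linear" instead of "reg:squarederror"); the values chosen for nrounds, eta, and max_depth are arbitrary starting points:

> library("xgboost")
> train_matrix <- as.matrix(skillcraft_train[, 
                            names(skillcraft_train) != "LeagueIndex"])
> test_matrix <- as.matrix(skillcraft_test[, 
                           names(skillcraft_test) != "LeagueIndex"])
> xgb_model <- xgboost(data = train_matrix, 
                       label = skillcraft_train$LeagueIndex, 
                       objective = "reg:squarederror", nrounds = 500, 
                       eta = 0.05, max_depth = 4, verbose = 0)
> xgb_predictions <- predict(xgb_model, test_matrix)
> compute_SSE(xgb_predictions, skillcraft_test$LeagueIndex)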
