Improving model performance with meta-learning

As an alternative to increasing the performance of a single model, it is possible to combine several models to form a powerful team. Just as the best sports teams have players with complementary rather than overlapping skillsets, some of the best machine learning algorithms utilize teams of complementary models. Because each model brings a unique bias to a learning task, it may readily learn one subset of examples but have trouble with another. Therefore, by intelligently using the talents of several diverse team members, it is possible to create a strong team of multiple weak learners.

This technique of combining and managing the predictions of multiple models falls within a wider set of meta-learning methods that broadly encompass any technique that involves learning how to learn. This might include anything from simple algorithms that gradually improve performance by automatically iterating over design decisions—for instance, the automated parameter tuning used earlier in this chapter—to highly complex algorithms that use concepts borrowed from evolutionary biology and genetics for self-modifying and adapting to learning tasks.

For the remainder of this chapter, we'll focus on meta-learning only as it pertains to modeling a relationship between the predictions of several models and the desired outcome. The teamwork-based techniques covered here are powerful and are often used to build more effective classifiers.

Understanding ensembles

Suppose you were a contestant on a television trivia show that allowed you to choose a panel of five friends to assist you with answering the final question for the million-dollar prize. Most people would try to stack the panel with a diverse set of subject-matter experts. For instance, a panel containing professors of literature, science, history, and art, along with a current pop-culture expert would be a safely well-rounded group. Given their breadth of knowledge, it would be unlikely to find a question that stumps the panel.

The meta-learning approach that utilizes a similar principle of creating a varied team of experts is known as an ensemble. All ensemble methods are based on the idea that by combining multiple weaker learners, a stronger learner is created. Using this simple principle, a large variety of algorithms has been developed, distinguished largely by two questions:

  • How are the weak learning models chosen and/or constructed?
  • How are the weak learners' predictions combined to make a single final prediction?

When answering these questions, it can be helpful to imagine the ensemble in terms of the following process diagram; nearly all ensemble approaches follow this pattern.

[Figure: Understanding ensembles]

First, input training data is used to build a number of models. The allocation function dictates whether each model receives the full training dataset or merely a sample. Since the ideal ensemble includes a diverse set of models, the allocation function could increase diversity by artificially varying the input data to train a variety of learners. For instance, it might use bootstrap sampling to construct unique training datasets or pass on a different subset of features or examples to each model. On the other hand, if the ensemble already includes a diverse set of algorithms—such as a neural network, a decision tree, and a kNN classifier—then the allocation function might pass on the data relatively unchanged.
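The two allocation strategies mentioned above can be sketched in a few lines of base R. This is only an illustration: the toy data frame and the helper names allocate_bootstrap() and allocate_features() are hypothetical and do not come from any package.

```r
# Toy training data: 10 examples, 4 features plus a class label (hypothetical)
set.seed(123)
train_data <- data.frame(f1 = rnorm(10), f2 = rnorm(10),
                         f3 = rnorm(10), f4 = rnorm(10),
                         class = factor(rep(c("no", "yes"), 5)))

# Allocation by bootstrap sampling: draw n row indices with replacement
allocate_bootstrap <- function(data) {
  rows <- sample(nrow(data), size = nrow(data), replace = TRUE)
  data[rows, , drop = FALSE]
}

# Allocation by feature subset: keep k randomly chosen features plus the label
allocate_features <- function(data, k, label = "class") {
  features <- setdiff(names(data), label)
  data[, c(sample(features, k), label), drop = FALSE]
}

boot_sample <- allocate_bootstrap(train_data)        # same size, repeated rows
feat_sample <- allocate_features(train_data, k = 2)  # 2 features + class
```

Each model in the ensemble would receive its own call to one of these helpers, so that no two learners see exactly the same data.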

After the models are constructed, they can be used to generate a set of predictions, which must be managed in some way. The combination function governs how disagreements among the predictions are reconciled. For example, the ensemble might use a majority vote to determine the final prediction, or it could use a more complex strategy such as weighting each model's votes based on its prior performance.
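As a minimal sketch of two such combination functions, the following base R code applies a simple majority vote and a weighted vote to the predictions of three hypothetical models. The predictions and the weights are invented for illustration; in practice the weights might be derived from each model's prior performance.

```r
# Predictions from three hypothetical models for five examples
preds <- data.frame(
  m1 = c("yes", "no",  "yes", "no", "yes"),
  m2 = c("yes", "yes", "no",  "no", "no"),
  m3 = c("no",  "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Majority vote: the most frequent class in each row wins
majority_vote <- function(votes) {
  names(which.max(table(votes)))
}
simple <- apply(preds, 1, majority_vote)

# Weighted vote: each model's vote counts in proportion to its weight
weighted_vote <- function(votes, weights) {
  totals <- tapply(weights, votes, sum)
  names(which.max(totals))
}
weights <- c(m1 = 3, m2 = 1, m3 = 1)  # illustrative weights only
weighted <- apply(preds, 1, weighted_vote, weights = weights)
```

With these weights, the second and fifth examples are decided by m1 alone, even though it is outvoted two to one under the simple majority rule.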

Some ensembles even utilize another model to learn a combination function from various combinations of predictions. For example, if the actual class value is usually no when M1 and M2 both vote yes, then the ensemble might learn to ignore the votes of M1 and M2 and instead predict no. This process of using the predictions of several models to train a final arbiter model is known as stacking.
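The idea can be sketched with a deliberately simple arbiter: a lookup table that records, for each combination of votes, the most common actual class. The data and the names arbiter and stack_predict() are hypothetical; a real stacked ensemble would typically train a proper model (a regression model or a decision tree, for instance) on held-out predictions.

```r
# Hypothetical held-out predictions from two base models, with the actual class
votes <- data.frame(
  m1     = c("yes", "yes", "yes", "yes", "no", "no", "no",  "yes", "no",  "no"),
  m2     = c("yes", "yes", "yes", "yes", "no", "no", "yes", "no",  "yes", "no"),
  actual = c("no",  "no",  "no",  "yes", "no", "no", "yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Arbiter: for each combination of votes, learn the most common actual class
combo <- paste(votes$m1, votes$m2, sep = "/")
arbiter <- sapply(split(votes$actual, combo),
                  function(a) names(which.max(table(a))))

# Apply the learned combination function to a new pair of votes
stack_predict <- function(m1, m2) {
  unname(arbiter[paste(m1, m2, sep = "/")])
}

stack_predict("yes", "yes")  # both models vote yes, yet the arbiter says no
```

In this toy data, three of the four cases in which both models voted yes were actually no, so the arbiter overrides the unanimous vote.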

One of the benefits of using ensembles is that they may allow you to spend less time in pursuit of a single best model. Instead, you can train a number of reasonably strong candidates and combine them. Yet convenience isn't the only reason why ensemble-based methods continue to rack up wins in machine learning competitions; ensembles also offer a number of performance advantages over single models:

  • Better generalizability to future problems: Because the opinions of several learners are incorporated into a single final prediction, no single bias is able to dominate. This reduces the chance of overfitting to a learning task.
  • Improved performance on massive or minuscule datasets: Many models run into memory or complexity limits when an extremely large set of features or examples is used, making it more efficient to train several small models than a single full model. Additionally, it is often trivial to parallelize an ensemble using distributed computing methods. Conversely, ensembles also do well on the smallest datasets because resampling methods like bootstrapping are inherently part of many ensemble designs.
  • The ability to synthesize data from distinct domains: Since there is no one-size-fits-all learning algorithm (recall the No Free Lunch theorem), the ensemble's ability to incorporate evidence from multiple types of learners is increasingly important as Big Data continues to draw from disparate domains.
  • A more nuanced understanding of difficult learning tasks: Real-world phenomena are often extremely complex with many interacting intricacies. Models that divide the task into smaller portions are likely to more accurately capture subtle patterns that a single global model might miss.

None of these benefits would be very helpful if you weren't able to easily apply ensemble methods in R, and there are many packages available to do just that. Let's take a look at several of the most popular ensemble methods and how they can be used to improve the performance of the credit model we've been working on.

Bagging

One of the first ensemble methods to gain widespread acceptance used a technique called bootstrap aggregating, or bagging for short. As described by Leo Breiman in 1994, bagging generates a number of training datasets by bootstrap sampling the original training data. These datasets are then used to generate a set of models using a single learning algorithm. The models' predictions are combined using voting (for classification) or averaging (for numeric prediction).

Note

For additional information on bagging, refer to: Bagging predictors, Machine Learning, Vol. 24, pp. 123-140, by L. Breiman (1996).

Although bagging is a relatively simple ensemble, it can perform quite well as long as it is used with relatively unstable learners, that is, those generating models that tend to change substantially when the input data changes only slightly. Unstable models are essential to ensure the ensemble's diversity in spite of only minor variations between the bootstrap training datasets. For this reason, bagging is often used with decision trees, which have the tendency to vary dramatically given minor changes in input data.
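To make the procedure concrete, here is a rough by-hand sketch of bagged decision trees. It assumes the rpart package (normally shipped with R) and uses the built-in iris data as a stand-in, since the credit data is not reproduced here; it is not how any particular package implements bagging.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(300)
n <- nrow(iris)
nbagg <- 25  # number of trees in the ensemble

# Train one tree per bootstrap sample of the training data
trees <- lapply(seq_len(nbagg), function(i) {
  boot_rows <- sample(n, size = n, replace = TRUE)
  rpart(Species ~ ., data = iris[boot_rows, ], method = "class")
})

# Collect each tree's class predictions: one column per tree
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = iris, type = "class"))
})

# Combine by majority vote across the trees for each example
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Resubstitution accuracy (optimistic, since the same data was used to train)
mean(bagged_pred == iris$Species)
```

Because each tree sees a slightly different bootstrap sample, the trees disagree on some examples, and the vote smooths out those disagreements.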

The ipred package offers a classic implementation of bagged decision trees. To train the model, the bagging() function works similarly to many of the models used previously. The nbagg parameter is used to control the number of decision trees voting in the ensemble (with a default value of 25). Depending on the difficulty of the learning task and the amount of training data, increasing this number may improve the model's performance, up to a limit. The downside is the additional computational expense; a large number of trees may take some time to train.

After installing the ipred package, we can create the ensemble as follows. We'll stick to the default value of 25 decision trees:

> library(ipred)
> set.seed(300)
> mybag <- bagging(default ~ ., data = credit, nbagg = 25)

The resulting model works as expected with the predict() function:

> credit_pred <- predict(mybag, credit)
> table(credit_pred, credit$default)
           
credit_pred  no yes
        no  699   2
        yes   1 298

Given the preceding results, the model seems to have fit the training data extremely well. To see how this translates into future performance, we can use the bagged trees with 10-fold CV via the train() function in the caret package. Note that the method name for the ipred bagged trees function is treebag:

> library(caret)
> set.seed(300)
> ctrl <- trainControl(method = "cv", number = 10)
> train(default ~ ., data = credit, method = "treebag",
        trControl = ctrl)

1000 samples
  16 predictors
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validation (10 fold) 

Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 

Resampling results

  Accuracy  Kappa  Accuracy SD  Kappa SD
  0.735     0.33   0.0344       0.0859  

The kappa statistic of 0.33 for this model suggests that the bagged tree model performs on par with our best-tuned C5.0 decision tree.

To get beyond bags of decision trees, the caret package also provides a more general bag() function. It includes out-of-the-box support for a handful of models, though it can be adapted to more types with a bit of additional effort. The bag() function uses a control object to configure the bagging process. It requires the specification of three functions: one for fitting the model, one for making predictions, and one for aggregating the votes.

For example, suppose we wanted to create a bagged support vector machine (SVM) model, using the ksvm() function in the kernlab package we used in Chapter 7, Black Box Methods – Neural Networks and Support Vector Machines. The bag() function requires us to provide functionality for training the SVMs, making predictions, and counting votes.

Rather than writing these ourselves, the caret package's built-in svmBag list object supplies three functions we can use for this purpose:

> str(svmBag)
List of 3
 $ fit      :function (x, y, ...)  
 $ pred     :function (object, x)  
 $ aggregate:function (x, type = "class")

By looking at the svmBag$fit function, we see that it simply calls the ksvm() function from the kernlab package and returns the result:

> svmBag$fit
function (x, y, ...) 
{
    library(kernlab)
    out <- ksvm(as.matrix(x), y, prob.model = is.factor(y), ...)
    out
}
<environment: namespace:caret>

The pred and aggregate functions for svmBag are similarly straightforward. By studying these functions and creating your own in the same format, it is possible to use bagging with any machine learning algorithm you would like.
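As an illustration of what "the same format" might look like, the following sketch defines a hypothetical rpartBag list with fit, pred, and aggregate functions matching the signatures shown by str(svmBag). This list is not part of caret, the choice to average class probabilities is an assumption made here for simplicity (caret's built-in objects may pool predictions differently), and the functions are exercised by hand on the built-in iris data rather than passed to bag().

```r
library(rpart)  # recommended package shipped with standard R installations

# Hypothetical list in the same format as svmBag; not part of caret
rpartBag <- list(
  # Train one model from predictors x and outcome y
  fit = function(x, y, ...) {
    rpart(y ~ ., data = data.frame(x, y = y), ...)
  },
  # Return a matrix of class probabilities for new predictors x
  pred = function(object, x) {
    predict(object, newdata = data.frame(x), type = "prob")
  },
  # Combine a list of probability matrices, one per model, by averaging
  aggregate = function(x, type = "class") {
    avg <- Reduce(`+`, x) / length(x)
    if (type == "class") {
      factor(colnames(avg)[max.col(avg, ties.method = "first")],
             levels = colnames(avg))
    } else {
      as.data.frame(avg)
    }
  }
)

# Exercise the three functions by hand on two bootstrap samples of iris
set.seed(300)
x <- iris[, 1:4]
y <- iris$Species
fits <- lapply(1:2, function(i) {
  rows <- sample(nrow(x), replace = TRUE)
  rpartBag$fit(x[rows, ], y[rows])
})
probs <- lapply(fits, rpartBag$pred, x = x)
bag_pred <- rpartBag$aggregate(probs)
```

Testing the three functions in isolation like this is a reasonable first step before handing them to a bagging control object.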

Tip

The caret package also includes example objects for bags of naive Bayes models (nbBag), decision trees (ctreeBag), and neural networks (nnetBag).

Applying the three functions in the svmBag list, we can create a bagging control object:

> bagctrl <- bagControl(fit = svmBag$fit, 
                        predict = svmBag$pred,
                        aggregate = svmBag$aggregate)

By using this with the train() function and the training control object (ctrl) defined earlier, we can evaluate the bagged SVM model as follows. Keep in mind that the kernlab package is required for this to work; you may need to install it if you have not done so previously.

> set.seed(300)
> svmbag <- train(default ~ ., data = credit, "bag",
                  trControl = ctrl, bagControl = bagctrl)
> svmbag
1000 samples
  16 predictors
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validation (10 fold) 

Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 

Resampling results

  Accuracy  Kappa  Accuracy SD  Kappa SD
  0.728     0.293  0.0444       0.132   

Tuning parameter 'vars' was held constant at a value of 35

Given that the kappa statistic is below 0.30, it seems that the bagged SVM model performs worse than the bagged decision tree model. It's worth pointing out that the standard deviation of the kappa statistic (labeled Kappa SD) is fairly large compared to the bagged decision tree model. This suggests that the performance varies substantially among the folds in the cross-validation. Such variation may imply that the performance could be improved further by increasing the number of models in the ensemble.

Boosting

Another popular ensemble-based method is called boosting, because it boosts the performance of weak learners to attain the performance of stronger learners. This method is based largely on the work of Rob Schapire and Yoav Freund, who have published extensively on the topic.

Note

For additional information on boosting, refer to: Boosting: Foundations and Algorithms, by R. Schapire and Y. Freund (The MIT Press, 2012).

Schapire and Freund discovered that, given a number of classifiers, each with an error rate less than 50 percent, boosting will result in performance that is often much better and certainly no worse than the best of these models. Essentially, this allows one to increase performance to an arbitrary threshold simply by adding more weak learners. Given the obvious utility of this finding, boosting is thought to be one of the most significant discoveries in machine learning.

Similar to bagging, boosting uses ensembles of models trained on resampled data and a vote to determine the final prediction. The key difference is that the resampled datasets in boosting are constructed specifically to generate complementary learners, and the vote is weighted based on each model's performance rather than giving each an equal vote.

A boosting algorithm called AdaBoost, or adaptive boosting, was proposed in 1997. The algorithm is based on the idea of generating weak learners that iteratively learn a larger portion of the difficult-to-classify examples in the training data by paying more attention (that is, giving more weight) to frequently misclassified examples.

Beginning from an unweighted dataset, the first classifier attempts to model the outcome. Examples that the classifier predicted correctly will be less likely to appear in the training dataset for the following classifier, and conversely, the difficult-to-classify examples will appear more frequently. As additional rounds of weak learners are added, they are trained on data with successively more difficult examples. The process continues until the desired overall error rate is reached or performance no longer improves. At that point, each classifier's vote is weighted according to its accuracy on the training data on which it was built.
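The following base R sketch illustrates the loop just described on a toy one-feature dataset. Note the assumptions: it adjusts example weights directly rather than resampling the data, it uses one-split "stumps" as the weak learners, and it uses the common formulation in which a classifier with weighted error err receives a vote of 0.5 * log((1 - err) / err). The data and the helper names fit_stump() and predict_stump() are hypothetical, and the code does not guard against a weak learner with zero error.

```r
# Toy one-feature dataset; classes coded as +1 / -1 (hypothetical data)
x <- 0:9
y <- c(1, 1, 1, -1, -1, -1, 1, 1, 1, -1)

# Weak learner: a one-split stump chosen to minimize the weighted error
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (cut in head(sort(unique(x)), -1) + 0.5) {
    for (sign_left in c(1, -1)) {
      pred <- ifelse(x < cut, sign_left, -sign_left)
      err <- sum(w[pred != y])
      if (err < best$err) {
        best <- list(cut = cut, sign_left = sign_left, err = err)
      }
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x < stump$cut, stump$sign_left, -stump$sign_left)
}

n <- length(y)
w <- rep(1 / n, n)   # begin with equal weights (an unweighted dataset)
stumps <- list()
alpha <- numeric(0)

for (m in 1:3) {
  stump <- fit_stump(x, y, w)
  pred <- predict_stump(stump, x)
  alpha[m] <- 0.5 * log((1 - stump$err) / stump$err)  # this stump's vote
  w <- w * exp(-alpha[m] * y * pred)  # up-weight the misclassified examples
  w <- w / sum(w)                     # renormalize so the weights sum to 1
  stumps[[m]] <- stump
}

# Final prediction: the sign of the alpha-weighted vote
score <- Reduce(`+`, Map(function(s, a) a * predict_stump(s, x), stumps, alpha))
ensemble_pred <- sign(score)
```

No single stump can classify this data correctly, since the classes alternate along x, but the weighted vote of three stumps reproduces every training label.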

Though boosting principles can be applied to nearly any type of model, the principles are most commonly used with decision trees. We already used boosting in this way in Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, as a method to improve the performance of a C5.0 decision tree.

The AdaBoost.M1 algorithm provides an alternative tree-based implementation of AdaBoost for classification. Due to its similarity to the boosted trees we created earlier, AdaBoost.M1 is not covered here.

Note

The AdaBoost.M1 algorithm can be found in the adabag R package. For more information refer to adabag – an R package for classification with boosting and bagging, Journal of Statistical Software, Vol 54(2), pp. 1-35, by E. Alfaro, M. Gamez, and N. Garcia (2013).

Random forests

Another ensemble-based method called random forests (or decision tree forests) focuses only on ensembles of decision trees. This method was championed by Leo Breiman and Adele Cutler, and combines the base principles of bagging with random feature selection to add diversity to the decision tree models. After the ensemble of trees (the forest) is generated, the model uses a vote to combine the trees' predictions.

Note

For more detail on how random forests are constructed, refer to Random forests, Machine Learning, Vol. 45, pp. 5-32, by L. Breiman (2001).

Random forests combine versatility and power into a single machine learning approach. Because the ensemble uses only a small, random portion of the full feature set, random forests can handle extremely large datasets, where the so-called "curse of dimensionality" might cause other models to fail. At the same time, their error rates for most learning tasks are on par with nearly any other method.

Tip

Although the term "Random Forests" is trademarked by Breiman and Cutler (see http://www.stat.berkeley.edu/~breiman/RandomForests/ for details), the term is sometimes used colloquially to refer to any type of decision tree ensemble. A pedant would use the more general term "decision tree forests" except when referring to the algorithm by Breiman and Cutler.

The following table lists the general strengths and weaknesses of random forest models. It's worth noting that relative to other ensemble-based methods, random forests are quite competitive and offer key advantages relative to the competition. For instance, random forests tend to be easier to use and less prone to overfitting.

Strengths

  • An all-purpose model that performs well on most problems
  • Can handle noisy or missing data as well as categorical or continuous features
  • Selects only the most important features
  • Can be used on data with an extremely large number of features or examples

Weaknesses

  • Unlike a decision tree, the model is not easily interpretable
  • May require some work to tune the model to the data

Due to their power, versatility, and ease of use, random forests are quickly becoming one of the most popular machine learning methods. Later on in this chapter, we'll compare a random forest model head-to-head against the boosted C5.0 tree.

Training random forests

Though there are several packages to create random forests in R, the randomForest package is perhaps the implementation most faithful to the specification by Breiman and Cutler. An added benefit is that it is supported by caret for automated tuning. The syntax for training this model is as follows:

[Figure: Training random forests]

As noted previously, by default, the randomForest() function creates an ensemble of 500 trees that consider sqrt(p) random features at each split (where p is the number of features in the training dataset). Whether or not these parameters are appropriate depends on the nature of the learning task and training data. Generally, more complex learning problems and larger datasets (both more features as well as more examples) work better with a larger number of trees.

The goal of using a large number of trees is to train enough that each feature has a chance to appear in several models. This is the basis of the sqrt(p) default value for the mtry parameter; using this value limits the features sufficiently such that substantial random variation occurs from tree to tree. For example, since the credit data has 16 features, each split in a tree would be limited to choosing among sqrt(16) = 4 randomly selected features.
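The arithmetic behind this default is easy to check. In the sketch below, the feature names are placeholders rather than the actual column names of the credit data, and the random draw only mimics what happens at a single split.

```r
p <- 16                 # number of features in the credit data
mtry <- floor(sqrt(p))  # the sqrt(p) default used for classification

set.seed(300)
features <- paste0("feature_", seq_len(p))  # placeholder feature names
candidates <- sample(features, mtry)        # features considered at one split

mtry
candidates
```

A fresh random subset would be drawn at every split, which is what produces the variation from tree to tree.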

Let's see how the default randomForest() parameters work with the credit data. We'll train the model just as we have done with other learners (the set.seed() function ensures that the result can be repeated).

> library(randomForest)
> set.seed(300)
> rf <- randomForest(default ~ ., data = credit)

To look at a summary of the model's performance, we can simply type the resulting object's name:

> rf

Call:
 randomForest(formula = default ~ ., data = credit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of error rate: 23.8%
Confusion matrix:
     no yes class.error
no  640  60  0.08571429
yes 178 122  0.59333333 

As expected, the output notes that the random forest included 500 trees and tried 4 variables at each split. You might be alarmed at the seemingly poor resubstitution error according to the displayed confusion matrix: the error rate of 23.8 percent is far worse than the resubstitution error of any of the other ensemble methods so far. In fact, this confusion matrix does not show resubstitution error at all. Instead, it reflects the out-of-bag error rate (labeled OOB estimate of error rate), which, unlike resubstitution error, is an unbiased estimate of the test set error. This means that it should be a fairly reasonable estimate of future performance.
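The figures in this output can be reproduced from the confusion matrix with a little arithmetic. The sketch below copies the four counts from the output above (rows are actual classes, columns are predicted classes) and also computes the kappa statistic from them, using the usual observed-versus-expected agreement formula.

```r
# OOB confusion matrix from the output (rows = actual, columns = predicted)
oob <- matrix(c(640,  60,
                178, 122),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("no", "yes"),
                              predicted = c("no", "yes")))

p_o <- sum(diag(oob)) / sum(oob)             # observed agreement (accuracy)
oob_error <- 1 - p_o                         # overall OOB error rate
class_error <- 1 - diag(oob) / rowSums(oob)  # error rate within each class

# Kappa: agreement beyond what the row and column totals predict by chance
p_e <- sum(rowSums(oob) * colSums(oob)) / sum(oob)^2
kappa_oob <- (p_o - p_e) / (1 - p_e)

oob_error    # 0.238
class_error  # about 0.086 for no and 0.593 for yes
kappa_oob    # about 0.36
```

The per-class errors make the weakness clear: the forest misses well over half of the actual defaults, even though its overall error rate looks moderate.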

The out-of-bag estimate is computed during the construction of the random forest. Essentially, any example not selected for a single tree's bootstrap sample can be used as a way to test the model's performance on unseen data. At the end of the forest construction, the predictions for each example each time it was held out are tallied, and a vote is taken to determine the final prediction for the example. The total error rate of such predictions becomes the out-of-bag error rate.
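To get a feel for how many examples are held out of each tree, the following sketch draws a single bootstrap sample and counts the examples that were never selected. The sample size of 1,000 matches the credit data; everything else is illustrative.

```r
set.seed(300)
n <- 1000  # number of training examples

boot_rows <- sample(n, size = n, replace = TRUE)  # one tree's bootstrap sample
oob_rows <- setdiff(seq_len(n), boot_rows)        # examples this tree never saw

oob_fraction <- length(oob_rows) / n  # observed out-of-bag fraction
expected <- (1 - 1 / n)^n             # chance a given example is never drawn

oob_fraction
expected  # close to exp(-1), roughly 0.37
```

With roughly a third of the examples held out of each tree, a 500-tree forest gives every example many out-of-bag votes.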

Evaluating random forest performance

As mentioned previously, the randomForest() function is also supported by caret, which allows us to optimize the model while at the same time calculating performance measures beyond the out-of-bag error rate. To make things interesting, let's compare an auto-tuned random forest to the best auto-tuned boosted C5.0 model we've been working on. We'll treat this experiment as if we were hoping to identify a candidate model for submission to a machine learning competition.

We must first load caret and set our training control options. For the most accurate comparison of model performance, we'll use repeated 10-fold cross-validation: 10 times 10-fold CV. This means that the models will take much longer and be more computationally intensive to evaluate, but since this is our final comparison, we should be very sure that we're making the right choice; the winner of this showdown will be our only entry into the machine learning competition.

> library(caret)
> ctrl <- trainControl(method = "repeatedcv",
                       number = 10, repeats = 10)

Next, we'll set up the tuning grid for the random forest. The only tuning parameter for this model is mtry, which defines how many features are randomly selected at each split. By default, we know that the random forest will use sqrt(16) = 4 features. To be thorough, we'll also test values half of that, twice that, as well as the full set of features. Thus, we need to create a grid with values of 2, 4, 8, and 16 as follows:

> grid_rf <- expand.grid(.mtry = c(2, 4, 8, 16))
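As a rough sense of the work this grid implies under 10 times 10-fold CV, the sketch below counts the model fits involved. It assumes the default of 500 trees per forest and ignores the one additional fit that caret performs on the full dataset once the best value is chosen.

```r
grid_rf <- expand.grid(.mtry = c(2, 4, 8, 16))

n_candidates <- nrow(grid_rf)  # 4 candidate values of mtry
n_resamples <- 10 * 10         # 10 folds, repeated 10 times
n_fits <- n_candidates * n_resamples

n_fits        # forests trained during resampling
n_fits * 500  # individual trees, assuming 500 trees per forest
```

This is why tuning a random forest under repeated cross-validation can take noticeably longer than training a single forest.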

Tip

A random forest that considers the full set of features at each split is essentially the same as a bagged decision tree model.

We can supply the resulting grid to the train() function with the ctrl object as follows. We'll use the kappa metric to select the best model.

> set.seed(300)
> m_rf <- train(default ~ ., data = credit, method = "rf",
                metric = "Kappa", trControl = ctrl,
                tuneGrid = grid_rf)

The preceding command may take some time to complete as it has quite a bit of work to do! When it finishes, we'll compare that to a boosted tree using 10, 20, 30, and 40 iterations:

> grid_c50 <- expand.grid(.model = "tree",
                          .trials = c(10, 20, 30, 40),
                          .winnow = FALSE)
> set.seed(300)
> m_c50 <- train(default ~ ., data = credit, method = "C5.0",
                 metric = "Kappa", trControl = ctrl,
                 tuneGrid = grid_c50)

When the C5.0 decision tree finally completes, we can compare the two approaches side-by-side. For the random forest model the results are:

> m_rf

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.725     0.128  0.0169       0.0636  
  4     0.75      0.293  0.0299       0.0877  
  8     0.754     0.338  0.0311       0.0835  
  16    0.756     0.361  0.0338       0.0889  

For the boosted C5.0 model the results are:

> m_c50

Resampling results across tuning parameters:

  trials  Accuracy  Kappa  Accuracy SD  Kappa SD
  10      0.732     0.322  0.0402       0.0952  
  20      0.734     0.327  0.0403       0.0971  
  30      0.738     0.334  0.0367       0.0894  
  40      0.739     0.334  0.0393       0.0975  

With a kappa of 0.361, the random forest model with mtry = 16 was the winner among these eight models. Its kappa was somewhat higher than that of the best boosted C5.0 model, which had a kappa of 0.334. Based on these results, we would submit the random forest as our final model. Without actually evaluating the model on the competition data, we have no way of knowing for sure whether it will end up winning, but given our performance estimates, it's the safer bet. With a bit of luck, perhaps we'll come away with the prize.
