Improving model performance with meta-learning

As an alternative to increasing the performance of a single model, it is possible to combine several models to form a powerful team. Just as the best sports teams have players with complementary rather than overlapping skillsets, some of the best machine learning algorithms utilize teams of complementary models. Since a model brings a unique bias to a learning task, it may readily learn one subset of examples, but have trouble with another. Therefore, by intelligently using the talents of several diverse team members, it is possible to create a strong team of multiple weak learners.

This technique of combining and managing the predictions of multiple models belongs to a wider family of meta-learning methods, that is, techniques that involve learning how to learn. This includes anything from simple algorithms that gradually improve performance by iterating over design decisions (for instance, the automated parameter tuning used earlier in this chapter) to highly complex algorithms that use concepts borrowed from evolutionary biology and genetics to modify themselves and adapt to learning tasks.

For the remainder of this chapter, we'll focus on meta-learning only as it pertains to modeling a relationship between the predictions of several models and the desired outcome. The teamwork-based techniques covered here are powerful and are widely used to build more effective classifiers.

Understanding ensembles

Suppose you were a contestant on a television trivia show that allowed you to choose a panel of five friends to assist you with answering the final question for the million-dollar prize. Most people would try to stack the panel with a diverse set of subject matter experts. A panel containing professors of literature, science, history, and art, along with a current pop-culture expert would be a safely well-rounded group. Given their breadth of knowledge, it would be unlikely to find a question that stumps the group.

The meta-learning approach that utilizes a similar principle of creating a varied team of experts is known as an ensemble. All the ensemble methods are based on the idea that by combining multiple weaker learners, a stronger learner is created. The various ensemble methods can be distinguished, in large part, by the answers to these two questions:

  • How are the weak learning models chosen and/or constructed?
  • How are the weak learners' predictions combined to make a single final prediction?

When answering these questions, it can be helpful to imagine the ensemble in terms of the following process diagram; nearly all ensemble approaches follow this pattern:

[Figure: ensemble process diagram]

First, input training data is used to build a number of models. The allocation function dictates how much of the training data each model receives. Do they each receive the full training dataset or merely a sample? Do they each receive every feature or a subset?

Although the ideal ensemble includes a diverse set of models, the allocation function can increase diversity by artificially varying the input data to bias the resulting learners, even if they are the same type. For instance, it might use bootstrap sampling to construct unique training datasets or pass on a different subset of features or examples to each model. On the other hand, if the ensemble already includes a diverse set of algorithms—such as a neural network, a decision tree, and a k-NN classifier—the allocation function might pass the data on to each algorithm relatively unchanged.
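To make the allocation function concrete, here is a minimal sketch of one way it might vary the input for a single model: a bootstrap sample of the rows combined with a random subset of the features. The built-in iris data, the choice of two features, and the object names are assumptions made purely for illustration; they are not part of the credit example.

```r
# Sketch of an allocation step for ONE model in an ensemble.
# Assumption: iris stands in for the training data.
set.seed(300)
n <- nrow(iris)
features <- setdiff(names(iris), "Species")

# Bootstrap sample: draw n row indices with replacement
boot_rows <- sample(n, size = n, replace = TRUE)

# Random feature subset: pass only 2 of the 4 predictors to this model
feat_subset <- sample(features, size = 2)

# The training data this particular model would receive
train_i <- iris[boot_rows, c(feat_subset, "Species")]
dim(train_i)
```

Repeating these steps with different random draws gives each model a slightly different view of the data, which is the source of the diversity described above.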

After the models are constructed, they can be used to generate a set of predictions, which must be managed in some way. The combination function governs how disagreements among the predictions are reconciled. For example, the ensemble might use a majority vote to determine the final prediction, or it could use a more complex strategy such as weighting each model's votes based on its prior performance.
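The two combination strategies just mentioned can be sketched in a few lines of base R. The votes and the per-model weights below are made-up values for three hypothetical models (m1, m2, and m3); in practice the weights would come from some measure of each model's prior performance.

```r
# Hypothetical votes from three models on four examples
votes <- data.frame(
  m1 = c("yes", "no",  "yes", "no"),
  m2 = c("yes", "yes", "no",  "no"),
  m3 = c("no",  "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Majority vote: the most frequent label in each row wins
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

# Weighted vote: each model's vote counts in proportion to its weight
weights <- c(m1 = 0.5, m2 = 0.3, m3 = 0.9)  # assumed prior performance
weighted <- apply(votes, 1, function(v) {
  yes_score <- sum(weights[v == "yes"])
  no_score  <- sum(weights[v == "no"])
  if (yes_score > no_score) "yes" else "no"
})

majority  # "yes" "yes" "yes" "no"
weighted  # "no"  "yes" "yes" "no"
```

Note how the first example flips from yes to no under weighting: the single, heavily weighted m3 outvotes the two weaker models that disagree with it.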

Some ensembles even utilize another model to learn a combination function from various combinations of predictions. For example, suppose that when M1 and M2 both vote yes, the actual class value is usually no. In this case, the ensemble could learn to ignore the vote of M1 and M2 when they agree. This process of using the predictions of several models to train a final arbiter model is known as stacking.
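A rough sketch of stacking follows, using simulated data and base R only. Everything here is an assumption for illustration: the two base learners are simple logistic regressions that each see a single feature, the arbiter is another logistic regression, and the arbiter is trained on predictions for rows the base learners did not see, which is one common way to set this up.

```r
# Simulated two-feature classification problem (illustrative only)
set.seed(1)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, "yes", "no"))
d  <- data.frame(x1, x2, y)

base_idx <- 1:200    # rows used to train the base learners
meta_idx <- 201:300  # rows used to train the arbiter
test_idx <- 301:400  # rows held out for evaluation

# Two deliberately limited base learners, each seeing one feature
m1 <- glm(y ~ x1, data = d[base_idx, ], family = binomial)
m2 <- glm(y ~ x2, data = d[base_idx, ], family = binomial)

# Turn base-model predictions into the arbiter's input features
level1 <- function(rows) {
  data.frame(p1 = predict(m1, d[rows, ], type = "response"),
             p2 = predict(m2, d[rows, ], type = "response"),
             y  = d$y[rows])
}

# The arbiter learns how to combine the two sets of predictions
arbiter <- glm(y ~ p1 + p2, data = level1(meta_idx), family = binomial)

stack_prob <- predict(arbiter, level1(test_idx), type = "response")
stack_pred <- ifelse(stack_prob > 0.5, "yes", "no")
stack_acc  <- mean(stack_pred == d$y[test_idx])
stack_acc
```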

One of the benefits of using ensembles is that they may allow you to spend less time in pursuit of a single best model. Instead, you can train a number of reasonably strong candidates and combine them. Yet, convenience isn't the only reason why ensemble-based methods continue to rack up wins in machine learning competitions; ensembles also offer a number of performance advantages over single models:

  • Better generalizability to future problems: As the opinions of several learners are incorporated into a single final prediction, no single bias is able to dominate. This reduces the chance of overfitting to a learning task.
  • Improved performance on massive or minuscule datasets: Many models run into memory or complexity limits when an extremely large set of features or examples is used, making it more efficient to train several small models than a single full model. Conversely, ensembles also do well on the smallest datasets because resampling methods such as bootstrapping are inherently a part of many ensemble designs. Perhaps most importantly, it is often possible to train an ensemble in parallel using distributed computing methods.
  • The ability to synthesize data from distinct domains: Since there is no one-size-fits-all learning algorithm, the ensemble's ability to incorporate evidence from multiple types of learners is increasingly important as complex phenomena rely on data drawn from diverse domains.
  • A more nuanced understanding of difficult learning tasks: Real-world phenomena are often extremely complex with many interacting intricacies. Models that divide the task into smaller portions are likely to more accurately capture subtle patterns that a single global model might miss.

None of these benefits would be very helpful if you weren't able to easily apply ensemble methods in R, and there are many packages available to do just that. Let's take a look at several of the most popular ensemble methods and how they can be used to improve the performance of the credit model we've been working on.

Bagging

One of the first ensemble methods to gain widespread acceptance used a technique called bootstrap aggregating or bagging for short. As described by Leo Breiman in 1994, bagging generates a number of training datasets by bootstrap sampling the original training data. These datasets are then used to generate a set of models using a single learning algorithm. The models' predictions are combined using voting (for classification) or averaging (for numeric prediction).

Note

For additional information on bagging, refer to Breiman L. Bagging predictors. Machine Learning. 1996; 24:123-140.

Although bagging is a relatively simple ensemble, it can perform quite well as long as it is used with relatively unstable learners, that is, those generating models that tend to change substantially when the input data changes only slightly. Unstable models are essential in order to ensure the ensemble's diversity in spite of only minor variations between the bootstrap training datasets. For this reason, bagging is often used with decision trees, which have the tendency to vary dramatically given minor changes in the input data.
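To see what bagging does under the hood, the procedure can be written out by hand. This is only a sketch under stated assumptions: it uses the rpart package (a recommended package included with standard R installations) as the unstable base learner and the built-in iris data in place of the credit data, and it evaluates on the training data purely to show the mechanics.

```r
library(rpart)  # decision tree learner; assumed to be installed

set.seed(300)
n     <- nrow(iris)
nbagg <- 25  # number of trees in the ensemble

# Step 1: fit one tree per bootstrap sample of the training data
trees <- lapply(seq_len(nbagg), function(i) {
  boot_rows <- sample(n, replace = TRUE)
  rpart(Species ~ ., data = iris[boot_rows, ], method = "class")
})

# Step 2: collect every tree's vote for every example
# (rows are examples, columns are trees)
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, iris, type = "class"))
})

# Step 3: combine the votes with a simple majority
bag_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Resubstitution accuracy; optimistic, shown only as a sanity check
bag_acc <- mean(bag_pred == iris$Species)
bag_acc
```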

The ipred package offers a classic implementation of bagged decision trees. To train the model, the bagging() function works similarly to many of the models used previously. The nbagg parameter is used to control the number of decision trees voting in the ensemble (with a default value of 25). Depending on the difficulty of the learning task and the amount of training data, increasing this number may improve the model's performance up to a limit. The downside is that this comes at additional computational expense, because a large number of trees may take some time to train.

After installing the ipred package, we can create the ensemble as follows. We'll stick to the default value of 25 decision trees:

> library(ipred)
> set.seed(300)
> mybag <- bagging(default ~ ., data = credit, nbagg = 25)

The resulting model works as expected with the predict() function:

> credit_pred <- predict(mybag, credit)
> table(credit_pred, credit$default)
           
credit_pred  no yes
        no  699   2
        yes   1 298

Given the preceding results, the model seems to have fit the training data extremely well. To see how this translates into future performance, we can use the bagged trees with 10-fold CV using the train() function in the caret package. Note that the method name for the ipred bagged trees function is treebag:

> library(caret)
> set.seed(300)
> ctrl <- trainControl(method = "cv", number = 10)
> train(default ~ ., data = credit, method = "treebag",
         trControl = ctrl)

Bagged CART 

1000 samples
  16 predictor
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 

Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 

Resampling results

  Accuracy  Kappa      Accuracy SD  Kappa SD  
  0.735     0.3297726  0.03439961   0.08590462

The kappa statistic of 0.33 for this model suggests that the bagged tree model performs at least as well as the best C5.0 decision tree we tuned earlier in this chapter. This illustrates the power of ensemble methods; a set of simple learners working together can outperform very sophisticated models.

To get beyond bags of decision trees, the caret package also provides a more general bag() function. It includes native support for a handful of models, though it can be adapted to other types with a bit of additional effort. The bag() function uses a control object to configure the bagging process. It requires the specification of three functions: one for fitting the model, one for making predictions, and one for aggregating the votes.

For example, suppose we wanted to create a bagged support vector machine model, using the ksvm() function in the kernlab package we used in Chapter 7, Black Box Methods – Neural Networks and Support Vector Machines. The bag() function requires us to provide functionality for training the SVMs, making predictions, and counting votes.

Rather than writing these ourselves, the caret package's built-in svmBag list object supplies three functions we can use for this purpose:

> str(svmBag)
List of 3
 $ fit      :function (x, y, ...)  
 $ pred     :function (object, x)  
 $ aggregate:function (x, type = "class")

By looking at the svmBag$fit function, we see that it simply calls the ksvm() function from the kernlab package and returns the result:

> svmBag$fit
function (x, y, ...) 
{
    library(kernlab)
    out <- ksvm(as.matrix(x), y, prob.model = is.factor(y), ...)
    out
}
<environment: namespace:caret>

The pred and aggregate functions for svmBag are similarly straightforward. By studying these functions and creating your own in the same format, it is possible to use bagging with any machine learning algorithm you would like.

Tip

The caret package also includes example objects for bags of naive Bayes models (nbBag), decision trees (ctreeBag), and neural networks (nnetBag).

Applying the three functions in the svmBag list, we can create a bagging control object:

> bagctrl <- bagControl(fit = svmBag$fit, 
                        predict = svmBag$pred,
                        aggregate = svmBag$aggregate)

By using this with the train() function and the training control object (ctrl), defined earlier, we can evaluate the bagged SVM model as follows (note that the kernlab package is required for this to work; you will need to install it if you have not done so previously):

> set.seed(300)
> svmbag <- train(default ~ ., data = credit, method = "bag",
                  trControl = ctrl, bagControl = bagctrl)
> svmbag

Bagged Model
1000 samples
  16 predictors
   2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validation (10 fold) 

Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 

Resampling results

  Accuracy  Kappa      Accuracy SD  Kappa SD 
  0.728     0.2929505  0.04442222   0.1318101

Tuning parameter 'vars' was held constant at a value of 35

Given that the kappa statistic is below 0.30, it seems that the bagged SVM model performs worse than the bagged decision tree model. It's worth pointing out that the standard deviation of the kappa statistic is fairly large compared to the bagged decision tree model. This suggests that the performance varies substantially among the folds in the cross-validation. Such variation may imply that the performance might be improved further by upping the number of models in the ensemble.

Boosting

Another common ensemble-based method is called boosting because it boosts the performance of weak learners to attain the performance of stronger learners. This method is based largely on the work of Robert Schapire and Yoav Freund, who have published extensively on the topic.

Note

For additional information on boosting, refer to Schapire RE, Freund Y. Boosting: Foundations and Algorithms. Cambridge, MA, The MIT Press; 2012.

Similar to bagging, boosting uses ensembles of models trained on resampled data and a vote to determine the final prediction. There are two key distinctions. First, the resampled datasets in boosting are constructed specifically to generate complementary learners. Second, rather than giving each learner an equal vote, boosting gives each learner's vote a weight based on its past performance. Models that perform better have greater influence over the ensemble's final prediction.

Boosting often results in performance that is considerably better than, and in principle no worse than, that of the best model in the ensemble. Since the models in the ensemble are built to be complementary, it is possible to increase ensemble performance to an arbitrary threshold simply by adding additional classifiers to the group, assuming that each classifier performs better than random chance. Given the obvious utility of this finding, boosting is thought to be one of the most significant discoveries in machine learning.

Tip

Although boosting can create a model that meets an arbitrarily low error rate, this may not always be reasonable in practice. For one, the performance gains are incrementally smaller as additional learners are added, making some thresholds practically infeasible. Additionally, the pursuit of pure accuracy may result in the model being overfitted to the training data and not generalizable to unseen data.

A boosting algorithm called AdaBoost or adaptive boosting was proposed by Freund and Schapire in 1997. The algorithm is based on the idea of generating weak learners that iteratively learn a larger portion of the difficult-to-classify examples by paying more attention (that is, giving more weight) to frequently misclassified examples.

Beginning from an unweighted dataset, the first classifier attempts to model the outcome. Examples that the classifier predicted correctly will be less likely to appear in the training dataset for the following classifier, and conversely, the difficult-to-classify examples will appear more frequently. As additional rounds of weak learners are added, they are trained on data with successively more difficult examples. The process continues until the desired overall error rate is reached or performance no longer improves. At that point, each classifier's vote is weighted according to its accuracy on the training data on which it was built.
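A single round of this reweighting can be sketched as follows. The sketch follows the AdaBoost.M1 update as published by Freund and Schapire, which is an assumption on our part: particular implementations may scale the coefficients differently. The ten examples and the two misclassified positions are invented for illustration.

```r
n <- 10
w <- rep(1 / n, n)           # round 1 starts with equal weights

# Suppose the first weak learner misclassified examples 3 and 7
miss <- rep(FALSE, n)
miss[c(3, 7)] <- TRUE

err  <- sum(w[miss])         # weighted error rate: 0.2
beta <- err / (1 - err)      # 0.25; below 1 when better than chance

# Shrink the weight of correctly classified examples, then renormalize
w_new <- w * ifelse(miss, 1, beta)
w_new <- w_new / sum(w_new)

# This learner's say in the final vote grows as its error shrinks
vote_weight <- log(1 / beta)

round(w_new, 4)              # misclassified examples now carry 0.25 each
vote_weight
```

After the update, the two misclassified examples together hold half of the total weight, so the next learner, whether it uses the weights directly or as sampling probabilities, is pushed to concentrate on them.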

Though boosting principles can be applied to nearly any type of model, the principles are most commonly used with decision trees. We already used boosting in this way in Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, as a method to improve the performance of a C5.0 decision tree.

The AdaBoost.M1 algorithm provides another tree-based implementation of AdaBoost for classification. The AdaBoost.M1 algorithm can be found in the adabag package.

Note

For more information about the adabag package, refer to Alfaro E, Gamez M, Garcia N. adabag – an R package for classification with boosting and bagging. Journal of Statistical Software. 2013; 54:1-35.

Let's create an AdaBoost.M1 classifier for the credit data. The general syntax for this algorithm is similar to other modeling techniques:

> library(adabag)
> set.seed(300)
> m_adaboost <- boosting(default ~ ., data = credit)

As usual, the predict() function is applied to the resulting object to make predictions:

> p_adaboost <- predict(m_adaboost, credit)

Departing from convention, rather than returning a vector of predictions, this returns an object with information about the model. The predictions are stored in a sub-object called class:

> head(p_adaboost$class)
[1] "no"  "yes" "no"  "no"  "yes" "no"

A confusion matrix can be found in the confusion sub-object:

> p_adaboost$confusion
               Observed Class
Predicted Class  no yes
            no  700   0
            yes   0 300

Did you notice that the AdaBoost model made no mistakes? Before you get your hopes up, remember that the preceding confusion matrix is based on the model's performance on the training data. Since boosting allows the error rate to be reduced to an arbitrarily low level, the learner simply continued until it made no more errors. This likely resulted in overfitting on the training dataset.

For a more accurate assessment of performance on unseen data, we need to use another evaluation method. The adabag package provides a simple function to use 10-fold CV:

> set.seed(300)
> adaboost_cv <- boosting.cv(default ~ ., data = credit)

Depending on your computer's capabilities, this may take some time to run, during which it will log each iteration to screen. After it completes, we can view a more reasonable confusion matrix:

> adaboost_cv$confusion
               Observed Class
Predicted Class  no yes
            no  594 151
            yes 106 149

We can find the kappa statistic using the vcd package as described in Chapter 10, Evaluating Model Performance:

> library(vcd)
> Kappa(adaboost_cv$confusion)
               value       ASE
Unweighted 0.3606965 0.0323002
Weighted   0.3606965 0.0323002
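As a check on what Kappa() reports, the unweighted statistic can be recomputed by hand from the cross-validated confusion matrix shown previously. The matrix is typed in manually here so that the sketch runs without the adabag or vcd packages; the object names are our own.

```r
# Cross-validated confusion matrix from the output above
# (rows are predicted classes, columns are observed classes)
cm <- matrix(c(594, 106, 151, 149), nrow = 2,
             dimnames = list(predicted = c("no", "yes"),
                             observed  = c("no", "yes")))

# Observed agreement: proportion of examples on the diagonal
pr_a <- sum(diag(cm)) / sum(cm)

# Expected agreement by chance, from the row and column totals
pr_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2

kappa <- (pr_a - pr_e) / (1 - pr_e)
kappa  # approximately 0.3607
```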

With a kappa of about 0.36, this is our best-performing credit scoring model yet. Let's see how it compares to one last ensemble method.

Tip

The AdaBoost.M1 algorithm can be tuned in caret by specifying method = "AdaBoost.M1".

Random forests

Another ensemble-based method called random forests (or decision tree forests) focuses only on ensembles of decision trees. This method was championed by Leo Breiman and Adele Cutler, and combines the base principles of bagging with random feature selection to add additional diversity to the decision tree models. After the ensemble of trees (the forest) is generated, the model uses a vote to combine the trees' predictions.

Note

For more detail on how random forests are constructed, refer to Breiman L. Random Forests. Machine Learning. 2001; 45:5-32.

Random forests combine versatility and power into a single machine learning approach. As the ensemble uses only a small, random portion of the full feature set, random forests can handle extremely large datasets, where the so-called "curse of dimensionality" might cause other models to fail. At the same time, their error rates for most learning tasks are on par with nearly any other method.

Tip

Although the term "Random Forests" is trademarked by Breiman and Cutler, the term is sometimes used colloquially to refer to any type of decision tree ensemble. A pedant would use the more general term "decision tree forests" except when referring to the specific implementation by Breiman and Cutler.

It's worth noting that random forests are quite competitive with other ensemble-based methods and offer some key advantages. For instance, random forests tend to be easier to use and less prone to overfitting. The following table lists the general strengths and weaknesses of random forest models:

Strengths

  • An all-purpose model that performs well on most problems
  • Can handle noisy or missing data as well as categorical or continuous features
  • Selects only the most important features
  • Can be used on data with an extremely large number of features or examples

Weaknesses

  • Unlike a decision tree, the model is not easily interpretable
  • May require some work to tune the model to the data

Due to their power, versatility, and ease of use, random forests are quickly becoming one of the most popular machine learning methods. Later on in this chapter, we'll compare a random forest model head-to-head against the boosted C5.0 tree.

Training random forests

Though there are several packages to create random forests in R, the randomForest package is perhaps the implementation that is most faithful to the specification by Breiman and Cutler, and is also supported by caret for automated tuning. The syntax for training this model is as follows:

[Figure: randomForest() syntax box]
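A minimal sketch of the usual calling pattern is shown here. It assumes the randomForest package is installed (the code skips quietly if it is not), and it uses the built-in iris data and an arbitrary 100-row training split as stand-ins; neither is part of the credit example.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)

  set.seed(300)
  train_rows <- sample(nrow(iris), 100)  # arbitrary split for illustration

  # Building the classifier: predictors in x, class labels in y
  m <- randomForest(x = iris[train_rows, 1:4],
                    y = iris$Species[train_rows],
                    ntree = 500,  # number of trees to grow
                    mtry = 2)     # features considered at each split

  # Making predictions on rows the forest has not seen
  p <- predict(m, iris[-train_rows, 1:4], type = "response")
  print(table(p, iris$Species[-train_rows]))
}
```

The formula interface used elsewhere in this chapter, such as randomForest(default ~ ., data = credit), is an alternative way to specify the same model.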

By default, the randomForest() function creates an ensemble of 500 trees that consider sqrt(p) random features at each split, where p is the number of features in the training dataset and sqrt() refers to R's square root function. Whether or not these default parameters are appropriate depends on the nature of the learning task and training data. Generally, more complex learning problems and larger datasets (either more features or more examples) work better with a larger number of trees, though this needs to be balanced with the computational expense of training more trees.

The goal of using a large number of trees is to train enough so that each feature has a chance to appear in several models. This is the basis of the sqrt(p) default value for the mtry parameter; using this value limits the features sufficiently so that substantial random variation occurs from tree to tree. For example, since the credit data has 16 features, each split would consider only four randomly selected features.
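The arithmetic behind this default is easy to verify. In the following sketch, the feature indices simply stand in for the 16 credit predictors, and the random draw imitates what happens at one split; it is an illustration, not the package's internal code.

```r
p    <- 16               # number of features in the credit data
mtry <- floor(sqrt(p))   # default for classification forests
mtry                     # 4

# At any one split, only mtry randomly chosen features are candidates
set.seed(300)
candidates <- sample(seq_len(p), mtry)
candidates
```

Because a fresh draw is made at every split, a tree as a whole can still end up using many more than four features.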

Let's see how the default randomForest() parameters work with the credit data. We'll train the model just as we did with other learners. Again, the set.seed() function ensures that the result can be replicated:

> library(randomForest)
> set.seed(300)
> rf <- randomForest(default ~ ., data = credit)

To look at a summary of the model's performance, we can simply type the resulting object's name:

> rf

Call:
 randomForest(formula = default ~ ., data = credit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of error rate: 23.8%
Confusion matrix:
     no yes class.error
no  640  60  0.08571429
yes 178 122  0.59333333

The output notes that the random forest included 500 trees and tried four variables at each split, just as we expected. At first glance, you might be alarmed at the seemingly poor performance according to the confusion matrix: the error rate of 23.8 percent is far worse than the resubstitution error of any of the other ensemble methods so far. However, this confusion matrix does not show resubstitution error. Instead, it reflects the out-of-bag error rate (listed in the output as OOB estimate of error rate), which, unlike resubstitution error, is an unbiased estimate of the test set error. This means that it should be a fairly reasonable estimate of future performance.
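The figures in this output are straightforward to reproduce from the confusion matrix itself. The matrix is entered by hand in the following sketch so that it runs without the fitted model; the object names are our own.

```r
# Out-of-bag confusion matrix from the output above
# (rows are actual classes, columns are predicted classes)
oob <- matrix(c(640, 178, 60, 122), nrow = 2,
              dimnames = list(actual    = c("no", "yes"),
                              predicted = c("no", "yes")))

# Overall OOB error: share of examples off the diagonal
oob_error <- 1 - sum(diag(oob)) / sum(oob)
oob_error     # 0.238

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)
class_error   # about 0.086 for no, about 0.593 for yes
```

The per-class figures show that most of the error comes from the yes class (the defaults), which the forest misses more often than not.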

The out-of-bag estimate is computed during the construction of the random forest. Essentially, any example not selected for a single tree's bootstrap sample can be used to test the model's performance on unseen data. At the end of the forest construction, the predictions for each example each time it was held out are tallied, and a vote is taken to determine the final prediction for the example. The total error rate of such predictions becomes the out-of-bag error rate.
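To get a feel for how much data is available for this purpose, the following sketch simulates the bootstrap draws for a forest of 500 trees on 1,000 examples and counts how often each example is left out. No trees are actually built; only the sampling is simulated, and the object names are our own.

```r
set.seed(300)
n     <- 1000  # number of training examples
ntree <- 500   # number of trees in the forest

# For each tree, flag which examples landed in its bootstrap sample
# (result is an n x ntree logical matrix)
in_bag <- replicate(ntree,
                    tabulate(sample(n, replace = TRUE), nbins = n) > 0)

# Share of example/tree pairs where the example was out-of-bag
oob_fraction <- mean(!in_bag)
oob_fraction   # close to exp(-1), or roughly 0.37

# Number of trees for which each example was out-of-bag
times_oob <- rowSums(!in_bag)
summary(times_oob)
```

Each example is out-of-bag for roughly a third of the trees, so its out-of-bag prediction is itself the vote of a sizable sub-forest that never saw it.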

Evaluating random forest performance

As mentioned previously, the randomForest() function is supported by caret, which allows us to optimize the model while, at the same time, calculating performance measures beyond the out-of-bag error rate. To make things interesting, let's compare an auto-tuned random forest to the best auto-tuned boosted C5.0 model we've developed. We'll treat this experiment as if we were hoping to identify a candidate model for submission to a machine learning competition.

We must first load caret and set our training control options. For the most accurate comparison of model performance, we'll use repeated 10-fold cross-validation, or 10-fold CV repeated 10 times. This means that the models will take a much longer time to build and will be more computationally intensive to evaluate, but since this is our final comparison we should be very sure that we're making the right choice; the winner of this showdown will be our only entry into the machine learning competition.

> library(caret)
> ctrl <- trainControl(method = "repeatedcv",
                       number = 10, repeats = 10)

Next, we'll set up the tuning grid for the random forest. The only tuning parameter for this model is mtry, which defines how many features are randomly selected at each split. By default, we know that the random forest will use sqrt(16), or four features at each split. To be thorough, we'll also test values half of that, twice that, as well as the full set of 16 features. Thus, we need to create a grid with values of 2, 4, 8, and 16 as follows:

> grid_rf <- expand.grid(.mtry = c(2, 4, 8, 16))

Tip

A random forest that considers the full set of features at each split is essentially the same as a bagged decision tree model.

We can supply the resulting grid to the train() function with the ctrl object as follows. We'll use the kappa metric to select the best model:

> set.seed(300)
> m_rf <- train(default ~ ., data = credit, method = "rf",
                metric = "Kappa", trControl = ctrl,
                tuneGrid = grid_rf)

The preceding command may take some time to complete as it has quite a bit of work to do! When it finishes, we'll compare that to a boosted tree using 10, 20, 30, and 40 iterations:

> grid_c50 <- expand.grid(.model = "tree",
                          .trials = c(10, 20, 30, 40),
                          .winnow = FALSE)
> set.seed(300)
> m_c50 <- train(default ~ ., data = credit, method = "C5.0",
                 metric = "Kappa", trControl = ctrl,
                 tuneGrid = grid_c50)

When the C5.0 decision tree finally completes, we can compare the two approaches side-by-side. For the random forest model, the results are:

> m_rf

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa      Accuracy SD  Kappa SD  
   2    0.7247    0.1284142  0.01690466   0.06364740
   4    0.7499    0.2933332  0.02989865   0.08768815
   8    0.7539    0.3379986  0.03107160   0.08353988
  16    0.7556    0.3613151  0.03379439   0.08891300

For the boosted C5.0 model, the results are:

> m_c50

Resampling results across tuning parameters:

  trials  Accuracy  Kappa      Accuracy SD  Kappa SD  
  10      0.7325    0.3215655  0.04021093   0.09519817
  20      0.7343    0.3268052  0.04033333   0.09711408
  30      0.7381    0.3343137  0.03672709   0.08942323
  40      0.7388    0.3335082  0.03934514   0.09746073

With a kappa of about 0.3613, the random forest model with mtry = 16 was the winner among these eight models. It was higher than the best C5.0 decision tree, which had a kappa of about 0.334, and only marginally higher than the AdaBoost.M1 model, which had a kappa of about 0.3607. Based on these results, we would submit the random forest as our final model. Without actually evaluating the model on the competition data, we have no way of knowing for sure whether it will end up winning, but given our performance estimates, it's the safer bet. With a bit of luck, perhaps we'll come away with the prize.
