Improving model performance with meta-learning

As an alternative to increasing the performance of a single model, it is possible to combine several models to form a powerful team. Just as the best sports teams have players with complementary rather than overlapping skillsets, some of the best machine learning algorithms utilize teams of complementary models. Since each model brings a unique bias to a learning task, it may readily learn one subset of examples but have trouble with another. Therefore, by intelligently using the talents of several diverse team members, it is possible to create a strong team of multiple weak learners.

This technique of combining and managing the predictions of multiple models falls into a wider set of meta-learning methods, which are techniques that involve learning how to learn. This includes anything from simple algorithms that gradually improve performance by iterating over design decisions—for instance, the automated parameter tuning used earlier in this chapter—to highly complex algorithms that use concepts borrowed from evolutionary biology and genetics for self-modification and adaptation to learning tasks.

For the remainder of this chapter, we'll focus on meta-learning only as it pertains to modeling a relationship between the predictions of several models and the desired outcome. The teamwork-based techniques covered here are quite powerful and are often used to build more effective classifiers.

Understanding ensembles

Suppose you were a contestant on a television trivia show that allowed you to choose a panel of five friends to assist you with answering the final question for the million-dollar prize. Most people would try to stack the panel with a diverse set of subject matter experts. A panel containing professors of literature, science, history, and art, along with a current pop-culture expert would be safely well rounded. Given their breadth of knowledge, it would be unlikely that a question would stump the group.

The meta-learning approach that utilizes a similar principle of creating a varied team of experts is known as an ensemble. All ensemble methods are based on the idea that by combining multiple weaker learners, a stronger learner is created. The various ensemble methods can be distinguished, in large part, by the answers to two questions:

  • How are the weak learning models chosen and/or constructed?
  • How are the weak learners' predictions combined to make a single final prediction?

When answering these questions, it can be helpful to imagine the ensemble in terms of the following process diagram; nearly all ensemble approaches follow this pattern:

Figure 11.1: Ensembles combine multiple weaker models into a single stronger model

First, input training data is used to build a number of models. The allocation function dictates how much of the training data each model receives. Do they each receive the full training dataset or merely a sample? Do they each receive every feature or a subset?

Although the ideal ensemble includes a diverse set of models, the allocation function can increase diversity by artificially varying the input data to bias the resulting learners, even if they are the same type. For instance, in an ensemble of decision trees, the allocation function might use bootstrap sampling to construct unique training datasets for each tree, or it may pass each one a different subset of features. On the other hand, if the ensemble already includes a diverse set of algorithms—such as a neural network, a decision tree, and a k-NN classifier—then the allocation function might pass the data on to each algorithm relatively unchanged.
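
To make the allocation function concrete, the following is a minimal sketch under stated assumptions: the allocate() helper is hypothetical (not part of any ensemble package), and it simply draws a bootstrap sample of the rows and a random subset of the predictors from the credit data used throughout this chapter:

allocate <- function(data, target, n_features) {
  rows  <- sample(nrow(data), replace = TRUE)    # bootstrap sample of the examples
  preds <- setdiff(names(data), target)
  feats <- sample(preds, n_features)             # random subset of the features
  data[rows, c(target, feats)]
}

# three varied training datasets for three ensemble members
sets <- lapply(1:3, function(i) allocate(credit, "default", n_features = 8))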

After the ensemble's models are constructed, they can be used to generate a set of predictions, which must be managed in some way. The combination function governs how disagreements among the predictions are reconciled. For example, the ensemble might use a majority vote to determine the final prediction, or it could use a more complex strategy such as weighting each model's votes based on its prior performance.

Some ensembles even utilize another model to learn a combination function from various combinations of predictions. For example, suppose that when M1 and M2 both vote yes, the actual class value is usually no. In this case, the ensemble could learn to ignore the vote of M1 and M2 when they agree. This process of using the predictions of several models to train a final arbiter model is known as stacking.
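
As a rough illustration of these two strategies, the following sketch builds an unweighted majority-vote combination function and a simple stacked arbiter on simulated predictions; the pred_m1 through pred_m3 vectors are hypothetical stand-ins for the class predictions of three models, not output from any model trained in this chapter:

set.seed(123)
actual  <- factor(sample(c("no", "yes"), 100, replace = TRUE))
pred_m1 <- factor(sample(c("no", "yes"), 100, replace = TRUE))
pred_m2 <- factor(sample(c("no", "yes"), 100, replace = TRUE))
pred_m3 <- factor(sample(c("no", "yes"), 100, replace = TRUE))
votes   <- data.frame(m1 = pred_m1, m2 = pred_m2, m3 = pred_m3)

# combination function 1: an unweighted majority vote across the models
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

# combination function 2 (stacking): a second-level arbiter model learns
# how to weigh the first-level predictions against the true outcome
arbiter <- glm(actual ~ m1 + m2 + m3,
               data = cbind(votes, actual = actual),
               family = binomial)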

Figure 11.2: Stacking is a sophisticated ensemble that uses a learning algorithm to combine predictions

One of the benefits of using ensembles is that they may allow you to spend less time in pursuit of a single best model. Instead, you can train a number of reasonably strong candidates and combine them. Yet convenience isn't the only reason why ensemble-based methods continue to rack up wins in machine learning competitions; ensembles also offer a number of performance advantages over single models:

  • Better generalizability to future problems: As the opinions of several learners are incorporated into a single final prediction, no single bias is able to dominate. This reduces the chance of overfitting to a learning task.
  • Improved performance on massive or minuscule datasets: Many models run into memory or complexity limits when an extremely large set of features or examples is used, making it more efficient to train several small models than a single full model. Conversely, ensembles also do well on the smallest datasets because resampling methods like bootstrapping are inherently part of many ensemble designs. Perhaps most importantly, it is often possible to train an ensemble in parallel using distributed computing methods.
  • The ability to synthesize data from distinct domains: Since there is no one-size-fits-all learning algorithm, the ensemble's ability to incorporate evidence from multiple types of learners is increasingly important as complex phenomena rely on data drawn from diverse domains.
  • A more nuanced understanding of difficult learning tasks: Real-world phenomena are often extremely complex, with many interacting intricacies. Models that divide the task into smaller portions are likely to more accurately capture subtle patterns that a single global model might miss.

None of these benefits would be very helpful if you weren't able to easily apply ensemble methods in R, and there are many packages available to do just that. Let's take a look at several of the most popular ensemble methods and how they can be used to improve the performance of the credit model we've been working on.

Bagging

One of the first ensemble methods to gain widespread acceptance used a technique called bootstrap aggregating, or bagging for short. As described by Leo Breiman in 1994, bagging generates a number of training datasets by bootstrap sampling the original training data. These datasets are then used to generate a set of models using a single learning algorithm. The models' predictions are combined using voting (for classification) or averaging (for numeric prediction).

Note

For additional information on bagging, refer to Bagging predictors, Breiman, L, Machine Learning, 1996, Vol. 24, pp. 123-140.

Although bagging is a relatively simple ensemble, it can perform quite well as long as it is used with relatively unstable learners, that is, those generating models that tend to change substantially when the input data changes only slightly. Unstable models are essential in order to ensure the ensemble's diversity in spite of only minor variations between the bootstrap training datasets. For this reason, bagging is often used with decision trees, which have the tendency to vary dramatically given minor changes in input data.

The ipred package offers a classic implementation of bagged decision trees. To train the model, the bagging() function works similarly to many of the models used previously. The nbagg parameter is used to control the number of decision trees voting in the ensemble (with a default value of 25). Depending on the difficulty of the learning task and the amount of training data, increasing this number may improve the model's performance, up to a limit. The downside is that this creates additional computational expense, and a large number of trees may take some time to train.

After installing the ipred package, we can create the ensemble as follows. We'll stick to the default value of 25 decision trees:

> library(ipred)
> RNGversion("3.5.2")
> set.seed(300)
> mybag <- bagging(default ~ ., data = credit, nbagg = 25)

The resulting model works as expected with the predict() function:

> credit_pred <- predict(mybag, credit)
> table(credit_pred, credit$default)
           
credit_pred  no yes
        no  699   2
        yes   1 298

Given the preceding results, the model seems to have fit the training data extremely well. To see how this translates into future performance, we can evaluate the bagged trees with 10-fold CV using the train() function in the caret package. Note that the method name for the ipred bagged trees function is treebag:

> library(caret)
> RNGversion("3.5.2")
> set.seed(300)
> ctrl <- trainControl(method = "cv", number = 10)
> train(default ~ ., data = credit, method = "treebag",
         trControl = ctrl)

Bagged CART

1000 samples
  16 predictor
   2 classes: 'no', 'yes'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
Resampling results:

  Accuracy  Kappa    
  0.746     0.3540389

The kappa statistic of 0.35 for this model suggests that the bagged tree model performs at least as well as the best C5.0 decision tree we tuned earlier in this chapter, which had a 0.32 kappa statistic. This illustrates the power of ensemble methods; a set of simple learners, working together, can outperform very sophisticated models.

Boosting

Another common ensemble-based method is called boosting, because it boosts the performance of weak learners to attain the performance of stronger learners. This method is based largely on the work of Robert Schapire and Yoav Freund, who have published extensively on the topic.

Note

For additional information on boosting, refer to Boosting: Foundations and Algorithms, Schapire, RE, Freund, Y, Cambridge, MA: The MIT Press, 2012.

Similar to bagging, boosting uses ensembles of models trained on resampled data and a vote to determine the final prediction. There are two key distinctions. First, the resampled datasets in boosting are constructed specifically to generate complementary learners. Second, rather than giving each learner an equal vote, boosting gives each learner's vote a weight based on its past performance. Models that perform better have greater influence over the ensemble's final prediction.

Boosting will result in performance that is often better and certainly no worse than the best of the models in the ensemble. Since the models in the ensemble are built to be complementary, it is possible to increase ensemble performance to an arbitrary threshold simply by adding additional classifiers to the group, assuming that each additional classifier performs better than random chance. Given the obvious utility of this finding, boosting is thought to be one of the most significant discoveries in machine learning.

Tip

Although boosting can create a model that achieves an arbitrarily low error rate, this may not always be reasonable in practice. For one, the performance gains become incrementally smaller as additional learners are added, making some thresholds practically infeasible. Additionally, the pursuit of pure accuracy may result in the model being overfitted to the training data and not generalizable to unseen data.

A boosting algorithm called AdaBoost, or adaptive boosting, was proposed by Freund and Schapire in 1997. The algorithm is based on the idea of generating weak learners that iteratively learn a larger portion of the difficult-to-classify examples in the training data by paying more attention (that is, giving more weight) to often misclassified examples.

Beginning from an unweighted dataset, the first classifier attempts to model the outcome. Examples that the classifier predicted correctly will be less likely to appear in the training dataset for the following classifier, and conversely, the difficult-to-classify examples will appear more frequently.

As additional rounds of weak learners are added, they are trained on data with successively more difficult examples. The process continues until the desired overall error rate is reached or performance no longer improves. At that point, each classifier's vote is weighted according to its accuracy on the training data on which it was built.
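
To make the reweighting idea concrete, here is a minimal from-scratch sketch of a single boosting round; it is for intuition only and is not the adabag implementation used below. It assumes the rpart package is available and uses a depth-one decision stump on the credit data as the weak learner:

library(rpart)

n <- nrow(credit)
w <- rep(1 / n, n)                  # round 1: every example is weighted equally

# fit a weak learner (a decision stump) using the current weights
stump <- rpart(default ~ ., data = credit, weights = w,
               control = rpart.control(maxdepth = 1))
pred  <- predict(stump, credit, type = "class")

# weighted error rate and the learner's voting weight (AdaBoost.M1 style)
miss  <- pred != credit$default
err   <- sum(w * miss)
alpha <- log((1 - err) / err)

# misclassified examples gain weight, so the next round focuses on them
w <- w * exp(alpha * miss)
w <- w / sum(w)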

Though boosting principles can be applied to nearly any type of model, the principles are most commonly used with decision trees. We already used boosting in this way in Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, as a method to improve the performance of a C5.0 decision tree.

Another tree-based implementation of AdaBoost for classification is the AdaBoost.M1 algorithm, which can be found in the adabag package.

Note

For more information about the adabag package, refer to adabag: An R Package for Classification with Boosting and Bagging, Alfaro, E, Gamez, M, Garcia, N, Journal of Statistical Software, 2013, Vol. 54, pp. 1-35.

Let's create an AdaBoost.M1 classifier for the credit data. The general syntax for this algorithm is similar to other modeling techniques:

> library(adabag)
> RNGversion("3.5.2")
> set.seed(300)
> m_adaboost <- boosting(default ~ ., data = credit)

As usual, the predict() function is applied to the resulting object to make predictions:

> p_adaboost <- predict(m_adaboost, credit)

Departing from convention, rather than returning a vector of predictions, this returns an object with information about the model. The predictions are stored in a sub-object called class:

> head(p_adaboost$class)
[1] "no"  "yes" "no"  "no"  "yes" "no"

A confusion matrix can be found in the confusion sub-object:

> p_adaboost$confusion
               Observed Class
Predicted Class  no yes
            no  700   0
            yes   0 300

Did you notice that the AdaBoost model made no mistakes? Before you get your hopes up, remember that the preceding confusion matrix is based on the model's performance on the training data. Since boosting allows the error rate to be reduced to an arbitrarily low level, the learner simply continued until it made no more errors. This likely resulted in overfitting on the training dataset.

For a more accurate assessment of performance on unseen data, we need to use another evaluation method. The adabag package provides a simple function to use 10-fold CV:

> RNGversion("3.5.2")
> set.seed(300)
> adaboost_cv <- boosting.cv(default ~ ., data = credit)

Depending on your computer's capabilities, this may take some time to run, during which it will log each iteration to the screen—on my recent MacBook Pro computer, it took about four minutes. After it completes, we can view a more reasonable confusion matrix:

> adaboost_cv$confusion
               Observed Class
Predicted Class  no yes
            no  594 151
            yes 106 149

We can find the kappa statistic using the vcd package as described in Chapter 10, Evaluating Model Performance:

> library(vcd)
> Kappa(adaboost_cv$confusion)
            value    ASE     z  Pr(>|z|)
Unweighted 0.3607 0.0323 11.17 5.914e-29
Weighted   0.3607 0.0323 11.17 5.914e-29

With a kappa of about 0.361, this is our best-performing credit scoring model yet. Let's see how it compares to one last ensemble method.

Tip

The AdaBoost.M1 algorithm can be tuned in caret by specifying method = "AdaBoost.M1".
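
A rough sketch of what that tuning might look like follows. The mfinal (number of boosting iterations), maxdepth, and coeflearn parameters are the ones caret exposes for this method on recent versions—confirm them on your installation with modelLookup("AdaBoost.M1")—and be warned that this can be quite slow to run:

> modelLookup("AdaBoost.M1")
> grid_ada <- expand.grid(mfinal = c(25, 50),
                          maxdepth = 3,
                          coeflearn = "Breiman")
> RNGversion("3.5.2"); set.seed(300)
> m_ada <- train(default ~ ., data = credit, method = "AdaBoost.M1",
                 trControl = trainControl(method = "cv", number = 10),
                 tuneGrid = grid_ada)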

Random forests

Another ensemble-based method called random forests (or decision tree forests) focuses only on ensembles of decision trees. This method was championed by Leo Breiman and Adele Cutler, and combines the base principles of bagging with random feature selection to add additional diversity to the decision tree models. After the ensemble of trees (the forest) is generated, the model uses a vote to combine the trees' predictions.

Note

For more detail on how random forests are constructed, refer to Random Forests, Breiman, L, Machine Learning, 2001, Vol. 45, pp. 5-32.

Random forests combine versatility and power into a single machine learning approach. Because the ensemble uses only a small, random portion of the full feature set, random forests can handle extremely large datasets, where the so-called "curse of dimensionality" might cause other models to fail. At the same time, its error rates for most learning tasks are on a par with nearly any other method.

Tip

Although the term "random forests" is trademarked by Breiman and Cutler, the term is sometimes used colloquially to refer to any type of decision tree ensemble. A pedant would use the more general term "decision tree forests" except when referring to the specific implementation by Breiman and Cutler.

It's worth noting that relative to other ensemble-based methods, random forests are quite competitive and offer key advantages. For instance, random forests tend to be easier to use and less prone to overfitting. The following table lists the general strengths and weaknesses of random forest models:

Strengths

  • An all-purpose model that performs well on most problems
  • Can handle noisy or missing data, as well as categorical or continuous features
  • Selects only the most important features
  • Can be used on data with an extremely large number of features or examples

Weaknesses

  • Unlike a decision tree, the model is not easily interpretable

Due to their power, versatility, and ease of use, random forests are one of the most popular machine learning methods. Later in this chapter, we'll compare a random forest model head-to-head against the boosted C5.0 tree.

Training random forests

Though there are several packages to create random forests in R, the randomForest package is perhaps the implementation most faithful to the specification by Breiman and Cutler, and is also supported by caret for automated tuning. The syntax for training this model is as follows:

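The calls below are a schematic sketch rather than runnable code; mydata, class, predictors, p, and test are placeholders for your own dataset, outcome, feature list, feature count, and test set:

## Random forest classification sketch (randomForest package)

# building the model:
#   m <- randomForest(class ~ predictors, data = mydata,
#                     ntree = 500,      # number of trees in the forest
#                     mtry = sqrt(p))   # predictors sampled at each split

# making predictions:
#   p <- predict(m, test, type = "response")   # or "prob" for probabilities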

By default, the randomForest() function creates an ensemble of 500 trees that consider sqrt(p) random features at each split, where p is the number of features in the training dataset and sqrt() refers to R's square root function. Whether or not these default parameters are appropriate depends on the nature of the learning task and training data. Generally, more complex learning problems and larger datasets (both more features as well as more examples) work better with a larger number of trees, though this needs to be balanced with the computational expense of training more trees.

The goal of using a large number of trees is to train enough that each feature has a chance to appear in several models. This is the basis of the sqrt(p) default value for the mtry parameter; using this value limits the features sufficiently such that substantial random variation occurs from tree to tree. For example, since the credit data has 16 features, each tree would be limited to splitting on four features at any time.

Let's see how the default randomForest() parameters work with the credit data. We'll train the model just as we have done with other learners. Again, the set.seed() function ensures that the result can be replicated:

> library(randomForest)
> RNGversion("3.5.2")
> set.seed(300)
> rf <- randomForest(default ~ ., data = credit)

To look at a summary of the model's performance, we can simply type the resulting object's name:

> rf

Call:
 randomForest(formula = default ~ ., data = credit)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 23.3%
Confusion matrix:
     no yes class.error
no  638  62  0.08857143
yes 171 129  0.57000000

The output shows that the random forest included 500 trees and tried four variables at each split, as expected. At first glance, you might be alarmed at the seemingly poor performance according to the confusion matrix—the error rate of 23.3 percent is far worse than the resubstitution error of any of the other ensemble methods so far. However, this confusion matrix does not show resubstitution error. Instead, it reflects the out-of-bag error rate (listed in the output as OOB estimate of error rate), which, unlike resubstitution error, is an unbiased estimate of the test set error. This means that it should be a fairly reasonable estimate of future performance.

The out-of-bag estimate is computed during the construction of the random forest. Essentially, any example not selected for a single tree's bootstrap sample can be used to test the model's performance on unseen data. At the end of the forest construction, for each of the 1,000 examples in the dataset, the trees that did not use the example in training are allowed to make a prediction. These predictions are tallied for each example and a vote is taken to determine the single final prediction for the example. The total error rate of such predictions becomes the out-of-bag error rate.
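
These out-of-bag votes are stored with the fitted model in the predicted component, so the reported error rate can be reproduced directly; a quick sketch using the rf object trained above:

> head(rf$predicted)                      # OOB-voted class for the first examples
> mean(rf$predicted != credit$default)    # recomputes the OOB error rate
> table(credit$default, rf$predicted)     # rebuilds the confusion matrix counts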

Tip

In Chapter 10, Evaluating Model Performance, it was stated that any given example has a 63.2 percent chance of being included in a bootstrap sample. This implies that an average of 36.8 percent of the 500 trees in the random forest voted for each of the 1,000 examples in the out-of-bag estimate.
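
The arithmetic behind this tip is easy to verify in R:

> n <- 1000
> 1 - (1 - 1/n)^n       # chance an example appears in a bootstrap sample (~0.632)
> 0.368 * 500           # so roughly 184 of the 500 trees vote OOB for each example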

To calculate the kappa statistic on the out-of-bag predictions, we can use the function in the vcd package as follows. The code applies the Kappa() function to the first two rows and columns of the confusion object, which stores the confusion matrix of the out-of-bag predictions for the rf random forest model object:

> library(vcd)
> Kappa(rf$confusion[1:2,1:2])
           value     ASE     z  Pr(>|z|)
Unweighted 0.381 0.03215 11.85 2.197e-32
Weighted   0.381 0.03215 11.85 2.197e-32

With a kappa statistic of 0.381, the random forest is our best-performing model yet. Its kappa is higher than that of the best boosted C5.0 decision tree, which was about 0.325, and also higher than that of the AdaBoost.M1 model, which was about 0.361. Given this impressive initial result, we should attempt a more formal evaluation of its performance.

Evaluating random forest performance in a simulated competition

As mentioned previously, the randomForest() function is supported by caret, which allows us to optimize the model while at the same time calculating performance measures beyond the out-of-bag error rate. To make things interesting, let's compare an auto-tuned random forest to the best auto-tuned boosted C5.0 model we've developed. We'll treat this experiment as if we were hoping to identify a candidate model for submission to a machine learning competition.

We must first load caret and set our training control options. For the most accurate comparison of model performance, we'll use repeated 10-fold CV, or 10-fold CV repeated 10 times. This means that the models will take a much longer time to build and will be more computationally intensive to evaluate, but since this is our final comparison, we should be very sure that we're making the right choice; the winner of this showdown, selected by the best value of our chosen performance metric, will be our only entry into the machine learning competition.

Additionally, we'll add a few new options to the trainControl() function. First, we'll set the savePredictions and classProbs parameters to TRUE, which saves the holdout sample predictions and predicted probabilities for plotting the ROC curve later. Then, we'll also set the summaryFunction to twoClassSummary, which is a caret function that computes performance metrics like AUC. The full control object is defined as follows:

> library(caret)
> ctrl <- trainControl(method = "repeatedcv",
                       number = 10, repeats = 10,
                       selectionFunction = "best",
                       savePredictions = TRUE,
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)

Next, we'll set up the tuning grid for the random forest. The only tuning parameter for this model is mtry, which defines how many features are randomly selected at each split. By default, we know that the random forest will use sqrt(16), or four features at each split. To be thorough, we'll also test values half of that and twice that, as well as the full set of 16 features. Thus, we need to create a grid with values of 2, 4, 8, and 16 as follows:

> grid_rf <- expand.grid(mtry = c(2, 4, 8, 16))

Tip

A random forest that considers the full set of features at each split is essentially the same as a bagged decision tree model.

We'll supply the resulting grid to the train() function with the ctrl object, and use the "ROC" metric to select the best model. This metric refers to the area under the ROC curve. The complete experiment can be run as follows:

> RNGversion("3.5.2"); set.seed(300)
> m_rf <- train(default ~ ., data = credit, method = "rf",
                metric = "ROC", trControl = ctrl,
                tuneGrid = grid_rf)

The preceding command may take some time to complete, as it has quite a bit of work to do—on my recent MacBook Pro computer, it took about seven minutes! When the random forests are finished training, we'll compare the best forest to the best boosted decision tree among trees with 10, 25, 50, and 100 iterations using the following caret experiment:

> grid_c50 <- expand.grid(model = "tree",
                          trials = c(10, 25, 50, 100),
                          winnow = FALSE)
> RNGversion("3.5.2"); set.seed(300)
> m_c50 <- train(default ~ ., data = credit, method = "C5.0",
                 metric = "ROC", trControl = ctrl,
                 tuneGrid = grid_c50)

When the C5.0 decision tree finally completes, we can compare the two approaches side by side. For the random forest model, the results are:

> m_rf

Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec      
   2    0.7579643  0.9900000  0.09766667
   4    0.7695071  0.9377143  0.30166667
   8    0.7739714  0.9064286  0.38633333
  16    0.7747905  0.8921429  0.44100000

For the boosted C5.0 model, the results are:

> m_c50

Resampling results across tuning parameters:

  trials  ROC        Sens       Spec     
   10     0.7399571  0.8555714  0.4346667
   25     0.7523238  0.8594286  0.4390000
   50     0.7559857  0.8635714  0.4436667
  100     0.7566286  0.8630000  0.4450000

Based on these head-to-head results, the random forest with mtry = 16 appears to be our winner, as its best AUC of 0.775 is better than the AUC of 0.757 for the best boosted C5.0 model.

To visualize their performance, we can use the pROC package to plot the ROC curves. We'll supply the roc() function with the observed (obs) values of the loan default, as well as the estimated probability of "yes" for the loan default. Note that these were saved by caret because we requested them via the trainControl() function. We can then plot() the ROC curves as follows:

> library(pROC)
> roc_rf <- roc(m_rf$pred$obs, m_rf$pred$yes)
> roc_c50 <- roc(m_c50$pred$obs, m_c50$pred$yes)
> plot(roc_rf, col = "red", legacy.axes = TRUE)
> plot(roc_c50, col = "blue", add = TRUE)

As anticipated, the resulting curves show that the random forest with an AUC of 0.775 slightly outperforms the boosted C5.0 model with an AUC of 0.757. The random forest is the outermost curve in the following R plot:

Figure 11.3: ROC curves comparing a random forest (outermost curve) to a boosted C5.0 decision tree on the loan default dataset
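
If you prefer a numeric comparison to reading the plot, the areas under the curves can also be extracted from the roc objects; note that the AUC computed on the pooled cross-validation predictions may differ slightly from the averaged resampling estimates reported by train():

> auc(roc_rf)
> auc(roc_c50)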

Based on our experiment, we would submit a random forest to the competition as our top-performing model. However, until it actually makes predictions on the competition test set, we have no way of knowing for sure whether our random forest will end up winning. Given our performance estimates, it's the safest bet of the models we evaluated, and with a bit of luck, perhaps we'll come away with the prize.
