Creating an ensemble

Using mlr again, we first create an object that names our base learners: once again classif.randomForest and, for the MARS model, a learner that calls the earth package:

> base <- c("classif.randomForest", "")

Next, create a learner object for each base learner, and then set each learner's prediction type so that it outputs predicted probabilities:

> learns <- lapply(base, makeLearner)

> learns <- lapply(learns, setPredictType, "prob")
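
As a minimal sketch of the two steps above, the following fills in the second learner name with "classif.earth". That name is an assumption: it is mlr's learner for the earth package, which matches the MARS description, but it is not shown in the code above. The package guard is also an addition, so the snippet runs only when mlr, randomForest, and earth are installed.

```r
# ASSUMPTION: "classif.earth" is the intended MARS learner; the text only
# says the MARS model comes from the earth package.
base <- c("classif.randomForest", "classif.earth")

# Guard (not in the original): only build learners if the packages exist.
needed <- c("mlr", "randomForest", "earth")
if (all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))) {
  suppressPackageStartupMessages(library(mlr))

  # One learner object per name, then switch each to probability output.
  learns <- lapply(base, makeLearner)
  learns <- lapply(learns, setPredictType, "prob")

  # Inspect the prediction type stored on each learner object.
  print(vapply(learns, function(l) l$predict.type, character(1)))
}
```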

The base learner object is now complete. As stated earlier, the ensemble (super) learner will be a GLM from glmnet. With just two base learners, a CART might be more appropriate, but let's demonstrate what's possible. There are several methods for stacking; in the following code block, I stack with cross-validation:

> sl <-
base.learners = learns,
super.learner = "classif.glmnet",
predict.type = "prob",
method = ""
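
For reference, here is one way the complete call could look. It is a sketch under stated assumptions rather than the exact original code: the argument names match mlr's makeStackedLearner(), and "stack.cv" is that function's option for stacking with cross-validation, but neither the function name nor the method value appears in the listing above. The second base learner name and the package guard are likewise assumptions.

```r
# Guard (not in the original): run only if every required package exists.
needed <- c("mlr", "randomForest", "earth", "glmnet")
have_all <- all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))

if (have_all) {
  suppressPackageStartupMessages(library(mlr))

  # ASSUMPTION: "classif.earth" stands in for the unnamed MARS learner.
  base <- c("classif.randomForest", "classif.earth")
  learns <- lapply(lapply(base, makeLearner), setPredictType, "prob")

  # ASSUMPTION: makeStackedLearner() with method = "stack.cv" is the call
  # the text describes as stacking with cross-validation.
  sl <- makeStackedLearner(
    base.learners = learns,
    super.learner = "classif.glmnet",
    predict.type = "prob",
    method = "stack.cv"
  )
  print(class(sl))
}
```

With "stack.cv", the super learner is fit on cross-validated predictions from the base learners instead of on predictions for rows the base learners have already seen.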

Now, it gets exciting as we train our stacked model:

> stacked_fit <- mlr::train(sl, dna_task)

And we establish the predicted probabilities for the test data:

> pred_stacked <- predict(stacked_fit, newdata = test)

Just for a sanity check, let's look at the confusion matrix:

> mlr::calculateConfusionMatrix(pred_stacked)
true      ei  ie   n -err.-
  ei     144   4   5      9
  ie       5 146   2      7
  n        2   1 327      3
  -err.-   7   5   7     19
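
To see how the -err.- margins relate to the cell counts, the matrix can be rebuilt by hand in base R. This is only an illustrative check: the counts are copied from the output above, the object names are made up for the example, and the rows are taken to be the true classes, as the header indicates.

```r
# Cell counts copied from the confusion matrix above (rows = true class).
cm <- matrix(
  c(144,   4,   5,
      5, 146,   2,
      2,   1, 327),
  nrow = 3, byrow = TRUE,
  dimnames = list(true = c("ei", "ie", "n"),
                  predicted = c("ei", "ie", "n"))
)

# Errors are the off-diagonal counts.
row_err   <- rowSums(cm) - diag(cm)   # misclassified cases per true class
col_err   <- colSums(cm) - diag(cm)   # wrong predictions per predicted class
total_err <- sum(cm) - sum(diag(cm))  # all misclassified cases

# Accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

print(row_err)
print(col_err)
print(total_err)
print(accuracy)
```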

The stacked model produced six fewer classification errors than the random forest alone. The proof is in the metrics:

> mlr::performance(pred_stacked, measures = list(acc, logloss))
      acc   logloss
0.9701258 0.1101400

Accuracy improved, as expected, and, better still, the log-loss dropped substantially.
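
Why look at log-loss in addition to accuracy? Log-loss is the negative mean log of the probability a model assigns to the true class, so it rewards confident correct predictions, while accuracy only counts hits. The base R sketch below illustrates this with made-up probabilities. The log_loss() helper and the toy matrices are hypothetical, not part of mlr, and the clipping constant is just a common safeguard against log(0).

```r
# Hypothetical helper: multiclass log-loss from a probability matrix whose
# column names are the class labels.
log_loss <- function(prob, truth, eps = 1e-15) {
  # Pick, for each row, the probability assigned to its true class.
  idx <- cbind(seq_len(nrow(prob)), match(truth, colnames(prob)))
  p_true <- prob[idx]
  # Clip to avoid log(0).
  p_true <- pmin(pmax(p_true, eps), 1 - eps)
  -mean(log(p_true))
}

classes <- c("ei", "ie", "n")
truth   <- c("ei", "ie", "n")

# Two toy models that both classify every case correctly ...
prob_vague <- matrix(c(0.5, 0.3, 0.2,
                       0.3, 0.4, 0.3,
                       0.3, 0.3, 0.4),
                     nrow = 3, byrow = TRUE, dimnames = list(NULL, classes))
prob_sharp <- matrix(c(0.90, 0.05, 0.05,
                       0.05, 0.90, 0.05,
                       0.05, 0.05, 0.90),
                     nrow = 3, byrow = TRUE, dimnames = list(NULL, classes))

# ... but the more confident one earns the lower (better) log-loss.
ll_vague <- log_loss(prob_vague, truth)
ll_sharp <- log_loss(prob_sharp, truth)
print(c(vague = ll_vague, sharp = ll_sharp))
```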

What have we learned? Using primarily one package, mlr, we built a good model with random forest, but by stacking random forest and MARS, we improved performance. Although all of that took just a few lines of code, it's important to understand how to create and implement the pipeline.

