Using the functionality of mlr again, we first need to create an object with our base learners. These are, once again, classif.randomForest and, for a MARS model, classif.earth from the earth package:
> base <- c("classif.randomForest", "classif.earth")
You now create a learner from each of those base learners, and then specify that you want their output as predicted probabilities:
> learns <- lapply(base, makeLearner)
> learns <- lapply(learns, setPredictType, "prob")
The process of building the base learner object is complete. I stated earlier that the ensemble learning algorithm will be a GLM from glmnet. For just two base learners, a CART might be more appropriate, but let's demonstrate what's possible. There are a number of methods for stacking. In the following code block, I stack with cross-validation:
> sl <-
mlr::makeStackedLearner(
base.learners = learns,
super.learner = "classif.glmnet",
predict.type = "prob",
method = "stack.cv"
)
Now, it gets exciting as we train our stacked model:
> stacked_fit <- mlr::train(sl, dna_task)
And we establish the predicted probabilities for the test data:
> pred_stacked <- predict(stacked_fit, newdata = test)
Just for a sanity check, let's look at the confusion matrix:
> mlr::calculateConfusionMatrix(pred_stacked)
predicted
true ei ie n -err.-
ei 144 4 5 9
ie 5 146 2 7
n 2 1 327 3
-err.- 7 5 7 19
The stacked model produced six fewer classification errors. The proof is in the metrics:
> mlr::performance(pred_stacked, measures = list(acc, logloss))
acc logloss
0.9701258 0.1101400
Of course, accuracy is better, but more importantly, the log-loss improved substantially.
What have we learned? Using primarily one package, mlr, we built a good model with random forest, but by stacking random forest and MARS, we improved performance. Although all of that was with just a few lines of code, it's important to understand how to create and implement the pipeline.
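To tie the steps together, here is the whole stacking pipeline as one sketch. The train/test split and the construction of dna_task come from the earlier sections; the target column name used in the commented task-creation line is an assumption for illustration:

```r
library(mlr)

# Assumes the DNA data has already been split into train and test,
# and that a classification task exists on the training data, e.g.:
# dna_task <- makeClassifTask(data = train, target = "Class")  # hypothetical target name

# 1. Define the base learners and request probability output
base <- c("classif.randomForest", "classif.earth")
learns <- lapply(base, makeLearner)
learns <- lapply(learns, setPredictType, "prob")

# 2. Stack them under a glmnet super learner, using cross-validated
#    (out-of-fold) base-learner predictions as its inputs
sl <- makeStackedLearner(
  base.learners = learns,
  super.learner = "classif.glmnet",
  predict.type  = "prob",
  method        = "stack.cv"
)

# 3. Train, predict on the test set, and evaluate
stacked_fit  <- train(sl, dna_task)
pred_stacked <- predict(stacked_fit, newdata = test)

calculateConfusionMatrix(pred_stacked)
performance(pred_stacked, measures = list(acc, logloss))
```

The choice of method = "stack.cv" matters: the super learner is fit on out-of-fold predictions, which guards against the optimism you would get if the base learners predicted on the same data they were trained on.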