Chapter 6. Classification and Regression Trees


"The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret), achieves 94.1 percent of the maximum accuracy overcoming 90 percent in the 84.3 percent of the data sets."

 --Fernández-Delgado et al. (2014)

Introduction

This quote from Fernández-Delgado et al. in the Journal of Machine Learning Research sets the stage for this chapter: the techniques covered here are quite powerful, particularly for classification problems. They are certainly not always the best solution, but they do provide a good starting point.

In the previous chapters, we examined techniques to predict either a quantity or a class label. Here, we will apply the methods of this chapter to both types of problem. We will also approach the business problem differently than before: instead of defining a new one, we will revisit some of the issues that we have already tackled, to see whether we can improve our predictive power. For all intents and purposes, the business case in this chapter is to improve on the models that we selected before.

The first item of discussion is the basic decision tree, which is simple both to build and to understand. However, a single decision tree does not perform as well as the other methods you have learned, such as support vector machines, or will learn, such as neural networks. Therefore, we will discuss building multiple trees, sometimes hundreds, and combining their individual results into a single overall prediction. As the paper quoted at the beginning of this chapter states, these methods perform as well as, or better than, any technique in this book. They are known as random forests and gradient boosted trees, and a brief sketch of the core idea follows.
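To make the contrast concrete, here is a minimal sketch in R of a single tree versus a forest of trees. The rpart and randomForest packages and the built-in iris data are illustrative assumptions here; they are not necessarily the packages or data used later in the chapter.

library(rpart)          # recursive partitioning: a single decision tree
library(randomForest)   # an ensemble of many trees

# Illustrative only: packages and data are assumptions, not this
# chapter's actual case study
tree_fit <- rpart(Species ~ ., data = iris)

# A random forest: 500 trees, each grown on a bootstrap sample of the
# data, with their votes combined into one prediction per observation
set.seed(123)
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500)

# Accuracy on the training data for the single tree, and the more
# honest out-of-bag accuracy for the forest
mean(predict(tree_fit, type = "class") == iris$Species)
mean(predict(rf_fit) == iris$Species)

The point of the sketch is simply that the ensemble's prediction aggregates many individual trees; the sections that follow examine why that aggregation improves predictive accuracy.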
