Chapter 5. Nonlinear Classification and Regression with Decision Trees

In the previous chapters we discussed generalized linear models, which relate a linear combination of explanatory variables to one or more response variables using a link function. You learned to use multiple linear regression for regression problems and logistic regression for classification tasks. In this chapter we will discuss a simple, nonlinear model for classification and regression tasks: the decision tree. We'll use decision trees to build an ad blocker that can learn to classify images on a web page as banner advertisements or page content. Finally, we will introduce ensemble learning methods, which combine a set of models to produce an estimator with better predictive performance than any of its component estimators.

Decision trees

Decision trees are tree-like graphs that model a decision. They are analogous to the parlor game Twenty Questions. In Twenty Questions, one player, called the answerer, chooses an object but does not reveal the object to the other players, who are called questioners. The object should be a common noun, such as "guitar" or "sandwich", but not "1969 Gibson Les Paul Custom" or "North Carolina". The questioners must guess the object by asking as many as twenty questions that can be answered with yes, no, or maybe. An intuitive strategy for questioners is to ask questions of increasing specificity; asking "is it a musical instrument?" as the first question will not efficiently reduce the number of possibilities, because most objects are not instruments and a "no" answer eliminates only a few candidates. A question that splits the remaining possibilities more evenly is expected to eliminate more of them. The branches of a decision tree specify the shortest sequences of explanatory variables that can be examined in order to estimate the value of a response variable. To continue the analogy, in Twenty Questions the questioners and the answerer all have knowledge of the training data, but only the answerer knows the values of the features for the test instance.
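The efficiency of a yes/no question can be quantified with Shannon entropy, the same quantity that underlies the information-gain split criterion used to learn decision trees. A minimal sketch (the object counts are invented for illustration):

```python
import math

def entropy(p_yes):
    """Shannon entropy, in bits, of a yes/no question that is
    answered 'yes' with probability p_yes."""
    if p_yes in (0.0, 1.0):
        return 0.0
    p_no = 1.0 - p_yes
    return -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))

# Suppose 128 objects remain. A question matching 64 of them splits the
# possibilities evenly and yields a full bit of information; a narrow
# question matching only 8 of them yields far less, on average.
print(entropy(64 / 128))  # 1.0 bit: the most informative yes/no question
print(entropy(8 / 128))   # about 0.34 bits: a too-specific question
```

The most informative question is the one that splits the remaining possibilities closest to evenly, which is why "is it a musical instrument?" is a poor opening move.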

Decision trees are commonly learned by recursively splitting the set of training instances into subsets based on the instances' values for the explanatory variables. The following diagram depicts a decision tree that we will look at in more detail later in the chapter.

[Figure: an example decision tree]

Represented by boxes, the interior nodes of the decision tree test explanatory variables. These nodes are connected by edges that specify the possible outcomes of the tests. The training instances are divided into subsets based on the outcomes of the tests. For example, a node might test whether or not the value of an explanatory variable exceeds a threshold. The instances that pass the test will follow an edge to the node's right child, and the instances that fail the test will follow an edge to the node's left child. The child nodes similarly test their subsets of the training instances until a stopping criterion is satisfied. In classification tasks, the leaf nodes of the decision tree represent classes. In regression tasks, the values of the response variable for the instances contained in a leaf node may be averaged to produce the estimate for the response variable. After the decision tree has been constructed, making a prediction for a test instance requires only following the edges until a leaf node is reached.
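As a preview of the scikit-learn API we will use throughout this chapter, the following sketch fits a DecisionTreeClassifier on a toy data set and predicts the class of a new instance by following the learned tests from the root to a leaf. The feature values and labels are invented for illustration, loosely echoing the ad-blocker task:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training set: each instance is [image height in pixels,
# image width in pixels].
X = [[90, 728], [60, 468], [250, 300], [600, 400], [100, 600], [500, 500]]
y = [1, 1, 1, 0, 1, 0]  # 1 = banner advertisement, 0 = page content

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Prediction follows the tree's tests until a leaf node is reached.
print(clf.predict([[80, 700]]))  # [1]: classified as an advertisement
```

Here the tree can separate the classes with a single threshold test on the height feature, so the new short, wide image reaches a leaf labeled as an advertisement.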
