Understanding the regression trees

To establish an understanding of tree-based methods, it is probably easier to start with a quantitative outcome and then move on to how it works in a classification problem. The essence of a tree is that the features are partitioned, starting with the first split that improves the RSS the most. These binary splits continue until the termination of the tree. Each subsequent split/partition is not done on the entire dataset but only on the portion of the prior split that it falls under. This top-down process is referred to as recursive partitioning. It is also a process that is greedy, a term you may stumble upon in reading about the machine learning methods. Greedy means that, during each split in the process, the algorithm looks for the greatest reduction in the RSS without any regard to how well it will perform on the latter partitions. The result is that you may end up with a full tree of unnecessary branches leading to a low bias but a high variance. To control this effect, you need to appropriately prune the tree to an optimal size after building a full tree.

Figure 6.1, provides a visual of this technique in action. The data is hypothetical with 30 observations, a response ranging from 1 to 10, and two predictor features, both ranging in value from 0 to 10 named X1 and X2. The tree has three splits leading to four terminal nodes. Each split is basically an if...then statement or uses an R syntax ifelse(). The first split is--if X1 is less than 3.5, then the response is split into four observations with an average value of 2.4 and the remaining 26 observations. This left branch of four observations is a terminal node as any further splits would not substantially improve the RSS. The predicted value for these four observations in that partition of the tree becomes the average. The next split is at X2 < 4 and finally, X1 < 7.5.
An advantage of this method is that it can handle highly nonlinear relationships; however, can you see a couple of potential problems? The first issue is that an observation is given the average of the terminal node under which it falls. This can hurt the overall predictive performance (high bias). Conversely, if you keep partitioning the data further and further so as to achieve a low bias, a high variance can become an issue. As with the other methods, you can use cross-validation to select the appropriate tree depth size:

Figure 6.1: Regression Tree with 3 splits and 4 terminal nodes and the corresponding node average and number of observations
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset