Tree methods

In many datasets, the relationship between our inputs and output may not be a straight line. For example, consider the relationship between the hour of the day and the probability of posting on social media. If you were to plot this probability, it would likely rise around the lunch break and in the evening and fall during the night, the morning, and the workday, forming a cyclical pattern over the 24-hour day. A linear model cannot represent this kind of relationship, as the value of the response does not strictly increase or decrease with the hour of the day. What models, then, could we use to capture this relationship? In the specific case of time series we could use approaches such as the Kalman filter described above, using the components of the structural time series equation to represent the cyclical 24-hour pattern of social media activity. In the following sections we examine more general approaches that apply both to time series data and to other non-linear relationships.

Decision trees

Consider a case where we assign a different probability of posting on social media to each range of hours: after 11 a.m. and before 1 p.m., after 1 p.m. and before 6 p.m., and so forth. We could visualize these ranges as the branches of a tree: at each branch point we have a condition (such as hour < 6 p.m.) and assign our input data to one branch or the other. We continue this sort of branching until we reach the end of a series of such decisions, called a "leaf" of the tree; the predicted response of the tree is then the average value of the training data points that fall in this last group. To predict the response for new data points, we follow the branches of the tree from the root down to a leaf. To build a decision tree, then, we need the following steps:

  1. Start with a training set of features X and responses y.
  2. Find the column of X, along with the dividing point, that gives the best split of the data points. There are several criteria we could optimize, such as the variance of the target response on each side of the decision boundary (see the split functions discussed later) (Breiman, Leo, et al. Classification and Regression Trees. CRC Press, 1984). We assign the training data points to two new branches based on this rule.
  3. Repeat step 2 until a stopping rule is reached, or there is only a single value in each of the final branches of the tree.
  4. The predicted response is then given by the average response of the training points that end up in a particular leaf of the tree (a minimal code sketch of these steps follows this list).
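
To make these steps concrete, the following is a minimal sketch of a regression tree grown with the variance criterion mentioned in step 2. It is an illustration only, not the scikit-learn implementation we use later in this section; the function names and the dictionary representation of the tree are our own choices:

import numpy as np

def best_split(X, y):
    # Step 2: scan every column and candidate threshold, keeping the split that
    # minimizes the size-weighted variance of the responses in the two branches
    best_col, best_thresh, best_score = None, None, np.inf
    for col in range(X.shape[1]):
        for thresh in np.unique(X[:, col]):
            left, right = y[X[:, col] <= thresh], y[X[:, col] > thresh]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * left.var() + len(right) * right.var()) / len(y)
            if score < best_score:
                best_col, best_thresh, best_score = col, thresh, score
    return best_col, best_thresh

def build_tree(X, y, min_samples_leaf=20):
    # Step 3: stop splitting when too few points remain or the responses are constant
    if len(y) < 2 * min_samples_leaf or np.unique(y).size == 1:
        return {'leaf': True, 'value': y.mean()}   # step 4: a leaf stores the average response
    col, thresh = best_split(X, y)
    if col is None:
        return {'leaf': True, 'value': y.mean()}
    mask = X[:, col] <= thresh
    return {'leaf': False, 'col': col, 'thresh': thresh,
            'left': build_tree(X[mask], y[mask], min_samples_leaf),
            'right': build_tree(X[~mask], y[~mask], min_samples_leaf)}

def predict_one(tree, x):
    # follow the branches from the root to a leaf and return its stored average
    while not tree['leaf']:
        tree = tree['left'] if x[tree['col']] <= tree['thresh'] else tree['right']
    return tree['value']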

As mentioned previously, every time we select a split point in the tree model, we need a principled way of determining whether one candidate variable is better than another for dividing the data into groups with more homogeneous responses. There are several options.

Variance reduction measures whether the two groups formed after splitting the data have lower variance in the response variable y than the data as a whole, and is used in the Classification and Regression Trees (CART) algorithm for decision trees (Breiman, Leo, et al. Classification and Regression Trees. CRC Press, 1984). It may be calculated as:

$$I_V = \mathrm{Var}(A) - \left(\frac{|L|}{|A|}\,\mathrm{Var}(L) + \frac{|R|}{|A|}\,\mathrm{Var}(R)\right)$$

Where A is the set of all data points before the split, L is the set of points that fall on the left side of the split, and R is the set of points that fall on the right side. This quantity is maximized when the size-weighted combined variance of the two sides of the split is as small as possible relative to the variance of the original data.
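
As a sketch, this quantity can be computed directly for a candidate split; y_left and y_right below are illustrative NumPy arrays holding the responses that fall on each side:

import numpy as np

def variance_reduction(y_left, y_right):
    # variance of all points minus the size-weighted variances of the two branches
    y_all = np.concatenate([y_left, y_right])
    weighted = (len(y_left) * y_left.var() + len(y_right) * y_right.var()) / len(y_all)
    return y_all.var() - weighted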

Variance reduction will work best for problems like the one we are examining in this chapter, where the output is a continuous variable. However, in classification problems with categorical outcomes, such as those we will examine in Chapter 5, Putting Data in its Place – Classification Methods and Analysis, the variance becomes less meaningful because the data can only assume a fixed number of values (1 or 0 for a particular class). Another statistic we might optimize is the "information gain," which is used in the Iterative Dichotomiser 3 (ID3) and C4.5 algorithms for building decision trees (Quinlan, J. Ross. C4.5: Programs for Machine Learning. Elsevier, 2014; Quinlan, J. Ross. "Induction of decision trees." Machine Learning 1.1 (1986): 81-106). The information gain statistic asks whether the groups on the left and right sides of the decision split become more homogeneous (purer) after the data is partitioned. If we consider the response y to be a class probability, the information gain is calculated as:

$$IG = -\sum_{k} f_{Ak}\log_2 f_{Ak} - \alpha\left(-\sum_{k} f_{Lk}\log_2 f_{Lk}\right) - (1-\alpha)\left(-\sum_{k} f_{Rk}\log_2 f_{Rk}\right)$$

Where α is the fraction of the data that falls on the left side of the split, and $f_{Ak}$, $f_{Lk}$, and $f_{Rk}$ are the fractions of elements in class k among all data points, the left side, and the right side of the split, respectively. The three terms of this equation are known as entropies (Borda, Monica. Fundamentals in Information Theory and Coding. Springer Science & Business Media, 2011). Why does the entropy reflect a good split of the data? To see this, plot the values of the function $-y\log_2 y$ for y between 0 and 1 using the following:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> probs = np.arange(0.01, 1, 0.01)
>>> entropy = [-1*np.log2(p)*p for p in probs]
>>> plt.plot(probs, entropy)
>>> plt.xlabel('y')
>>> plt.ylabel('Entropy')

Look at the result:

[Plot of the entropy term $-y\log_2 y$ as a function of y]

You can appreciate that the entropy drops as y approaches 0 or 1, which correspond to a very low or very high probability of a particular class in a classification problem. Trees that split data according to information gain will therefore favor splits whose left and right branches each tend strongly towards or against a given class in the response.
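
As a sketch, the information gain of a candidate split can be computed from the entropy of the class labels on each side; the function names below are our own:

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a vector of class labels
    _, counts = np.unique(labels, return_counts=True)
    fractions = counts / counts.sum()
    return -np.sum(fractions * np.log2(fractions))

def information_gain(labels_left, labels_right):
    # entropy of the parent node minus the size-weighted entropies of the two branches
    labels_all = np.concatenate([labels_left, labels_right])
    alpha = len(labels_left) / len(labels_all)
    return (entropy(labels_all)
            - alpha * entropy(labels_left)
            - (1 - alpha) * entropy(labels_right))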

Similarly, the CART algorithm (Breiman, Leo, et al. Classification and Regression Trees. CRC Press, 1984) can also use the Gini impurity to decide the split points, calculated as:

$$I_G = \sum_{k} f_k\,(1 - f_k) = 1 - \sum_{k} f_k^2$$

Inspecting this formula, you can see that it is minimized (reaching zero) when one class is near a value of $f_k = 1$ while all others are near 0, so splits are chosen to make the resulting branches as pure as possible.
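
A corresponding sketch for the Gini impurity of a single node:

import numpy as np

def gini_impurity(labels):
    # sum over classes of f_k * (1 - f_k); zero for a node containing a single class
    _, counts = np.unique(labels, return_counts=True)
    fractions = counts / counts.sum()
    return np.sum(fractions * (1 - fractions))

print(gini_impurity(np.array([1, 1, 1, 1])))   # 0.0 for a pure node
print(gini_impurity(np.array([0, 1, 0, 1])))   # 0.5 for a maximally mixed two-class node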

How could we deal with null values and missing data in such a model? In scikit-learn, the current decision tree implementation does not accommodate missing values, so we either need to insert a placeholder value (such as -1), drop the affected records, or impute them (for example, by replacing them with the column mean); see the aside below for more details. However, some implementations (such as the gbm package in the R statistical programming language) treat missing data as a third branch into which to sort data points.
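
For example, assuming the training features are held in a pandas DataFrame (the news dataset used in this chapter may not actually contain missing values, so treat this as a sketch, with num_imgs chosen purely as an illustrative column), the placeholder and mean-imputation options might look like this:

>>> news_features_train = news_features_train.fillna(-1)   # placeholder for missing values
>>> col_mean = news_features_train['num_imgs'].mean()      # or, alternatively, impute with the column mean
>>> news_features_train['num_imgs'] = news_features_train['num_imgs'].fillna(col_mean)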

Similar diversity exists in the treatment of categorical data. The current scikit-learn implementation expects only numerical columns, meaning that categorical features such as gender or country need to be encoded as binary indicator columns. Other packages, such as the tree implementations in R, handle categorical data by grouping rows into buckets based on their feature value, sorting the buckets by average response, and then assigning buckets to the left and right branches of the tree.
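
For instance, a hypothetical country column (not part of the news dataset) could be expanded into binary indicator columns with pandas before being passed to scikit-learn:

>>> import pandas as pd
>>> df = pd.DataFrame({'country': ['US', 'UK', 'US', 'DE']})
>>> pd.get_dummies(df, columns=['country'])   # produces country_DE, country_UK, country_US indicators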

Tip

Aside: dealing with missing data

When dealing with missing values in data, we need to consider several possibilities. One is whether the data is missing at random or missing not at random. In the first case, there is no relationship between the response variable and the fact that the data is missing, so we could assign a dummy value (such as -1), remove the whole row with missing data from our analysis, or substitute the column mean or median as a placeholder. We could also consider more sophisticated approaches, such as training a regression model that uses all the other input variables as predictors and the column with the missing data as the output response, and deriving imputed values from the predictions of this model. If the data is missing not at random, then simply encoding the data with a placeholder is probably not sufficient, as the placeholder value is itself correlated with the response. In this scenario, we may remove the rows with missing data or, if this is not possible, employ the model-based approach, which is preferable for inferring the missing elements because it should predict values that follow the same distribution as the rest of the column.
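
A minimal sketch of the model-based approach described above, assuming a DataFrame df whose other columns are complete and numeric; df and target_col are hypothetical names, and a simple linear regression stands in for whatever model you prefer:

from sklearn.linear_model import LinearRegression

def impute_with_model(df, target_col):
    # fit a regression on rows where target_col is observed, then predict the missing entries
    known = df[df[target_col].notnull()]
    missing_mask = df[target_col].isnull()
    predictors = [c for c in df.columns if c != target_col]
    model = LinearRegression().fit(known[predictors], known[target_col])
    df.loc[missing_mask, target_col] = model.predict(df.loc[missing_mask, predictors])
    return df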

In practice, while constructing the tree, we usually have some stopping rule, such as the minimum number of observations that are needed to form a leaf node (otherwise the predicted response could come from a small number of data points, which usually increases the error of the predictions).

It is not clear at the outset how many times the tree should branch. If there are too few splits (decision points), then few rules can be applied to subdivide the dataset and the resulting accuracy of the model may be low. If there are too many splits in a very deep tree, then the model may not generalize well to a new set of data. For our example, let us try fitting the tree to a number of different depths:

>>> from sklearn.tree import DecisionTreeRegressor
>>> max_depths = [2,4,6,8,32,64,128,256]
>>> dtrees = []
>>> for m in max_depths:
…    dtrees.append(DecisionTreeRegressor(min_samples_leaf=20,max_depth=m).
…    fit(news_features_train, news_shares_train))

Now, we can evaluate the results by plotting the R2 against the tree depth for each model:

>>> r2_values = []
>>> for d in dtrees:
…    r2_values.append(d.score(news_features_test, news_shares_test))
>>> plt.plot(max_depths,r2_values,color='red')
>>> plt.xlabel('maximum depth')
>>> plt.ylabel('r-squared')

Looking at the performance on the test set, we can see that the gains in performance drop quickly once we make the tree too deep:

[Plot of test-set R2 versus maximum tree depth]

Unfortunately, the tree model is still not performing much better than our basic linear regression. To improve on this, we can increase the number of trees rather than the depth of the trees. The intuition here is that a set of shallower trees may, in combination, capture relationships that are difficult for a single deep tree to approximate. This approach of using a combination of smaller models to fit complex relationships is used in the Random Forest algorithm discussed in the following section, in Gradient Boosted Decision Trees (Chapter 5, Putting Data in its Place – Classification Methods and Analysis), and, in a sense, in the deep learning models we will discuss in Chapter 7, Learning from the Bottom Up – Deep Networks and Unsupervised Features.

Random forest

While the idea of capturing non-linear relationships seems reasonable, it may be difficult to construct a single tree that captures such complex relationships between input and output. What if we were to average over many simpler decision trees? This is the essence of the Random Forest algorithm (Ho, Tin Kam. "Random decision forests." Proceedings of the Third International Conference on Document Analysis and Recognition. Vol. 1. IEEE, 1995; Breiman, Leo. "Random forests." Machine Learning 45.1 (2001): 5-32), in which we construct many trees to explore the space of possible non-linear interactions.

Random Forests are an innovation on the concept of Bootstrap Aggregation (Bagging) for tree models (Breiman, Leo. "Bagging predictors." Machine Learning 24.2 (1996): 123-140). In the generic Bagging algorithm, we construct a large number of trees by repeatedly sampling, with replacement, a subset of data points from the training data and building a tree on only that subset. While the individual trees are relatively weak models, by averaging over a large number of trees we can often achieve better prediction performance. Conceptually, this is because instead of trying to fit a single model (such as a single line) to the response, we approximate the response with an ensemble of small models that each fit a simpler pattern in the input data.
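
As a sketch, a plain Bagging model (without the extra randomization of variables described next) could be fit with scikit-learn's BaggingRegressor, whose default base estimator is a decision tree; the parameter values below are illustrative only:

>>> from sklearn.ensemble import BaggingRegressor
>>> bagged = BaggingRegressor(n_estimators=100, max_samples=0.5)
>>> bagged.fit(news_features_train, news_shares_train)
>>> bagged.score(news_features_test, news_shares_test)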

Random Forests are a further development on the idea of Bagging by randomizing not just the data used to build each tree, but the variables as well. At each step we also only consider a random subset (for example, of size equal to the square root of the total number of columns) of the columns of X while constructing the splits in the tree (Hastie, Trevor, et al. "The elements of statistical learning: data mining, inference and prediction." The Mathematical Intelligencer 27.2 (2005): 83-85.). If we used all input columns at each round of training, we would tend to select the same variables that are most strongly correlated with the response. By instead randomly selecting a subset of variables, we can also find patterns among weaker predictors and more widely cover the space of possible feature interactions. As with Bagging, we follow this process of random data and variable selection many times, and then average together the predictions of all trees to reach an overall prediction. Again, we can explore whether varying a parameter (the number of trees) improves performance on the test set:

>>> from sklearn import ensemble
>>> rforests = []
>>> num_trees = [2,4,6,8,32,64,128,256]
>>> for n in num_trees:
…    rforests.append(ensemble.RandomForestRegressor(n_estimators=n,
…    min_samples_leaf=20).fit(news_features_train, news_shares_train))

Finally, we can start to see some increases in the accuracy of the model when we plot the results using the following code:

>>> r2_values_rforest = []
>>> for f in rforests:
 …   r2_values_rforest.append(f.score(news_features_test, news_shares_test))
>>> plt.plot(num_trees,r2_values_rforest,color='red')
>>> plt.xlabel('Number of Trees')
>>> plt.ylabel('r-squared')

[Plot of test-set R2 versus number of trees]

Like the linear regression model, the random forest gives us a ranking of feature importance. While for linear regression this is simply the magnitude of the slopes, in the random forest model the importance of a feature is determined in a more complex manner. Intuitively, if we were to shuffle the values of a particular column among the rows of the dataset, the performance of the model should drop if that column is important. By measuring the average effect of this permutation across all trees, and dividing by the standard deviation of this effect, we obtain a ranking of both the magnitude and the consistency of a variable's impact on the model's performance (scikit-learn's feature_importances_ attribute is instead computed from the average reduction in the split criterion contributed by each feature, but it serves the same ranking purpose). We can examine the important variables using the following commands, which select the feature importance values of the largest random forest. Since the np.argsort command returns indices in ascending order by default, we use the [::-1] slice to invert the order so that the largest importance values appear at the beginning.

>>> ix = np.argsort(abs(rforests[-1].feature_importances_))[::-1]
>>> news_trimmed_features.columns[ix]

This gives the following result:

Index(['kw_avg_avg', 'self_reference_avg_sharess', 'timedelta', 'LDA_01',
       'kw_max_avg', 'n_unique_tokens', 'data_channel_is_tech', 'LDA_02',
       'self_reference_min_shares', 'n_tokens_content', 'LDA_03', 'kw_avg_max',
       'global_rate_negative_words', 'avg_negative_polarity',
       'global_rate_positive_words', 'average_token_length', 'num_hrefs',
       'is_weekend', 'global_subjectivity', 'kw_avg_min',
       'n_non_stop_unique_tokens', 'kw_min_max', 'global_sentiment_polarity',
       'kw_max_min', 'LDA_04', 'kw_min_avg', 'min_positive_polarity',
       'num_self_hrefs', 'avg_positive_polarity', 'self_reference_max_shares',
       'title_sentiment_polarity', 'max_positive_polarity', 'n_tokens_title',
       'abs_title_sentiment_polarity', 'abs_title_subjectivity',
       'title_subjectivity', 'min_negative_polarity', 'num_imgs',
       'data_channel_is_socmed', 'rate_negative_words', 'num_videos',
       'max_negative_polarity', 'rate_positive_words', 'kw_min_min',
       'num_keywords', 'data_channel_is_entertainment', 'weekday_is_wednesday',
       'data_channel_is_lifestyle', 'weekday_is_friday', 'weekday_is_monday',
       'kw_max_max', 'data_channel_is_bus', 'data_channel_is_world',
       'n_non_stop_words', 'weekday_is_saturday', 'weekday_is_tuesday',
       'weekday_is_thursday'],
      dtype='object')

Interestingly, if you compare this list with the linear regression model, the order is quite different. Promisingly, this suggests that the random forest was able to incorporate patterns that a linear regression cannot capture, resulting in the gains in R2 seen in this section.
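
The shuffling idea described above can also be sketched by hand: permute one column of the test features, re-score the model, and record the drop in R2. The helper below is our own illustration and works on a NumPy copy of the features (adjust the indexing if your features are stored differently); newer versions of scikit-learn also provide a built-in sklearn.inspection.permutation_importance utility:

import numpy as np

def permutation_drop(model, X_test, y_test, col_ix):
    # drop in R^2 after shuffling a single column (by index) of the test features
    baseline = model.score(X_test, y_test)
    X_shuffled = np.array(X_test, copy=True)
    X_shuffled[:, col_ix] = np.random.permutation(X_shuffled[:, col_ix])
    return baseline - model.score(X_shuffled, y_test)

drops = [permutation_drop(rforests[-1], news_features_test, news_shares_test, i)
         for i in range(np.shape(news_features_test)[1])]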

There is also a somewhat subtle problem in this dataset: all of the categorical variables have been encoded as binary flags, so the variable importance is computed individually for each member of a category. If one member of a category is highly correlated with the response while another is not, these individual importance measures give an inaccurate picture of the true importance of the underlying variable. One solution is to average the resulting values over all members of a category, a correction that we will not apply here but raise as a consideration for your future analyses.
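
If you did want to apply such a correction, a sketch for the weekday indicator columns in this dataset might look like the following (averaging is used here to match the text, though summing the importances is also common):

import numpy as np

# map each feature name to its importance in the largest forest
importances = dict(zip(news_trimmed_features.columns, rforests[-1].feature_importances_))

# average the importance over all indicator columns belonging to the weekday category
weekday_cols = [c for c in news_trimmed_features.columns if c.startswith('weekday_is_')]
weekday_importance = np.mean([importances[c] for c in weekday_cols])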

Here we provide a visual flowchart illustrating many of the tradeoffs we have discussed in this chapter on regression analysis. While it is difficult to provide comprehensive rules-of-thumb for all scenarios, it can serve as a starting point for diagnosing which method to apply for a given problem:

[Figure: Flowchart for regression analysis]
