Let's use decision trees to create software that can block banner ads on web pages. This program will predict whether each of the images on a web page is an advertisement or article content. Images that are classified as being advertisements could then be hidden using Cascading Style Sheets. We will train a decision tree classifier using the Internet Advertisements Data Set from http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements, which contains data for 3,279 images. The proportions of the classes are skewed; 459 of the images are advertisements and 2,820 are content. Decision tree learning algorithms can produce biased trees from data with unbalanced class proportions; we will evaluate a model on the unaltered data set before deciding if it is worth balancing the training data by over- or under-sampling instances. The explanatory variables are the dimensions of the image, words from the containing page's URL, words from the image's URL, the image's alt text, the image's anchor text, and a window of words surrounding the image tag. The response variable is the image's class. The explanatory variables have already been transformed into feature representations. The first three features are real numbers that encode the width, height, and aspect ratio of the images. The remaining features encode binary term frequencies for the text variables. In the following sample, we will grid search for the hyperparameter values that produce the decision tree with the greatest accuracy, and then evaluate the tree's performance on a test set:
import pandas as pd from sklearn.tree import DecisionTreeClassifier from sklearn.cross_validation import train_test_split from sklearn.metrics import classification_report from sklearn.pipeline import Pipeline from sklearn.grid_search import GridSearchCV
First we read the .csv
file using pandas. The .csv
does not have a header row, so we split the last column containing the response variable's values from the features using its index:
if __name__ == '__main__': df = pd.read_csv('data/ad.data', header=None) explanatory_variable_columns = set(df.columns.values) response_variable_column = df[len(df.columns.values)-1] # The last column describes the targets explanatory_variable_columns.remove(len(df.columns.values)-1) y = [1 if e == 'ad.' else 0 for e in response_variable_column] X = df[list(explanatory_variable_columns)]
We encoded the advertisements as the positive class and the content as the negative class. More than one quarter of the instances are missing at least one of the values for the image's dimensions. These missing values are marked by whitespace and a question mark. We replaced the missing values with negative one, but we could have imputed the missing values; for instance, we could have replaced the missing height values with the average height value:
X.replace(to_replace=' *?', value=-1, regex=True, inplace=True)
We then split the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y)
We created a pipeline and an instance of DecisionTreeClassifier
. Then, we set the criterion
keyword argument to entropy
to build the tree using the information gain heuristic:
pipeline = Pipeline([ ('clf', DecisionTreeClassifier(criterion='entropy')) ])
Next, we specified the hyperparameter space for the grid search:
parameters = { 'clf__max_depth': (150, 155, 160), 'clf__min_samples_split': (1, 2, 3), 'clf__min_samples_leaf': (1, 2, 3) }
We set GridSearchCV()
to maximize the model's F1 score:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='f1') grid_search.fit(X_train, y_train) print 'Best score: %0.3f' % grid_search.best_score_ print 'Best parameters set:' best_parameters = grid_search.best_estimator_.get_params() for param_name in sorted(parameters.keys()): print ' %s: %r' % (param_name, best_parameters[param_name]) predictions = grid_search.predict(X_test) print classification_report(y_test, predictions) Fitting 3 folds for each of 27 candidates, totalling 81 fits [Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 1.7s [Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 15.0s [Parallel(n_jobs=-1)]: Done 71 out of 81 | elapsed: 20.7s remaining: 2.9s [Parallel(n_jobs=-1)]: Done 81 out of 81 | elapsed: 23.3s finished Best score: 0.878 Best parameters set: clf__max_depth: 155 clf__min_samples_leaf: 2 clf__min_samples_split: 1 precision recall f1-score support 0 0.97 0.99 0.98 710 1 0.92 0.81 0.86 110 avg / total 0.96 0.96 0.96 820
The classifier detected more than 80 percent of the ads in the test set, and approximately 92 percent of the images that it predicted were ads were truly ads. Overall, the performance is promising; in following sections, we will try to modify our model to improve its performance.
Ensemble learning methods combine a set of models to produce an estimator that has better predictive performance than its individual components. A random forest is a collection of decision trees that have been trained on randomly selected subsets of the training instances and explanatory variables. Random forests usually make predictions by returning the mode or mean of the predictions of their constituent trees; scikit-learn's implementations return the mean of the trees' predictions. Random forests are less prone to overfitting than decision trees because no single tree can learn from all of the instances and explanatory variables; no single tree can memorize all of the noise in the representation.
Let's update our ad blocker's classifier to use a random forest. It is simple to replace the DecisionTreeClassifier
using scikit-learn's API; we simply replace the object with an instance of RandomForestClassifier
. Like the previous example, we will grid search to find the values of the hyperparameters that produce the random forest with the best predictive performance.
First, import the RandomForestClassifier
class from the ensemble
module:
from sklearn.ensemble import RandomForestClassifier
Next, replace the DecisionTreeClassifier
in the pipeline
with an instance of RandomForestClassifier
and update the hyperparameter space:
pipeline = Pipeline([ ('clf', RandomForestClassifier(criterion='entropy')) ]) parameters = { 'clf__n_estimators': (5, 10, 20, 50), 'clf__max_depth': (50, 150, 250), 'clf__min_samples_split': (1, 2, 3), 'clf__min_samples_leaf': (1, 2, 3) }
The output is as follows:
Fitting 3 folds for each of 108 candidates, totalling 324 fits [Parallel(n_jobs=-1)]: Done 1 jobs | elapsed: 1.1s [Parallel(n_jobs=-1)]: Done 50 jobs | elapsed: 17.4s [Parallel(n_jobs=-1)]: Done 200 jobs | elapsed: 1.0min [Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed: 1.6min finished Best score: 0.936 Best parameters set: clf__max_depth: 250 clf__min_samples_leaf: 1 clf__min_samples_split: 3 clf__n_estimators: 20 precision recall f1-score support 0 0.97 1.00 0.98 705 1 0.97 0.83 0.90 115 avg / total 0.97 0.97 0.97 820
Replacing the single decision tree with a random forest resulted in a significant reduction of the error rate; the random forest improves the precision and recall for ads to 0.97 and 0.83.
The compromises associated with using decision trees are different than those of the other models we discussed. Decision trees are easy to use. Unlike many learning algorithms, decision trees do not require the data to have zero mean and unit variance. While decision trees can tolerate missing values for explanatory variables, scikit-learn's current implementation cannot. Decision trees can even learn to ignore explanatory variables that are not relevant to the task.
Small decision trees can be easy to interpret and visualize with the export_graphviz
function from scikit-learn's tree
module. The branches of a decision tree are conjunctions of logical predicates, and they are easily visualized as flowcharts. Decision trees support multioutput tasks, and a single decision tree can be used for multiclass classification without employing a strategy like one-versus-all.
Like the other models we discussed, decision trees are eager learners. Eager learners must build an input-independent model from the training data before they can be used to estimate the values of test instances, but can predict relatively quickly once the model has been built. In contrast, lazy learners such as the k-nearest neighbors algorithm defer all generalization until they must make a prediction. Lazy learners do not spend time training, but often predict slowly compared to eager learners.
Decision trees are more prone to overfitting than many of the models we discussed, as their learning algorithms can produce large, complicated decision trees that perfectly model every training instance but fail to generalize the real relationship. Several techniques can mitigate over-fitting in decision trees. Pruning is a common strategy that removes some of the tallest nodes and leaves of a decision tree, but it is not currently implemented in scikit-learn. However, similar effects can be achieved by setting a maximum depth for the tree or by creating child nodes only when the number of training instances they will contain exceeds a threshold. The DecisionTreeClassifier
and DecisionTreeRegressor
classes provide keyword arguments to set these constraints. Creating a random forest can also reduce over-fitting.
Efficient decision tree learning algorithms like ID3 are greedy. They learn efficiently by making locally optimal decisions, but are not guaranteed to produce the globally optimal tree. ID3 constructs a tree by selecting a sequence of explanatory variables to test. Each explanatory variable is selected because it reduces the uncertainty in the node more than the other variables. It is possible, however, that locally suboptimal tests are required in order to find the globally optimal tree.
In our toy examples, the size of the tree did not matter since we retained all of nodes. In a real application, however, the tree's growth could be limited by pruning or similar mechanisms. Pruning trees with different shapes can produce trees with different performances. In practice, locally optimal decisions that are guided by the information gain or Gini impurity heuristics often result in an acceptable decision trees.