Extremely Randomized Trees, also known as the Extra trees algorithm differs from the Random Forest described in the previous recipe in two ways:
As you saw in the previous recipe, Random Forests used randomization in two places. First, in selecting the instances to be used for the training trees in the forest; bootstrap was used to select the training instances. Secondly, at every node a random set of attributes were selected. One attribute among them was selected based on either the gini impurity or entropy criterion. Extremely randomized trees go one step further and select the splitting attribute randomly.
Extremely Randomized Trees were proposed in the following paper:
P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized trees", Machine Learning, 63(1), 3-42, 2006.
According to this paper, there are two aspects, other than the technical aspects listed earlier, which make an Extremely Randomized Tree more suitable:
The rationale behind the Extra-Trees method is that the explicit randomization of the cut-point and attribute combined with ensemble averaging should be able to reduce variance more strongly than the weaker randomization schemes used by other methods.
Compared to a Random Forest, randomization of the cut-point ( the attribute selected to split the dataset at each node) combined with the randomization of cut-point, that is, ignoring any criteria, and finally, averaging the results from each of the tree, will result in a much superior performance on an unknown dataset.
The second advantage is regarding the compute complexity:
From the computational point of view, the complexity of the tree growing procedure is, assuming balanced trees, on the order of N log N with respect to learning the sample size, like most other tree growing procedures. However, given the simplicity of the node splitting procedure we expect the constant factor to be much smaller than in other ensemble based methods which locally optimize cut-points
Since no computation time is spent in identifying the best attribute to split, this method is more computationally efficient than Random Forests.
Let us write down the steps involved in building Extremely Random trees. The number of trees required in the forest is typically specified by the user. Let T be the number of trees required to be built.
We start with iterating from 1 through T, that is, we build T trees:
We are going to generate some classification datasets to demonstrate Extremely Randomized Trees. For that, we will leverage Scikit Learn's implementation of the Extremely Randomized Trees ensemble module.
We start by loading all the necessary libraries. Let us leverage the make_classification
method from the sklearn.dataset
module to generate the training data:
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, accuracy_score
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.grid_search import RandomizedSearchCV
from operator import itemgetter
def get_data():
"""
Make a sample classification dataset
Returns : Independent variable y, dependent variable x
"""
no_features = 30
redundant_features = int(0.1*no_features)
informative_features = int(0.6*no_features)
repeated_features = int(0.1*no_features)
x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
n_informative = informative_features, n_redundant = redundant_features
,n_repeated = repeated_features,random_state=7)
return x,y
We write the function build_forest
, where we will build fully grown trees, and proceed to evaluate the forest's performance:
def build_forest(x,y,x_dev,y_dev): """ Build a Extremely random tress and evaluate peformance """ no_trees = 100 estimator = ExtraTreesClassifier(n_estimators=no_trees,random_state=51) estimator.fit(x,y) train_predcited = estimator.predict(x) train_score = accuracy_score(y,train_predcited) dev_predicted = estimator.predict(x_dev) dev_score = accuracy_score(y_dev,dev_predicted) print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score) print "cross validated score" print cross_val_score(estimator,x_dev,y_dev,cv=5) def search_parameters(x,y,x_dev,y_dev): """ Search the parameters """ estimator = ExtraTreesClassifier() no_features = x.shape[1] no_iterations = 20 sqr_no_features = int(np.sqrt(no_features)) parameters = {"n_estimators" : np.random.randint(75,200,no_iterations), "criterion" : ["gini", "entropy"], "max_features" : [sqr_no_features,sqr_no_features*2,sqr_no_features*3,sqr_no_features+10] } grid = RandomizedSearchCV(estimator=estimator,param_distributions=parameters, verbose=1, n_iter=no_iterations,random_state=77,n_jobs=-1,cv=5) grid.fit(x,y) print_model_worth(grid,x_dev,y_dev) return grid.best_estimator_
Finally, we write a main function for invoking the functions that we have defined:
if __name__ == "__main__": x,y = get_data() # Divide the data into Train, dev and test x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9) x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9) build_forest(x_train,y_train,x_dev,y_dev) model = search_parameters(x,y,x_dev,y_dev)
Let us start with our main function. We invoke get_data
to get our predictor attributes in the response attributes. Inside get_data
, we leverage the make_classification dataset to generate the training data for our recipe as follows:
def get_data(): """ Make a sample classification dataset Returns : Independent variable y, dependent variable x """ no_features = 30 redundant_features = int(0.1*no_features) informative_features = int(0.6*no_features) repeated_features = int(0.1*no_features) x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03, n_informative = informative_features, n_redundant = redundant_features ,n_repeated = repeated_features,random_state=7) return x,y
Let us look at the parameters that are passed to the make_classification
method. The first parameter is the number of instances required; in this case, we say we need 500 instances. The second parameter is about the number of attributes required per instance. We say that we need 30. The third parameter, flip_y
, randomly interchanges 3 percent of the instances. This is done to introduce some noise in our data. The next parameter specifies the number of features out of those 30 features should be informative enough to be used in our classification. We have specified that 60 percent of our features, that is, 18 out of 30 should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce a correlation among the features. Finally, repeated features are the duplicate features, which are drawn randomly from both informative features and redundant features.
Let us split the data into the training and testing set using train_test_split
. We reserve 30 percent of our data for testing:
# Divide the data into Train, dev and test x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
Once again, we leverage train_test_split to split our test data into dev and test:
x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
With the data divided for building, evaluating, and testing the model, we proceed to build our models:
build_forest(x_train,y_train,x_dev,y_dev)
We invoke the build_forest
function with our training and dev data to build our Extremely Randomized trees model. Let us look inside that function:
no_trees = 100 estimator = ExtraTreesClassifier(n_estimators=no_trees,random_state=51) estimator.fit(x,y) train_predcited = estimator.predict(x) train_score = accuracy_score(y,train_predcited) dev_predicted = estimator.predict(x_dev) dev_score = accuracy_score(y_dev,dev_predicted) print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score) print "cross validated score" print cross_val_score(estimator,x_dev,y_dev,cv=5)
We need 100 trees in our ensemble, so we use the variable no_trees to define the number of trees. We leverage the ExtraTreesClassifier
class from Scikit learn. As you can see, we pass the number of trees required as a parameter. A point to note here is the parameter bootstrap. Refer to the following URL for the parameters for the ExtraTreesClassifier:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
The parameter bootstrap is set to False
by default. Compare it with the RandomForestClassifier
bootstrap parameter given at the following URL:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
As explained earlier, every tree in the forest is trained with all the records.
We proceed to fit our model as follows:
train_predcited = estimator.predict(x)
We then proceed to find the model accuracy score for our train and dev data:
train_score = accuracy_score(y,train_predcited) dev_predicted = estimator.predict(x_dev) dev_score = accuracy_score(y_dev,dev_predicted)
Let us print the scores for the training and dev dataset:
print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score)
Let us now do a five-fold cross validation to look at the model predictions:
Pretty good results. We almost have a 90 percent accuracy rate for one of the folds. We can do a randomized search across the parameter space as we did for Random Forest. Let us invoke the function search_parameters
with our train and test dataset. Refer to the previous recipe for an explanation of RandomizedSearchCV. We will then print the output of the search_parameters
function:
As in the previous recipe, we have ranked the models by their scores in descending order, thus showing the best model parameters in the beginning. We will choose these parameters as our model parameters. The attribute best_estimator_ will
return the model with these parameters.
What you see next is the classification report generated for the best estimator. The predict function will use best_estimator_ internally
. The report was generated by the following code:
dev_predicted = grid.predict(x_dev) print classification_report(y_dev,dev_predicted)
Great! We have a perfect model with a classification accuracy of 100 percent.
Extremely Randomized Trees are very popular with the time series classification problems. Refer to the following paper for more information:
Geurts, P., Blanco Cuesta A., and Wehenkel, L. (2005a). Segment and combine approach for biological sequence classification. In: Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 194–201.