Growing Extremely Randomized Trees

Extremely Randomized Trees, also known as the Extra-Trees algorithm, differ from the Random Forest described in the previous recipe in two ways:

  1. It does not use bootstrapping to select instances for every tree in the ensemble; instead, it uses the complete training dataset.
  2. Given K as the number of attributes to be randomly selected at a given node, it selects a random cut-point without considering the target variable.

As you saw in the previous recipe, Random Forests use randomization in two places: first, in selecting the instances used to train each tree in the forest, where bootstrapping is used to pick the training instances; second, at every node, where a random set of attributes is selected and one attribute among them is chosen based on either the Gini impurity or the entropy criterion. Extremely Randomized Trees go one step further and select both the splitting attribute and the cut-point randomly.

Extremely Randomized Trees were proposed in the following paper:

P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees", Machine Learning, 63(1), 3-42, 2006.

According to this paper, there are two aspects, besides the technical differences listed earlier, that make Extremely Randomized Trees more suitable:

The rationale behind the Extra-Trees method is that the explicit randomization of the cut-point and attribute combined with ensemble averaging should be able to reduce variance more strongly than the weaker randomization schemes used by other methods.

Compared to a Random Forest, randomizing both the attribute used to split the dataset at each node and the cut-point, that is, choosing them without reference to any splitting criterion, and then averaging the results from each tree, should result in superior performance on unseen data.

The second advantage concerns computational complexity:

From the computational point of view, the complexity of the tree growing procedure is, assuming balanced trees, on the order of N log N with respect to the learning sample size, like most other tree growing procedures. However, given the simplicity of the node splitting procedure, we expect the constant factor to be much smaller than in other ensemble-based methods which locally optimize cut-points.

Since no computation time is spent identifying the best attribute and cut-point at each split, this method is computationally more efficient than Random Forests.
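If you would like to see this difference on your own machine, a rough timing comparison along the following lines can be used. This sketch is not part of the original recipe; the dataset size, the number of trees, and the random seeds are arbitrary choices, and the exact timings will vary with your hardware and data:

import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# An arbitrary, moderately sized dataset for the comparison
x, y = make_classification(n_samples=5000, n_features=50, n_informative=30,
                           random_state=7)

for name, clf in [("Random Forest", RandomForestClassifier(n_estimators=100, random_state=7)),
                  ("Extra Trees", ExtraTreesClassifier(n_estimators=100, random_state=7))]:
    start = time.time()
    clf.fit(x, y)
    # Extra Trees typically fits faster, as no best cut-point search is done per node
    print "%s fit time = %0.2f seconds" % (name, time.time() - start)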

Let us write down the steps involved in building Extremely Randomized Trees. The number of trees required in the forest is typically specified by the user. Let T be the number of trees to be built.

We start by iterating from 1 through T, that is, we build T trees:

  • For each tree, we select the complete input dataset.
  • We then proceed to fit a tree t to the input data:
    • Select m attributes randomly.
    • Pick one of these m attributes randomly as the splitting variable, and choose a random cut-point for it.
    • Split the dataset into two. Remember that trees are binary in nature: at each level of the tree, the input dataset is split into two.
    • Perform the preceding three steps recursively on each of the two resulting datasets.
  • Finally, we return the T trees (a minimal sketch of this procedure follows the list).
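The following is a minimal, illustrative sketch of these steps using NumPy. The function names (split_node and grow_forest), the stopping rule, and the majority-class leaf prediction are our own simplifications for the purpose of illustration; they are not part of Scikit-learn's implementation, which we use in the recipe itself:

import numpy as np

def split_node(x, y, m, min_size=5):
    # Stop when the node is small or pure; predict the majority class
    if len(y) <= min_size or len(np.unique(y)) == 1:
        return {"leaf": True, "prediction": np.bincount(y).argmax()}

    # Step 1: select m candidate attributes at random
    candidates = np.random.choice(x.shape[1], m, replace=False)
    # Step 2: pick one of them at random as the splitting variable,
    # and a random cut-point, without looking at the target variable
    attribute = np.random.choice(candidates)
    cut_point = np.random.uniform(x[:, attribute].min(), x[:, attribute].max())

    # Step 3: split the dataset into two
    left = x[:, attribute] <= cut_point
    if left.all() or (~left).all():
        return {"leaf": True, "prediction": np.bincount(y).argmax()}

    # Step 4: recurse on both halves
    return {"leaf": False, "attribute": attribute, "cut_point": cut_point,
            "left": split_node(x[left], y[left], m, min_size),
            "right": split_node(x[~left], y[~left], m, min_size)}

def grow_forest(x, y, T, m):
    # Every tree is grown on the complete training dataset; no bootstrapping
    return [split_node(x, y, m) for _ in range(T)]

In practice, we rely on Scikit-learn's ExtraTreesClassifier, which implements this idea far more efficiently.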
Let us now take a look at the recipe for Extremely Randomized Trees.

Getting ready…

We are going to generate a classification dataset to demonstrate Extremely Randomized Trees. For that, we will leverage Scikit-learn's implementation of Extremely Randomized Trees from its ensemble module.

How to do it...

We start by loading all the necessary libraries. Let us leverage the make_classification method from the sklearn.datasets module to generate the training data:

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, accuracy_score
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.grid_search import RandomizedSearchCV
from operator import itemgetter
import numpy as np

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features 
            ,n_repeated = repeated_features,random_state=7)
    return x,y

We write the function build_forest, in which we build an ensemble of fully grown trees and evaluate the forest's performance:

def build_forest(x,y,x_dev,y_dev):
    """
    Build an ensemble of Extremely Randomized Trees
    and evaluate its performance
    """
    no_trees = 100
    estimator = ExtraTreesClassifier(n_estimators=no_trees,random_state=51)
    estimator.fit(x,y)
    
    train_predicted = estimator.predict(x)
    train_score = accuracy_score(y,train_predicted)
    dev_predicted = estimator.predict(x_dev)
    dev_score = accuracy_score(y_dev,dev_predicted)
    
    print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score)
    print "cross validated score"
    print cross_val_score(estimator,x_dev,y_dev,cv=5)

def search_parameters(x,y,x_dev,y_dev):
    """
    Search the parameter space with randomized search
    """
    estimator = ExtraTreesClassifier()
    no_features = x.shape[1]
    no_iterations = 20
    sqr_no_features = int(np.sqrt(no_features))

    parameters = {"n_estimators"      : np.random.randint(75,200,no_iterations),
                 "criterion"         : ["gini", "entropy"],
                 "max_features"      : [sqr_no_features,sqr_no_features*2,sqr_no_features*3,sqr_no_features+10]
                 }

    grid = RandomizedSearchCV(estimator=estimator,param_distributions=parameters,
    verbose=1, n_iter=no_iterations,random_state=77,n_jobs=-1,cv=5)
    grid.fit(x,y)
    # print_model_worth is defined in the previous (Random Forest) recipe
    print_model_worth(grid,x_dev,y_dev)
    
    return grid.best_estimator_

Finally, we write a main function for invoking the functions that we have defined:

if __name__ == "__main__":
    x,y = get_data()    

    # Divide the data into Train, dev and test    
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
    
    build_forest(x_train,y_train,x_dev,y_dev)
    model = search_parameters(x,y,x_dev,y_dev)

How it works…

Let us start with our main function. We invoke get_data to get our predictor attributes x and the response attribute y. Inside get_data, we leverage the make_classification function to generate the training data for our recipe, as follows:

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features 
            ,n_repeated = repeated_features,random_state=7)
    return x,y

Let us look at the parameters that are passed to the make_classification method. The first parameter is the number of instances required; in this case, we say we need 500 instances. The second parameter is the number of attributes required per instance; we say that we need 30. The third parameter, flip_y, randomly flips the class labels of 3 percent of the instances. This is done to introduce some noise into our data. The next parameter specifies how many of those 30 features should be informative enough to be used in our classification. We have specified that 60 percent of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, repeated features are duplicate features, which are drawn randomly from both the informative and the redundant features. That accounts for 18 informative, 3 redundant, and 3 repeated features; the remaining 6 features are filled with random noise.

Let us split the data into the training and testing set using train_test_split. We reserve 30 percent of our data for testing:

    # Divide the data into Train, dev and test    
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)

Once again, we leverage train_test_split to split our test data into dev and test:

    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
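As an optional sanity check (not part of the original recipe script), printing the shapes confirms that we end up with 350 training, 105 dev, and 45 test instances:

    # 70/30 split of 500 instances, then 70/30 split of the held-out 150
    print x_train.shape, x_dev.shape, x_test.shape   # (350, 30) (105, 30) (45, 30)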

With the data divided for building, evaluating, and testing the model, we proceed to build our models:

build_forest(x_train,y_train,x_dev,y_dev)

We invoke the build_forest function with our training and dev data to build our Extremely Randomized Trees model. Let us look inside that function:

    no_trees = 100
    estimator = ExtraTreesClassifier(n_estimators=no_trees,random_state=51)
    estimator.fit(x,y)
    
    train_predicted = estimator.predict(x)
    train_score = accuracy_score(y,train_predicted)
    dev_predicted = estimator.predict(x_dev)
    dev_score = accuracy_score(y_dev,dev_predicted)
    
    print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score)
    print "cross validated score"
    print cross_val_score(estimator,x_dev,y_dev,cv=5)

We need 100 trees in our ensemble, so we use the variable no_trees to define the number of trees. We leverage the ExtraTreesClassifier class from Scikit-learn. As you can see, we pass the number of trees required as a parameter. A point to note here is the parameter bootstrap. Refer to the following URL for the parameters of ExtraTreesClassifier:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

The bootstrap parameter is set to False by default. Compare it with the bootstrap parameter of RandomForestClassifier, given at the following URL:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

As explained earlier, every tree in the forest is trained with all the records.
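As a small optional check (not part of the recipe script), you can confirm the two defaults directly from the estimator objects:

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

print ExtraTreesClassifier().bootstrap      # False: each tree sees the full dataset
print RandomForestClassifier().bootstrap    # True: each tree sees a bootstrap sample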

Having fit our model, we use it to predict the labels of the training data, as follows:

    train_predicted = estimator.predict(x)

We then proceed to find the model accuracy score for our train and dev data:

    train_score = accuracy_score(y,train_predicted)
    dev_predicted = estimator.predict(x_dev)
    dev_score = accuracy_score(y_dev,dev_predicted)

Let us print the scores for the training and dev dataset:

    print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score)

Let us now do a five-fold cross-validation to look at the model's predictions (this is the cross_val_score call from build_forest):
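    print cross_val_score(estimator,x_dev,y_dev,cv=5)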

[Output: cross-validated accuracy scores on the dev data]

Pretty good results; we have almost a 90 percent accuracy rate for one of the folds. We can do a randomized search across the parameter space, as we did for Random Forest. Let us invoke the search_parameters function with our dataset and the dev data. Refer to the previous recipe for an explanation of RandomizedSearchCV. We will then print the output of the search_parameters function:

[Output: parameter combinations ranked by their cross-validated scores, best first]

As in the previous recipe, the models are ranked by their scores in descending order, so the best model parameters are shown first. We will choose these parameters as our model parameters. The best_estimator_ attribute will return the model with these parameters.
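The print_model_worth helper is defined in the previous recipe; purely as a reminder, a minimal sketch of such a ranking helper might look like the following, assuming the older grid_scores_ API exposed by sklearn.grid_search. The author's actual implementation may differ:

from operator import itemgetter
from sklearn.metrics import classification_report

def print_model_worth(grid, x_dev, y_dev):
    # Rank the searched parameter combinations by mean cross-validated score
    scores = sorted(grid.grid_scores_, key=itemgetter(1), reverse=True)
    for rank, score in enumerate(scores, start=1):
        print "Rank %d, mean score %0.3f, parameters %s" % (rank,
                score.mean_validation_score, score.parameters)
    # Evaluate the best estimator (used internally by predict) on the dev data
    dev_predicted = grid.predict(x_dev)
    print classification_report(y_dev, dev_predicted)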

What you see next is the classification report generated for the best estimator. The predict function will use best_estimator_ internally. The report was generated by the following code:

dev_predicted = grid.predict(x_dev)
print classification_report(y_dev,dev_predicted)

Great! We have a perfect model with a classification accuracy of 100 percent.

There's more…

Extremely Randomized Trees are very popular for time series and sequence classification problems. Refer to the following paper for more information:

Geurts, P., Blanco Cuesta, A., and Wehenkel, L. (2005). Segment and combine approach for biological sequence classification. In: Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 194-201.

See also

  • Building Decision Trees to solve Multi Class Problems recipe in Chapter 6, Machine Learning I
  • Understanding Ensemble, Bagging Method recipe in Chapter 8, Model Selection and Evaluation
  • Growing from trees to Forest, Random Forest recipe in Chapter 9, Machine Learning III