Understanding Ensemble – Bagging Method

Ensemble methods belong to the family of methods known as committee-based learning. Instead of leaving the classification or regression decision to a single model, an ensemble uses a group of models to make the decision. Bagging is a famous and widely used ensemble method.

Bagging is also known as bootstrap aggregation. Bagging is effective only if we can introduce variability in the underlying models; that is, if we can successfully introduce variability in the underlying datasets, we obtain models with slight variations.

We leverage bootstrapping to feed variability in our dataset to these models. Bootstrapping is the process of randomly sampling the given dataset for a specified number of instances, with or without replacement. In bagging, we leverage bootstrapping to generate, say, m different datasets and construct a model on each of them. Finally, in the case of regression problems, we average the outputs of all the models to produce the final prediction.

Let us say we bootstrap the data m times. We would then have m models and, for a given input, m predicted values. In the case of regression, our final prediction is the average of these m outputs:

$$\hat{y} = \frac{1}{m}\sum_{i=1}^{m} y_i$$

In case of classification problems, the final output is decided based on voting. Let us say we have one hundred models in our ensemble, and we have a two-class classification problem with class labels as {+1,-1}. If more than 50 models predict the output as +1, we declare the prediction as +1.
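
As a minimal sketch of these two ideas (illustrative only; the index range and the vote values below are hypothetical and not part of this recipe), bootstrapping draws instances at random with replacement, and a two-class ensemble prediction can be settled by counting votes:

import numpy as np

rng = np.random.RandomState(7)

# Bootstrap: draw as many instances as we have, with replacement
n_instances = 10
bootstrap_indices = rng.choice(n_instances, size=n_instances, replace=True)

# Majority voting over hypothetical predictions from a five-model ensemble
votes = np.array([+1, -1, +1, +1, -1])
final_prediction = +1 if (votes == +1).sum() > len(votes) / 2.0 else -1

print(bootstrap_indices)
print(final_prediction)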

Randomization is another technique by which variability can be introduced into the model building exercise. An example is to randomly pick a subset of attributes for each model in the ensemble. That way, different models will have different sets of attributes. This technique is called the random subspaces method.
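
A minimal sketch of the idea, assuming a hypothetical dataset x with 30 attributes (this snippet is illustrative and not part of the recipe), would assign every model its own randomly chosen set of attribute indices:

import numpy as np

rng = np.random.RandomState(0)
n_models, n_features, subset_size = 3, 30, 21

# Each model in the ensemble gets its own random subset of attribute indices
feature_subsets = [rng.choice(n_features, size=subset_size, replace=False)
                   for _ in range(n_models)]
# Model i would then be trained only on x[:, feature_subsets[i]]
print(feature_subsets)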

With very stable models, Bagging may not achieve great results. Bagging helps most when the underlying classifier is very sensitive to even small changes in the data. For example, decision trees are very unstable, and unpruned decision trees are a good candidate for Bagging. A K-Nearest Neighbor classifier, on the other hand, is a very stable model. However, we can leverage random subspaces to introduce some instability into the nearest neighbor method.

In the following recipe, you will learn how to leverage Bagging and Random subspaces on a K-Nearest Neighbor algorithm. We will take up a classification problem, and the final prediction will be based on majority voting.

Getting ready…

We will leverage scikit-learn's KNeighborsClassifier class for classification and the BaggingClassifier class for applying the bagging principle. We will generate data for this recipe using the make_classification convenience function.

How to do it…

Let us import the necessary libraries, and write a function get_data() to provide us with a dataset to work through this recipe:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    print(no_features,redundant_features,informative_features,repeated_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features,
            n_repeated = repeated_features,random_state=7)
    return x,y

Let us proceed to write three functions:

  • The build_single_model function to make a simple K-Nearest Neighbor model with the given data
  • The build_bagging_model function, which implements the Bagging routine
  • The view_model function to inspect the model that we have built

def build_single_model(x,y):
    model = KNeighborsClassifier()
    model.fit(x,y)
    return model


def build_bagging_model(x,y):
    bagging = BaggingClassifier(KNeighborsClassifier(),n_estimators=100,random_state=9,
            max_samples=1.0,max_features=0.7,bootstrap=True,bootstrap_features=True)
    bagging.fit(x,y)
    return bagging

def view_model(model):
    print("\nSampled attributes in top 10 estimators\n")
    for i,feature_set in enumerate(model.estimators_features_[0:10]):
        print("estimator %d"%(i+1),feature_set)

Finally, we will write our main function, which will call the other functions:

if __name__ == "__main__":
    x,y = get_data()

    # Divide the data into Train, dev and test
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)

    # Build a single model
    model = build_single_model(x_train,y_train)
    predicted_y = model.predict(x_train)
    print("\nSingle Model Accuracy on training data\n")
    print(classification_report(y_train,predicted_y))

    # Build a bag of models
    bagging = build_bagging_model(x_train,y_train)
    predicted_y = bagging.predict(x_train)
    print("\nBagging Model Accuracy on training data\n")
    print(classification_report(y_train,predicted_y))
    view_model(bagging)

    # Look at the dev set
    predicted_y = model.predict(x_dev)
    print("\nSingle Model Accuracy on Dev data\n")
    print(classification_report(y_dev,predicted_y))

    print("\nBagging Model Accuracy on Dev data\n")
    predicted_y = bagging.predict(x_dev)
    print(classification_report(y_dev,predicted_y))

How it works…

Let us start with the main method. We first call the get_data function to return the dataset as a matrix x of predictors and a vector y for the response variable. Let us look into the get_data function:

    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features,
            n_repeated = repeated_features,random_state=7)

Take a look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we say we need 500 instances. The second parameter is the number of attributes required per instance; we say that we need 30 of them, as defined by the variable no_features. The third parameter, flip_y, randomly flips the class labels of 3 percent of the instances. This is done to introduce some noise into our data. The next parameter specifies how many of those 30 features should be informative enough to be used in our classification. We have specified that 60 percent of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features. These are generated as linear combinations of the informative features in order to introduce correlation among the features. Finally, repeated features are duplicate features, which are drawn randomly from both the informative and the redundant features.
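
As a quick sanity check (an illustrative snippet that assumes the get_data function defined earlier; it is not part of the recipe's listing), you can confirm the shape of the generated data and the rough class balance:

import numpy as np

x, y = get_data()
print(x.shape)          # (500, 30)
print(np.bincount(y))   # counts per class label, roughly balanced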

Let us split the data into a training and a testing set using train_test_split. We reserve 30 percent of our data for testing:

    # Divide the data into Train, dev and test    
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)

Once again, we leverage train_test_split to split our test data into dev and test sets:

    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)

Having divided the data for building, evaluating, and testing the model, we proceed to build our models. We are going to initially build a single model using KNeighborsClassifier by invoking the following:

model = build_single_model(x_train,y_train)

Inside this function, we create an object of type KNeighborsClassifier and fit our data, as follows:

def build_single_model(x,y):
    model = KNeighborsClassifier()
    model.fit(x,y)
    return model

As explained in the previous section, K-Nearest Neighbor is a very stable algorithm. Let us see how this model performs. We perform our predictions on the training data and look at our model metrics:

    predicted_y = model.predict(x_train)
    print("\nSingle Model Accuracy on training data\n")
    print(classification_report(y_train,predicted_y))

classification_report is a convenience function in the metrics module of scikit-learn. It prints a table of precision, recall, and f1-score:

[Output: classification report for the single KNN model on the training data]

On the 350 training instances, our precision is 87 percent. With this figure in hand, let us proceed to build our bagging model:

    bagging = build_bagging_model(x_train,y_train)

We invoke the function build_bagging_model with our training data to build a bag of classifiers, as follows:

def build_bagging_model(x,y):
    bagging = BaggingClassifier(KNeighborsClassifier(),n_estimators=100,random_state=9,
            max_samples=1.0,max_features=0.7,bootstrap=True,bootstrap_features=True)
    bagging.fit(x,y)
    return bagging

Inside the method, we instantiate the BaggingClassifier class. Let us look at the arguments that we pass to it.

The first argument is the underlying estimator or model. By passing KNeighborsClassifier, we are telling the bagging classifier that we want to build a bag of K-Nearest Neighbor classifiers. The next parameter specifies the number of estimators that we will build; in this case, we are saying we need 100 of them. The random_state argument is the seed used by the random number generator; in order to be consistent across different runs, we set it to an integer value.

Our next parameter is max_samples, where we specify the number of instances (here, as a fraction of the dataset) to be drawn for each estimator when we bootstrap from our input dataset. In this case, we are asking the bagging routine to draw as many instances as there are in the training data; since bootstrap is set to True, they are drawn with replacement.

Next, the parameter max_features specifies the fraction of attributes to be included when bootstrapping features for an estimator. We say that we want to include only 70 percent of the attributes. Thus, each estimator/model inside the ensemble will use a different subset of the attributes to build the model. This is the random subspaces methodology that we introduced in the previous section. The function proceeds to fit the model and returns it to the calling function.
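
As a rough check of this behavior (an illustrative snippet that assumes x_train, y_train, and the build_bagging_model function from this recipe), each estimator draws int(0.7 * 30) = 21 feature indices; because bootstrap_features=True samples them with replacement, the number of distinct attributes per estimator can be slightly lower:

bagging = build_bagging_model(x_train, y_train)
for feature_set in bagging.estimators_features_[0:3]:
    # 21 indices drawn per estimator, possibly with repeats
    print(len(feature_set), len(set(feature_set)))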

    bagging = build_bagging_model(x_train,y_train)
    predicted_y = bagging.predict(x_train)
    print("\nBagging Model Accuracy on training data\n")
    print(classification_report(y_train,predicted_y))

Let us look at the model accuracy:

[Output: classification report for the bagging model on the training data]

You can see a big jump in the model metrics.

Before we test our models with our dev dataset, let us look at the attributes that were allocated to the different models, by invoking the view_model function:

    view_model(bagging)

We print the attributes selected for the first ten models, as follows:

def view_model(model):
    print("\nSampled attributes in top 10 estimators\n")
    for i,feature_set in enumerate(model.estimators_features_[0:10]):
        print("estimator %d"%(i+1),feature_set)

[Output: feature indices sampled by the first ten estimators]

As you can make out from the result, we have assigned attributes to every estimator pretty much randomly. In this way, we introduce variability into each of our estimators.

Let us proceed to check how our single classifier and bag of estimators have performed in our dev set:

    # Look at the dev set
    predicted_y = model.predict(x_dev)
    print("\nSingle Model Accuracy on Dev data\n")
    print(classification_report(y_dev,predicted_y))

    print("\nBagging Model Accuracy on Dev data\n")
    predicted_y = bagging.predict(x_dev)
    print(classification_report(y_dev,predicted_y))

[Output: classification reports for the single model and the bagging model on the dev data]

As expected, our bag of estimators has performed better in our dev set as compared to our single classifier.

There's more…

As we said earlier, in the case of classification, the label with the majority of votes is considered the final prediction. Instead of the voting scheme, we can ask the constituent models to output prediction probabilities for the labels and take the average of these probabilities to decide the final output label. In scikit-learn's case, the API documentation provides the details of how the final prediction is performed:

'The predicted class of an input sample is computed as the class with the highest mean predicted probability. If base estimators do not implement a predict_proba method, then it resorts to voting.'

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
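
To make the averaging concrete, the following illustrative snippet (assuming the bagging model and x_dev array from this recipe) prints the mean class probabilities for a few dev instances next to the predicted labels; predict simply picks the class whose mean probability is highest:

probabilities = bagging.predict_proba(x_dev[0:5])
predictions = bagging.predict(x_dev[0:5])
for probs, label in zip(probabilities, predictions):
    # label corresponds to the column with the higher mean probability
    print(probs, label)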

In the last chapter, we discussed cross validation. Although cross validation may look very similar to bagging, it has a different use in practice. In cross validation, we create K folds and, based on the model output from those folds, we may choose the parameters for our model, just as we selected the alpha value for ridge regression. This is done primarily to avoid exposing our test data during the model building exercise. Cross validation can, however, be used with Bagging to determine the number of estimators we need to add to our bagging ensemble.
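
The following is a minimal sketch of that idea (illustrative only; it assumes the x_train and y_train arrays from this recipe, and the candidate ensemble sizes listed here are arbitrary choices): we score bagging ensembles of different sizes with cross validation and keep the size that scores best:

from sklearn.model_selection import cross_val_score

best_score, best_size = 0.0, None
for n_estimators in [10, 50, 100, 200]:
    candidate = BaggingClassifier(KNeighborsClassifier(), n_estimators=n_estimators,
            random_state=9, max_samples=1.0, max_features=0.7,
            bootstrap=True, bootstrap_features=True)
    # Mean accuracy over 5 cross-validation folds of the training data
    score = cross_val_score(candidate, x_train, y_train, cv=5).mean()
    if score > best_score:
        best_score, best_size = score, n_estimators
print(best_size, best_score)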

However, a drawback of Bagging is that we lose the interpretability of the model. Consider a simple decision tree obtained after pruning: it is very easy to explain the decision tree model. But once we have a bag of 100 such models, the ensemble becomes a black box. We trade interpretability for increased accuracy.

Please refer to the following paper by Leo Breiman for more information about bagging:

Leo Breiman. 1996. Bagging predictors. Mach. Learn. 24, 2 (August 1996), 123-140. DOI=10.1023/A:1018054314350 http://dx.doi.org/10.1023/A:1018054314350

See also

  • Using cross validation iterators recipe in Chapter 7, Machine Learning 2
  • Building Decision Trees to solve Multi-Class Problems recipe in Chapter 6, Machine Learning 1