Bagging, also called bootstrap aggregating, comes in a few flavors, defined by the way they draw random subsets from the training data. Most commonly, bagging refers to drawing samples with replacement. Because each drawn point is put back before the next draw, the generated datasets can contain duplicates. It also means that data points may be excluded from a particular generated dataset, even if this generated set is the same size as the original. Each of the generated datasets will be different, and this is a way to create diversity among the models in an ensemble. The probability that a given data point is not selected in a bootstrap sample is (1 - 1/n)^n, which approaches 1/e, or roughly 0.368, as n grows. Here, n is both the size of the original dataset and the number of draws in the bootstrap sample.
Each bootstrap sample results in a different hypothesis. The class is predicted either by averaging the models or by choosing the class predicted by the majority of models. Consider an ensemble of linear classifiers. If we use majority voting to determine the predicted class, we create a piecewise linear decision boundary. If we transform the votes to probabilities, then we partition the instance space into segments that can each potentially have a different score.
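As a quick numerical check of the exclusion probability discussed above, the following is a minimal sketch. It assumes each bootstrap sample has the same size as the original dataset. The function names are ours, chosen for illustration, and are not part of any library:

```python
import numpy as np


def exclusion_probability(n):
    """Probability that one given point is missed by n draws with replacement from n points."""
    return (1.0 - 1.0 / n) ** n


def simulated_exclusion_fraction(n, n_trials=200, seed=0):
    """Empirical average fraction of points left out of a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_trials):
        sample = rng.integers(0, n, size=n)  # indices drawn with replacement
        n_unique = np.unique(sample).size    # distinct points that made it into the sample
        fractions.append(1.0 - n_unique / n)
    return float(np.mean(fractions))


for n in (10, 100, 1000):
    print(n, exclusion_probability(n), simulated_exclusion_fraction(n))
print("1/e =", float(np.exp(-1)))
```

For larger datasets, both the formula and the simulation should settle close to 1/e, meaning that a little over a third of the data is typically left out of any one bootstrap sample.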
It is also possible, and sometimes desirable, to use random subsets of features; this is called subspace sampling. Bagging estimators work best with complex models such as fully developed decision trees because they can help reduce overfitting. They provide a simple, out-of-the-box way to improve a single model.
Scikit-learn implements BaggingClassifier and BaggingRegressor objects. Here are some of their most important parameters:
As an example, the following snippet instantiates a bagging classifier comprising 50 decision tree base estimators, each built on a random subset of half the features and half the samples:
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import datasets

    # 50 decision trees, each trained on half the samples and half the features
    bcls = BaggingClassifier(DecisionTreeClassifier(), max_samples=0.5,
                             max_features=0.5, n_estimators=50)
    X, y = datasets.make_blobs(n_samples=8000, centers=2, random_state=0,
                               cluster_std=4)
    bcls.fit(X, y)
    # Note that this is the accuracy on the training data itself
    print(bcls.score(X, y))
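Because each bootstrap sample leaves some points out, those points can serve as a built-in validation set. The sketch below is one way to use this, not the only one. It assumes the `oob_score` option of `BaggingClassifier` and compares the out-of-bag estimate with the training score and with a held-out test score. The smaller dataset and the train/test split are our own choices for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=2000, centers=2, random_state=0, cluster_std=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.5,
    bootstrap=True,   # draw samples with replacement
    oob_score=True,   # score each point using only trees that never saw it
    random_state=0,
)
bag.fit(X_train, y_train)

train_score = bag.score(X_train, y_train)
oob_score = bag.oob_score_
test_score = bag.score(X_test, y_test)
print("train:", train_score)
print("out-of-bag:", oob_score)
print("test:", test_score)
```

The training score is usually optimistic, whereas the out-of-bag estimate tends to track the held-out score more closely.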
Tree-based models are particularly well suited to ensembles, primarily because they can be sensitive to changes in the training data. Tree models can be very effective when used with subspace sampling: the resulting models are more diverse and, since each model in the ensemble works on only a subset of the features, training time is reduced. An ensemble in which each tree is built using a different random subset of the features is called a random forest.
A random forest partitions an instance space by finding the intersection of the partitions in the individual trees in the forest. It defines a partition that can be finer, that is, can capture more detail, than a partition created by any individual tree in the forest. In principle, a random forest can be mapped back to an individual tree, since each intersection corresponds to combining the branches of two different trees. The random forest can be thought of as essentially an alternative training algorithm for tree-based models. Similarly, a bagging ensemble of linear classifiers is able to learn a complicated decision boundary that would be impossible for a single linear classifier to learn.
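To make the point about linear classifiers concrete, here is a small sketch with three hand-picked linear classifiers. The weights and test points are hypothetical values chosen for illustration, not models trained by bagging. The majority vote labels three points positive and a point inside their triangle negative, which no single linear classifier can do:

```python
import numpy as np

# Three hypothetical linear classifiers; each votes 1 when w . x + b > 0.
weights = np.array([[1.0, 0.0],     # votes 1 when x > 0
                    [0.0, 1.0],     # votes 1 when y > 0
                    [-1.0, -1.0]])  # votes 1 when x + y < 1
biases = np.array([0.0, 0.0, 1.0])


def individual_votes(points):
    """Return an (n_points, 3) array of 0/1 votes, one column per classifier."""
    return (points @ weights.T + biases > 0).astype(int)


def majority_vote(points):
    """Predict 1 where at least two of the three classifiers vote 1."""
    return (individual_votes(points).sum(axis=1) >= 2).astype(int)


def vote_fraction(points):
    """Fraction of classifiers voting 1, a simple score for each segment."""
    return individual_votes(points).mean(axis=1)


points = np.array([[-1.0, 0.5],    # positive
                   [0.5, -1.0],    # positive
                   [2.0, 2.0],     # positive
                   [-0.1, -0.1],   # negative, yet inside the triangle above
                   [0.2, 0.2]])    # positive, all three classifiers agree
print(majority_vote(points))       # [1 1 1 0 1]
print(vote_fraction(points))
```

The vote fractions also illustrate the segmentation described earlier: each region of the plane bounded by the three lines receives its own score.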
The sklearn.ensemble module has two algorithms based on decision trees: random forests and extremely randomized trees. They both create diverse classifiers by introducing randomness into their construction, and both include classes for classification and regression. With the RandomForestClassifier and RandomForestRegressor classes, each tree is built using bootstrap samples. The split chosen by the model is not the best split among all features, but is chosen from a random subset of features.
The extra trees method, as with random forests, uses a random subset of features, but instead of using the most discriminative thresholds, the best of a randomly generated set of thresholds is used. This acts to reduce variance at the expense of a small increase in bias. The two classes are ExtraTreesClassifier and ExtraTreesRegressor.
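The following sketch fits both kinds of forest on the same synthetic data and prints a cross-validated accuracy for each. The dataset, the number of trees, and the `max_features="sqrt"` setting are our own assumptions. It also prints each model's `bootstrap` setting, since in the scikit-learn versions we are aware of, random forests bootstrap by default whereas extra trees use the whole training set unless told otherwise:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

models = {
    "random forest": RandomForestClassifier(
        n_estimators=25, max_features="sqrt", random_state=0),
    "extra trees": ExtraTreesClassifier(
        n_estimators=25, max_features="sqrt", random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    results[name] = scores.mean()
    print(name, "bootstrap =", model.get_params()["bootstrap"],
          "mean CV accuracy =", results[name])
```

Which of the two performs better depends on the data, so it is worth trying both.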
Let's take a look at an example of the random forest classifier and the extra trees classifier. In this example, we use VotingClassifier to combine different classifiers. The voting classifier can help balance out an individual model's weakness. We pass four weights to the function; these determine each individual model's contribution to the overall result. We can see that the two tree models overfit the training data, but also tend to perform better on the test data. We can also see that ExtraTreesClassifier achieved slightly better results on the test set than the RandomForestClassifier object, and that the VotingClassifier object performed better on the test set than all its constituent classifiers. It is worthwhile running this with different weightings, as well as on different datasets, to see how the performance of each model changes:
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # train_test_split now lives in sklearn.model_selection
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.ensemble import VotingClassifier

    def vclas(w1, w2, w3, w4):
        X, y = datasets.make_classification(n_features=10, n_informative=4,
                                            n_samples=500,
                                            n_clusters_per_class=5)
        Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.4)
        clf1 = LogisticRegression(random_state=123)
        clf2 = GaussianNB()
        clf3 = RandomForestClassifier(n_estimators=10, bootstrap=True,
                                      random_state=123)
        clf4 = ExtraTreesClassifier(n_estimators=10, bootstrap=True,
                                    random_state=123)
        clfes = [clf1, clf2, clf3, clf4]
        # Soft voting averages the predicted probabilities using the weights
        eclf = VotingClassifier(estimators=[('lr', clf1), ('gnb', clf2),
                                            ('rf', clf3), ('et', clf4)],
                                voting='soft', weights=[w1, w2, w3, w4])
        for c in (clf1, clf2, clf3, clf4, eclf):
            c.fit(Xtrain, ytrain)
        N = 5
        ind = np.arange(N)
        width = 0.3
        fig, ax = plt.subplots()
        # One pair of bars (training, test) per individual classifier
        for i, clf in enumerate(clfes):
            print(clf, i)
            p1 = ax.bar(i, clf.score(Xtrain, ytrain), width=width,
                        color="black")
            p2 = ax.bar(i + width, clf.score(Xtest, ytest), width=width,
                        color="grey")
        # Final pair of bars for the voting classifier
        ax.bar(len(clfes) + width, eclf.score(Xtrain, ytrain), width=width,
               color="black")
        ax.bar(len(clfes) + width * 2, eclf.score(Xtest, ytest), width=width,
               color="grey")
        plt.axvline(3.8, color='k', linestyle='dashed')
        ax.set_xticks(ind + width)
        ax.set_xticklabels(['LogisticRegression', 'GaussianNB',
                            'RandomForestClassifier', 'ExtraTrees',
                            'VotingClassifier'], rotation=40, ha='right')
        plt.title('Training and test score for different classifiers')
        plt.legend([p1[0], p2[0]], ['training', 'test'], loc='lower left')
        plt.show()

    vclas(1, 3, 5, 4)
You will observe the following output:
Tree models allow us to assess the relative rank of features in terms of the expected fraction of samples they contribute to. Here, we use one to evaluate the importance of each feature in a classification task. A feature's relative importance is based on where it is represented in the tree. Features at the top of a tree contribute to the final decision of a larger proportion of input samples.
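Before moving to images, a small synthetic sketch can show what these importances look like. The dataset and parameter values below are our own choices for illustration. Passing `shuffle=False` to `make_classification` keeps the informative features in the first columns, so the ranking is easy to read:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Three informative features in columns 0-2, followed by five noise features
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_   # one value per feature
ranking = np.argsort(importances)[::-1]     # feature indices, most important first
for idx in ranking:
    print("feature", idx, "importance", importances[idx])
```

The importances are normalized to sum to one, so each value can be read as a feature's share of the total.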
The following example uses an ExtraTreesClassifier class to map feature importance. The dataset we are using consists of 10 images of each of 40 people, which is 400 images in total. Each image has a label indicating the person's identity. In this task, each pixel is a feature; in the output, the pixel's brightness represents the feature's relative importance. The brighter the pixel, the more important the feature. Note that in this model, the brightest pixels are in the forehead region, and we should be careful how we interpret this. Since most photographs are illuminated from above the head, the apparently high importance of these pixels may be due to the fact that foreheads tend to be better illuminated, and therefore reveal more detail about an individual, rather than to any intrinsic property of a person's forehead in indicating their identity:
    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_olivetti_faces
    from sklearn.ensemble import ExtraTreesClassifier

    data = fetch_olivetti_faces()

    def importance(n_estimators=500, max_features=128, n_jobs=3,
                   random_state=0):
        # Flatten each image into a vector with one feature per pixel
        X = data.images.reshape((len(data.images), -1))
        y = data.target
        forest = ExtraTreesClassifier(n_estimators=n_estimators,
                                      max_features=max_features,
                                      n_jobs=n_jobs,
                                      random_state=random_state)
        forest.fit(X, y)
        dstring = (" cores=%d..." % n_jobs + " features=%s..." % max_features
                   + "estimators=%d..." % n_estimators
                   + "random=%d" % random_state)
        print(dstring)
        # Reshape the per-pixel importances back into the image shape
        importances = forest.feature_importances_
        importances = importances.reshape(data.images[0].shape)
        plt.matshow(importances, cmap=plt.cm.hot)
        plt.title(dstring)
        # plt.savefig('etreesImportance' + dstring + '.png')
        plt.show()

    importance()
The output for the preceding code is as follows: