Earlier in this book, I introduced the idea of the PAC learning model and the idea of concept classes. A related idea is that of weak learnability. Here each of the learning algorithms in the ensemble need only perform slightly better than chance. For example if each algorithm in the ensemble is correct at least 51% of the time then the criteria of weak learnability are satisfied. It turns out that the ideas of PAC and weak learnability are essentially the same except that for the latter, we drop the requirement that the algorithm must achieve arbitrarily high accuracy. However, it merely performs better than a random hypothesis. How is this useful, you may ask? It is often easier to find rough rules of thumb rather than a highly accurate prediction rule. This weak learning model may only perform slightly better than chance; however, if we boost this learner by running it many times on different weighted distributions of the data and by combining these learners, we can, hopefully, build a single prediction rule that performs much better than any of the individual weak learning rules.
Boosting is a simple and powerful idea. It extends bagging by taking into account a model's training error. For example, if we train a linear classifier and find that it misclassified a certain set of instances. If we train a subsequent model on a dataset containing duplicates of these misclassified instances, then we would expect that this newly trained model would perform better on a test set. By including duplicates of misclassified instances in the training set, we are shifting the mean of the data set towards these instances. This forces the learner to focus on the most difficult-to-classify examples. This is achieved in practice by giving misclassified instances higher weight and then modifying the model to take this in to account, for example, in a linear classifier we can calculate the class means by using weighted averages.
Starting from a dataset of uniform weights that sum to one, we run the classifier and will likely misclassify some instances. To boost the weight of these instances, we assign them half the total weight. For example, consider a classifier that gives us the following results:
Predicted positive |
Predicted negative |
Total | |
---|---|---|---|
Actual pos. |
24 |
16 |
40 |
Actual neg. |
9 |
51 |
60 |
Totals |
33 |
67 |
100 |
The error rate is ɛ = (9 + 16)/100 = 0.25.
We want to assign half the error weight to the misclassified samples, and since we started with uniform weights that sum to 1, the current weight assigned to the misclassified examples is simply the error rate. To update the weights, therefore, we multiply them by the factor 1/2ɛ. Assuming that the error rate is less than 0.5, this results in an increase in the weights of the misclassified examples. To ensure that the weights still sum to 1, we multiply the correctly classified examples by ½(1-ɛ). In this example, the error rate, the initial weights of the incorrectly classified samples, is .25 and we want it to be .5, that is, half the total weights, so we multiply this initial error rate by 2. The weights for the correctly classified instances are 1/2(1-ɛ) = 2/3. Taking these weights into account results into the following table:
Predicted positive |
Predicted negative |
Total | |
---|---|---|---|
Actual pos. |
16 |
32 |
48 |
Actual neg. |
18 |
34 |
60 |
Total |
33 |
67 |
100 |
The final piece we need is a confidence factor, α, which is applied to each model in the ensemble. This is used to make an ensemble prediction based on the weighted averages from each individual model. We want this to increase with decreasing errors. A common way to ensure this happens is to set the confidence factor to the following:
So we are given a dataset, such as following:
We then initialize an equal weighted distribution, such as the following:
Using a weak classifier, ht ,we can write an updated rule as follows:
With the normalization factor, such as the following:
Note that exp(-yi ht(xi)) is positive and greater than 1 if -yi ht(xi) is positive, and this happens if xi is misclassified. The result is that the update rule will increase the weight of a misclassified example and decrease the weight of correctly classified samples.
We can write the final classifier as follows:
One of the most popular boosting algorithms is called AdaBoost or adaptive boosting. Here, a decision tree classifier is used as the base learner and it builds a decision boundary on data that is not linearly separable:
import numpy as np import matplotlib.pyplot as plt from sklearn.ensemble import AdaBoostClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.datasets import make_blobs plot_colors = "br" plot_step = 0.02 class_names = "AB" tree= DecisionTreeClassifier() boost=AdaBoostClassifier() X,y=make_blobs(n_samples=500,centers=2, random_state=0, cluster_std=2) boost.fit(X,y) plt.figure(figsize=(10, 5)) # Plot the decision boundaries plt.subplot(121) x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)) Z = boost.predict(np.c_[xx.ravel(), yy.ravel()]) Z = Z.reshape(xx.shape) cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired) plt.axis("tight") for i, n, c in zip(range(2), class_names, plot_colors): idx = np.where(y == i) plt.scatter(X[idx, 0], X[idx, 1], c=c, cmap=plt.cm.Paired, label="Class %s" % n) plt.title('Decision Boundary') twoclass_output = boost.decision_function(X) plot_range = (twoclass_output.min(), twoclass_output.max()) plt.subplot(122) for i, n, c in zip(range(2), class_names, plot_colors): plt.hist(twoclass_output[y == i], bins=20, range=plot_range, facecolor=c, label='Class %s' % n, alpha=.5) x1, x2, y1, y2 = plt.axis() plt.axis((x1, x2, y1, y2)) plt.legend(loc='upper left') plt.ylabel('Samples') plt.xlabel('Score') plt.title('Decision Scores') plt.show() print("Mean Accuracy =%f" % boost.score(X,y))
Gradient tree boosting is a very useful algorithm for both regression and classification problems. One of its major advantages is that it naturally handles mixed data types, and it is also quite robust to outliers. Additionally, it has better predictive powers than many other algorithms; however, its sequential architecture makes it unsuitable for parallel techniques, and therefore, it does not scale well to large data sets. For datasets with a large number of classes, it is recommended to use RandomForestClassifier
instead. Gradient boosting typically uses decision trees to build a prediction model based on an ensemble of weak learners, applying an optimization algorithm on the cost function.
In the following example, we create a function that builds a gradient boosting classifier and graphs its cumulative loss versus the number of iterations. The GradientBoostingClassifier
class has an oob_improvement_
attribute and is used here calculate an estimate of the test loss on each iteration. This gives us a reduction in the loss compared to the previous iteration. This can be a very useful heuristic for determining the number of optimum iterations. Here, we plot the cumulative improvement of two gradient boosting classifiers. Each classifier is identical but for a different learning rate, .01 in the case of the dotted line and .001 for the solid line.
The learning rate shrinks the contribution of each tree, and this means that there is a tradeoff with the number of estimators. Here, we actually see that with a larger learning rate, the model appears to reach its optimum performance faster than the model with a lower learning rate. However, this models appears to achieve better results overall. What usually occurs in practice is that oob_improvement
deviates in a pessimistic way over a large number of iterations. Let's take a look at the following commands:
import numpy as np import matplotlib.pyplot as plt from sklearn import ensemble from sklearn.cross_validation import train_test_split from sklearn import datasets def gbt(params, X,y,ls): clf = ensemble.GradientBoostingClassifier(**params) clf.fit(X_train, y_train) cumsum = np.cumsum(clf.oob_improvement_) n = np.arange(params['n_estimators']) oob_best_iter = n[np.argmax(cumsum)] plt.xlabel('Iterations') plt.ylabel('Improvement') plt.axvline(x=oob_best_iter,linestyle=ls) plt.plot(n, cumsum, linestyle=ls) X,y=datasets.make_blobs(n_samples=50,centers=5, random_state=0, cluster_std=5) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=9) p1 = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5, 'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3} p2 = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5, 'learning_rate': 0.001, 'min_samples_leaf': 1, 'random_state': 3} gbt(p1, X,y, ls='--') gbt(p2, X,y, ls='-')