The AdaBoost algorithm

AdaBoost was the first boosting algorithm to iteratively adapt to the cumulative learning progress when fitting an additional ensemble member. In particular, AdaBoost changed the weights on the training data to reflect the cumulative errors of the current ensemble on the training set before fitting a new weak learner. AdaBoost was the most accurate classification algorithm at the time, and Leo Breiman referred to it as the best off-the-shelf classifier in the world at the 1996 NIPS conference.

The algorithm had a very significant impact on ML because it provided theoretical performance guarantees. These guarantees only require sufficient data and a weak learner that reliably predicts just better than a random guess. As a result of this staged, adaptive approach, developing an accurate ML model no longer required accurate performance over the entire feature space. Instead, the design of a model could focus on finding weak learners that merely outperformed a coin flip.

AdaBoost is a significant departure from bagging, which builds ensembles of very deep trees to keep bias low and relies on averaging to reduce variance. AdaBoost, in contrast, grows shallow trees as weak learners, often producing superior accuracy with stumps, that is, trees formed by a single split. The algorithm starts with an equal-weighted training set and then successively alters the sample distribution. After each iteration, AdaBoost increases the weights of incorrectly classified observations and reduces the weights of correctly predicted samples so that subsequent weak learners focus more on particularly difficult cases. Once trained, the new decision tree is incorporated into the ensemble with a weight that reflects its contribution to reducing the training error.
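
To make the stump-based setup concrete, the following scikit-learn sketch fits an AdaBoost ensemble of single-split trees. The synthetic dataset and the hyperparameter values are illustrative assumptions; note that the weak-learner keyword is estimator in scikit-learn 1.2 and later (base_estimator in earlier versions).

    # Hypothetical example: AdaBoost with decision stumps in scikit-learn.
    # The synthetic dataset and hyperparameter values are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    # Single-split trees (stumps) as weak learners; the keyword is `estimator`
    # in scikit-learn >= 1.2 and `base_estimator` in earlier versions.
    ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=200,
                             random_state=42)

    print(f'Mean CV accuracy: {cross_val_score(ada, X, y, cv=5).mean():.3f}')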

The AdaBoost algorithm for an ensemble of base learners, hm(x), m=1, ..., M, that predict discrete classes, y ∈ {-1, 1}, and N training observations can be summarized as follows:

  1. Initialize sample weights wi=1/N for observations i=1, ..., N.
  2. For each base classifier hm, m=1, ..., M, do the following:
    1. Fit hm(x) to the training data, weighted by wi.
    2. Compute the base learner's weighted error rate εm on the training set.
    3. Compute the base learner's ensemble weight αm as a function of its error rate, as shown in the following formula:

       αm = log[(1 − εm) / εm]

    4. Update the weights for misclassified samples according to wi ← wi × exp(αm).
  3. Predict the positive class when the weighted sum of the ensemble members is positive, and negative otherwise, as shown in the following formula:

     H(x) = sign(∑m αm hm(x))
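
The following NumPy sketch mirrors these steps with scikit-learn stumps as weak learners; the function names (adaboost_fit, adaboost_predict) and the explicit renormalization of the weights are illustrative assumptions rather than part of the algorithm statement above.

    # A from-scratch sketch of the steps above; function names and the explicit
    # weight renormalization are illustrative assumptions.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, M=100):
        """Fit M stumps to labels y in {-1, +1}; return learners, alphas, and final sample weights."""
        N = len(y)
        w = np.full(N, 1 / N)                     # step 1: equal sample weights
        learners, alphas = [], []
        for m in range(M):                        # step 2: fit M weak learners
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)      # 2.1: weighted fit
            miss = stump.predict(X) != y
            eps = w[miss].sum() / w.sum()         # 2.2: weighted error rate
            if eps == 0 or eps >= 0.5:            # weak-learner condition violated (or perfect fit)
                break
            alpha = np.log((1 - eps) / eps)       # 2.3: ensemble weight
            w[miss] *= np.exp(alpha)              # 2.4: boost misclassified samples
            w /= w.sum()                          # keep weights a probability distribution
            learners.append(stump)
            alphas.append(alpha)
        return learners, np.array(alphas), w

    def adaboost_predict(learners, alphas, X):
        """Step 3: sign of the alpha-weighted sum of base-learner predictions."""
        return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, learners)))

Stopping once the weighted error reaches 0.5 enforces the weak-learner requirement of beating a coin flip, and renormalizing the weights after each round keeps them comparable across iterations without changing the relative emphasis on misclassified samples.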

AdaBoost has many practical advantages, including ease of implementation and fast computation, and it can be combined with any method for identifying weak learners. Apart from the size of the ensemble, there are no hyperparameters that require tuning. AdaBoost is also useful for identifying outliers because the samples that receive the highest weights are those that are consistently misclassified and inherently ambiguous, which is also typical for outliers. 
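
As an illustration of the outlier-identification idea, the sketch below uses staged_predict as a proxy, because scikit-learn's AdaBoostClassifier does not expose the final sample weights: samples misclassified in many boosting rounds are the persistently difficult, potentially outlying cases the text refers to. The dataset and the cutoff of ten samples are assumptions for illustration.

    # Hypothetical illustration: flag persistently misclassified samples.
    # staged_predict is used as a proxy because the final sample weights are
    # not exposed by scikit-learn's AdaBoostClassifier.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.05,
                               random_state=42)
    ada = AdaBoostClassifier(n_estimators=200, random_state=42).fit(X, y)

    # Count how often each training sample is misclassified across boosting stages.
    miss_counts = np.zeros(len(y), dtype=int)
    for stage_pred in ada.staged_predict(X):
        miss_counts += (stage_pred != y)

    hardest = np.argsort(miss_counts)[-10:]   # ten most persistently misclassified samples
    print(hardest, miss_counts[hardest])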

On the other hand, the performance of AdaBoost on a given dataset depends on the ability of the weak learner to adequately capture the relationship between features and outcome. As the theory suggests, boosting will not perform well when there is insufficient data, or when the complexity of the ensemble members is not a good match for the complexity of the data. It can also be susceptible to noise in the data.
