Gradient boosting classifier

Gradient boosting is one of the competition-winning algorithms. It works on the principle of boosting: weak learners (typically decision trees) are combined into an ensemble iteratively, with each new learner focusing on the observations that previous learners found difficult to predict. Like other boosting methods, it builds the model in a stage-wise fashion, but it generalizes them by allowing the optimization of an arbitrary differentiable loss function.

Since the working principle of gradient boosting challenges many data scientists, let's start understanding it with a simple example:

  1. Initially, we fit a model on the observations; it produces 75% accuracy and the remaining unexplained variance is captured in the error term:
     Y = F1(x) + error
  2. Then we fit another model on the error term to pull out the extra explanatory component and add it to the original model, which should improve the overall accuracy:
     error = F2(x) + error2
  3. Now the model provides 80% accuracy and the equation looks as follows:
     Y = F1(x) + F2(x) + error2
  4. We continue this method one more time, fitting a model on the error2 component to extract a further explanatory component:
     error2 = F3(x) + error3
  5. Now, model accuracy is further improved to 85% and the model equation looks as follows:
     Y = F1(x) + F2(x) + F3(x) + error3
  6. Here, if we use a weighted average (giving higher importance to the models that predict with greater accuracy) rather than simple addition, it will improve the results further. In fact, this is what the gradient boosting algorithm does! After incorporating the weights alpha, beta, and gamma, the name of the error changes from error3 to error4, as the two errors may not be exactly the same:
     Y = alpha * F1(x) + beta * F2(x) + gamma * F3(x) + error4
If we find better weights, we will probably get an accuracy of 90% instead of the 85% obtained with simple addition (see the code sketch after this list).
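To make steps 1 to 6 concrete, here is a minimal sketch of the residual-fitting loop using scikit-learn's DecisionTreeRegressor. The toy dataset, tree depth, and number of stages are illustrative assumptions, not values from the text:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative assumption)
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Step 1: fit an initial weak model; whatever it cannot explain is the error term
model_1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
error_1 = y - model_1.predict(X)

# Step 2: fit a second model on the error term and add it to the first
model_2 = DecisionTreeRegressor(max_depth=2).fit(X, error_1)
error_2 = error_1 - model_2.predict(X)

# Step 4: repeat once more on the remaining error
model_3 = DecisionTreeRegressor(max_depth=2).fit(X, error_2)

# Combined prediction: simple addition of the three weak learners
y_pred = model_1.predict(X) + model_2.predict(X) + model_3.predict(X)
print("MSE after 1 stage:", np.mean(error_1 ** 2))
print("MSE after 3 stages:", np.mean((y - y_pred) ** 2))
```

Each stage reduces the unexplained error; the gradient boosting algorithm automates this loop and, in effect, also chooses how much weight each stage's contribution receives.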

Gradient boosting involves three elements:

  • Loss function to be optimized: The loss function depends on the type of problem being solved. For regression problems, mean squared error is used, and for classification problems, logarithmic loss is used. In boosting, each stage optimizes the loss left unexplained by prior iterations rather than starting from scratch.

  • Weak learner to make predictions: Decision trees are used as the weak learners in gradient boosting.

  • Additive model to add weak learners to minimize the loss function: Trees are added one at a time, and the existing trees in the model are not changed. A gradient descent procedure is used to minimize the loss when adding trees; the example after this list shows how these three elements appear as parameters in scikit-learn.
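As a rough illustration, the three elements map onto constructor parameters of scikit-learn's GradientBoostingClassifier. The dataset and parameter values below are illustrative choices, not recommendations from the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GradientBoostingClassifier(
    # Loss function: logarithmic loss is the default for classification
    max_depth=3,        # depth of each decision tree, i.e. the weak learner
    n_estimators=100,   # number of trees added one at a time (the additive model)
    learning_rate=0.1,  # shrinks each tree's contribution when it is added
    random_state=42,
)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```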

The algorithm for gradient boosting, for a differentiable loss function L, consists of the following steps:

  1. Initialize the model with the optimal constant:
     F_0(x) = argmin_gamma sum_{i=1..N} L(y_i, gamma)
  2. For m = 1 to M:
    • a) For i = 1, 2, …, N compute the pseudo-residuals:
      r_im = -[dL(y_i, F(x_i)) / dF(x_i)], evaluated at F = F_{m-1}
    • b) Fit a regression tree to the targets r_im, giving terminal regions R_jm, j = 1, 2, …, J_m
    • c) For j = 1, 2, …, J_m, compute the optimal constant for each terminal region:
      gamma_jm = argmin_gamma sum_{x_i in R_jm} L(y_i, F_{m-1}(x_i) + gamma)
    • d) Update the model:
      F_m(x) = F_{m-1}(x) + sum_{j=1..J_m} gamma_jm * I(x in R_jm)
  3. Output the final model:
     F_hat(x) = F_M(x)

Step 1 initializes the optimal constant model, which is just a single terminal node used as the starting point to be tuned further in the next steps. Step (2a) calculates the residuals/errors by comparing the actual outcomes with the predicted results, followed by steps (2b) and (2c), in which the next decision tree is fitted on the error terms to bring more explanatory power to the model; step (2d) then adds this extra component to the model from the last iteration. Finally, step 3 outputs the ensemble of all the weak learners as a single strong learner.
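A bare-bones sketch of these steps for the squared-error loss might look as follows. With squared error, the pseudo-residual in step (2a) is simply y_i - F_{m-1}(x_i), and the terminal-region constants in step (2c) reduce to the leaf means the tree already fits, so steps (2b) and (2c) collapse into fitting one regression tree per stage. The data and hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared-error loss, following steps 1 to 3."""
    # Step 1: initialize with the optimal constant model (the mean minimizes squared error)
    f0 = np.mean(y)
    prediction = np.full_like(y, f0, dtype=float)
    trees = []
    for m in range(n_stages):                          # Step 2: for m = 1 to M
        residuals = y - prediction                     # 2a: negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                         # 2b/2c: regression tree on the residuals
        prediction += learning_rate * tree.predict(X)  # 2d: update the additive model
        trees.append(tree)
    return f0, trees                                   # Step 3: output the ensemble

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Illustrative usage on toy data
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)
f0, trees = gradient_boost_fit(X, y)
print("Training MSE:", np.mean((y - gradient_boost_predict(X, f0, trees)) ** 2))
```

The learning rate shrinks each tree's contribution in step (2d); library implementations also expose subsampling and other regularization options on top of this basic loop.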
