Gradient boosting regression is a technique that learns from its mistakes: it sequentially fits many weak learners, with each new learner correcting the residual errors of the ensemble built so far.
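The mechanics can be sketched by hand: each round fits a small tree to the current residuals and adds a damped version of it to the running prediction. This is only a toy illustration of the idea (the learning rate, depth, and random_state are arbitrary choices here), not scikit-learn's actual implementation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(1000, 2, noise=10, random_state=0)

# Start from a constant prediction, then repeatedly fit a small
# tree to the current residuals and add a damped version of it.
prediction = np.full(len(y), y.mean())
learning_rate = 0.1
trees = []
for _ in range(100):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

# The ensemble's training error shrinks as learners are added.
print(np.mean((y - prediction) ** 2) < np.var(y))  # prints True
```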
Let's use some basic regression data and see how gradient boosting regression (henceforth, GBR) works:
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(1000, 2, noise=10)
GBR is part of the ensemble module because it's an ensemble learner. This is the name for the idea behind using many weak learners to simulate a strong learner:
>>> from sklearn.ensemble import GradientBoostingRegressor as GBR
>>> gbr = GBR()
>>> gbr.fit(X, y)
>>> gbr_preds = gbr.predict(X)
There's more to fitting a usable model, of course, but this fit-and-predict pattern should be familiar by now.
Now, let's fit a basic regression as well so that we can use it as the baseline:
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> lr.fit(X, y)
>>> lr_preds = lr.predict(X)
Now that we have a baseline, let's see how well GBR performed against linear regression.
I'll leave it as an exercise for you to plot the residuals, but to get started, do the following:
>>> gbr_residuals = y - gbr_preds
>>> lr_residuals = y - lr_preds
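Eyeballing residual plots is one lens; a single summary number such as the mean squared error is another. A minimal sketch (regenerating the data with an illustrative random_state, so the numbers will differ from the book's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(1000, 2, noise=10, random_state=0)

gbr_preds = GradientBoostingRegressor().fit(X, y).predict(X)
lr_preds = LinearRegression().fit(X, y).predict(X)

# Mean squared error of the training residuals for each model
gbr_mse = np.mean((y - gbr_preds) ** 2)
lr_mse = np.mean((y - lr_preds) ** 2)
print(gbr_mse < lr_mse)
```

Keep in mind these are training-set errors; a lower number here mostly reflects flexibility, not necessarily better generalization.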
It looks like GBR has a better fit, but it's a bit hard to tell. Let's take the central 95 percent interval of the residuals and compare:
>>> import numpy as np
>>> np.percentile(gbr_residuals, [2.5, 97.5])
array([-16.05443674,  17.53946294])
>>> np.percentile(lr_residuals, [2.5, 97.5])
array([-20.05434912,  19.80272884])
So, GBR clearly fits a bit better; we can also make several modifications to the GBR algorithm, which might improve performance. I'll show an example here, then we'll walk through the different options in the How it works... section:
>>> n_estimators = np.arange(100, 1100, 350)
>>> gbrs = [GBR(n_estimators=n_estimator) for n_estimator in n_estimators]
>>> residuals = {}
>>> for gbr in gbrs:
...     gbr.fit(X, y)
...     residuals[gbr.n_estimators] = y - gbr.predict(X)
It's a bit muddled, but hopefully it's clear that as the number of estimators increases, the training error goes down. Sadly, this isn't a panacea: first, we aren't testing against a holdout set, and second, training time grows with the number of estimators. That isn't a big deal on the dataset used here, but imagine data one or two orders of magnitude larger.
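To address the first caveat, score the model on data it hasn't seen. A minimal sketch using train_test_split (found in sklearn.model_selection in recent scikit-learn versions); the gap between training and holdout error is the thing to watch:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(1000, 2, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for n in (100, 450, 800):
    gbr = GradientBoostingRegressor(n_estimators=n).fit(X_train, y_train)
    # A training error that keeps falling while the holdout error
    # stalls is the classic sign of overfitting.
    train_mse = np.mean((y_train - gbr.predict(X_train)) ** 2)
    test_mse = np.mean((y_test - gbr.predict(X_test)) ** 2)
    print(n, round(train_mse, 1), round(test_mse, 1))
```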
The first parameter, and the one we already looked at, is n_estimators: the number of weak learners used in GBR. In general, if you can get away with more (that is, if you have enough computational power), more is probably better. The other parameters involve more nuance.
You should tune the max_depth parameter before all others. Since the individual learners are trees, max_depth controls how deep each tree can grow, and therefore how many nodes it produces. There's a subtle line between a depth that lets the trees fit the data well and one that causes overfitting.
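A simple way to choose max_depth is to sweep a few depths and compare the holdout error; this is only a sketch, with illustrative depth values and random_state:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(1000, 2, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep a few depths; shallow trees risk underfitting, deep trees
# risk memorizing noise.
results = {}
for depth in (1, 3, 5, 8):
    gbr = GradientBoostingRegressor(max_depth=depth).fit(X_train, y_train)
    results[depth] = np.mean((y_test - gbr.predict(X_test)) ** 2)

for depth, mse in results.items():
    print(depth, round(mse, 1))
```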
The loss parameter controls the loss function, which determines the error to be minimized. ls is the default value, and stands for least squares (in recent scikit-learn versions this value is spelled squared_error). Least absolute deviation, Huber loss, and quantile loss are also available.
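As a sketch of how two of the alternatives behave: Huber loss blends squared and absolute error, making the fit less sensitive to outliers, while quantile loss fits a conditional quantile rather than the mean. The alpha parameter configures each loss differently; the specific values below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(1000, 2, noise=10, random_state=0)

# Huber loss: robust blend of squared and absolute error.
huber = GradientBoostingRegressor(loss='huber', alpha=0.9).fit(X, y)

# Quantile loss: here, an estimate of the 90th percentile of y
# given X rather than its conditional mean.
q90 = GradientBoostingRegressor(loss='quantile', alpha=0.9).fit(X, y)

# The 90th-percentile model should predict above the mean-like
# Huber model most of the time.
print(np.mean(q90.predict(X) > huber.predict(X)))
```

Quantile regression like this is handy for building prediction intervals: fit one model each for a low and a high quantile.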