Using ridge regression to overcome linear regression's shortfalls

In this recipe, we'll learn about ridge regression. It differs from vanilla linear regression in that it introduces a regularization parameter that "shrinks" the coefficients. This is useful when the dataset has collinear factors.

Getting ready

Let's load a dataset that has a low effective rank and compare ridge regression with linear regression by way of the coefficients. If you're not familiar with rank, it's the smaller of the number of linearly independent columns and the number of linearly independent rows. One of the assumptions of linear regression is that the data matrix is of "full rank".
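
As a quick illustration (this snippet and its variable names are our own, not part of the original recipe), NumPy's matrix_rank can be used to check whether a data matrix is of full rank:

>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> X_full = rng.normal(size=(100, 3))
>>> X_deficient = X_full.copy()
>>> X_deficient[:, 2] = X_deficient[:, 0] + X_deficient[:, 1]  # third column is a linear combination
>>> np.linalg.matrix_rank(X_full)
3
>>> np.linalg.matrix_rank(X_deficient)
2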

How to do it...

First, use make_regression to create a simple dataset with three predictors, but an effective rank of 2. Effective rank means that while technically the matrix is of full rank, many of the columns have a high degree of collinearity:

>>> from sklearn.datasets import make_regression
>>> reg_data, reg_target = make_regression(n_samples=2000, 
                           n_features=3, effective_rank=2, noise=10)

Next, let's take a look at regular linear regression:

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression

>>> lr = LinearRegression()
>>> n_bootstraps = 1000
>>> len_data = len(reg_data)
>>> subsample_size = int(0.75 * len_data)
>>> subsample = lambda: np.random.choice(np.arange(0, len_data), 
                        size=subsample_size)
>>> coefs = np.ones((n_bootstraps, 3))

>>> for i in range(n_bootstraps):
       subsample_idx = subsample()
       subsample_X = reg_data[subsample_idx]
       subsample_y = reg_target[subsample_idx]
       lr.fit(subsample_X, subsample_y)
       coefs[i][0] = lr.coef_[0]
       coefs[i][1] = lr.coef_[1]
       coefs[i][2] = lr.coef_[2]

The following is the output that gets generated:

[Figure: distributions of the bootstrapped linear regression coefficients]

Follow the same procedure with Ridge, and have a look at the output:

>>> from sklearn.linear_model import Ridge

>>> r = Ridge()
>>> n_bootstraps = 1000
>>> len_data = len(reg_data)
>>> subsample_size = int(0.75 * len_data)
>>> subsample = lambda: np.random.choice(np.arange(0, len_data), 
                        size=subsample_size)
>>> coefs_r = np.ones((n_bootstraps, 3))

>>> for i in range(n_bootstraps):
       # carry out the same procedure as above, but fit Ridge instead
       subsample_idx = subsample()
       subsample_X = reg_data[subsample_idx]
       subsample_y = reg_target[subsample_idx]
       r.fit(subsample_X, subsample_y)
       coefs_r[i] = r.coef_

The following is the output that gets generated:

[Figure: distributions of the bootstrapped ridge regression coefficients]
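
The coefficient distributions in the two preceding figures can be reproduced with a short matplotlib sketch along these lines (the plotting code is our own and wasn't shown in the original recipe):

>>> import matplotlib.pyplot as plt

>>> fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
>>> for j in range(3):
       axes[0, j].hist(coefs[:, j], bins=50)
       axes[0, j].set_title("LinearRegression coef {}".format(j))
       axes[1, j].hist(coefs_r[:, j], bins=50)
       axes[1, j].set_title("Ridge coef {}".format(j))
>>> plt.tight_layout()
>>> plt.show()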

Don't let the similar width of the plots fool you; the coefficients for ridge regression are much closer to 0. Let's look at the average difference between the two sets of coefficients:

>>> np.mean(coefs - coefs_r, axis=0)
#coefs_r stores the ridge regression coefficients
array([ 22.19529525,  49.54961002,   8.27708536])

So, on average, the coefficients for linear regression are much higher than the ridge regression coefficients. This difference is the bias in the coefficients (forgetting, for a second, the potential bias of the linear regression coefficients). So then, what is the advantage of ridge regression? Well, let's look at the variance of our coefficients:

>>> np.var(coefs, axis=0)
array([ 184.50845658,  150.16268077,  263.39096391])

>>> np.var(coefs_r, axis=0)
array([ 21.35161646,  23.95273241,  17.34020101])

The variance has been dramatically reduced. This is the bias-variance trade-off that is so often discussed in machine learning. The next recipe will introduce how to tune the regularization parameter in ridge regression, which is at the heart of this trade-off.
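
As a small preview of the role alpha plays (this sketch is our own, not part of the original text), fitting Ridge with a few values of alpha on the same data shows that larger values shrink the coefficients further towards 0:

>>> for alpha in [0.1, 1.0, 10.0]:
       r_alpha = Ridge(alpha=alpha)
       r_alpha.fit(reg_data, reg_target)
       print(alpha, r_alpha.coef_)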

How it works...

Speaking of the regularization parameter, let's go through how ridge regression differs from linear regression. As was already shown, linear regression finds the vector of betas that minimizes $\|y - X\beta\|_2^2$.

Ridge regression instead finds the vector of betas that minimizes $\|y - X\beta\|_2^2 + \|\Gamma\beta\|_2^2$.

Here, $\Gamma$ is typically $\alpha I$, that is, some scalar times the identity matrix. We actually used the default alpha when initializing ridge regression.

Now that we've created the object, we can look at its attributes:

>>> r #notice the alpha parameter
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None, 
      normalize=False, solver='auto', tol=0.001)

This minimization has the following closed-form solution:

$\hat{\beta} = (X^{T}X + \Gamma^{T}\Gamma)^{-1}X^{T}y$

The previous solution is the same as the one for linear regression, except for the $\Gamma^{T}\Gamma$ term. For any matrix $A$, the matrix $A^{T}A$ is symmetric and positive semidefinite. So, thinking about the translation from scalar algebra to matrix algebra, we effectively divide by a larger number (multiplication by an inverse is analogous to division), and this is what squeezes the coefficients towards 0. This is a bit of a crude explanation; for a deeper understanding, you should look at the connections between the SVD and ridge regression.
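
As a sanity check (a minimal sketch of our own, assuming $\Gamma^{T}\Gamma = \alpha I$ and no intercept so that the two formulations match exactly), the closed-form solution can be computed directly and compared against scikit-learn's Ridge:

>>> alpha = 1.0
>>> X, y = reg_data, reg_target
>>> # solve (X^T X + alpha * I) beta = X^T y directly
>>> beta_manual = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
>>> r_check = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
>>> np.allclose(beta_manual, r_check.coef_)
True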
