We're going to work with some ideas similar to those we saw in the recipe on Lasso Regression. In that recipe, we looked at the number of features that had zero coefficients.
Now we're going to take this a step further and use the sparsity associated with L1 norms to preprocess the features.
We'll use the diabetes dataset to fit a regression. First, we'll fit a basic LinearRegression model with ShuffleSplit cross-validation. After that, we'll use LassoCV to find the coefficients that are 0 when using an L1 penalty. This will hopefully help us avoid overfitting, which means that the model is too specific to the data it was trained on. To put this another way, an overfit model does not generalize well to outside data.
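To make the link between the L1 penalty and zero coefficients concrete, here is a minimal sketch that fits a plain Lasso on the diabetes data at a few penalty strengths and counts how many coefficients end up exactly zero. The grid of alpha values and the max_iter setting are assumptions chosen only for illustration; they do not come from the recipe.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Hypothetical grid of penalty strengths, chosen only for illustration.
alphas = [0.001, 0.1, 1.0, 100.0]

zero_counts = {}
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    # Count the coefficients that the L1 penalty has driven to exactly zero.
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))

for alpha, n_zero in zero_counts.items():
    print(f"alpha={alpha:g}: {n_zero} of {X.shape[1]} coefficients are zero")
```

A stronger penalty generally removes more features, and a sufficiently large one removes all of them, which is why the recipe lets LassoCV choose alpha by cross-validation instead of fixing it by hand.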
We're going to perform the following steps. First, let's load the dataset:
>>> import sklearn.datasets as ds
>>> diabetes = ds.load_diabetes()
Let's create the LinearRegression object:
>>> from sklearn import linear_model
>>> lr = linear_model.LinearRegression()
Let's also import the metrics module for the mean_squared_error function and the cross_validation module for the ShuffleSplit cross-validation scheme:
>>> from sklearn import metrics
>>> from sklearn import cross_validation
>>> shuff = cross_validation.ShuffleSplit(diabetes.target.size)
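Now, let's fit the model and keep track of the mean squared error for each iteration of ShuffleSplit. Because the splits are random, the exact value will differ from run to run:
>>> import numpy as np
>>> mses = []
>>> for train, test in shuff:
...     train_X = diabetes.data[train]
...     train_y = diabetes.target[train]
...     # Use the held-out indices in `test`; `~train` would be a bitwise NOT
...     # of the integer indices, not the complement of the training set.
...     test_X = diabetes.data[test]
...     test_y = diabetes.target[test]
...     lr.fit(train_X, train_y)
...     mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
...
>>> np.mean(mses)
2856.366626198198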
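The cross_validation module used in this recipe was deprecated in scikit-learn 0.18 and removed in 0.20. On a recent release, the same baseline loop can be sketched with sklearn.model_selection as follows. The n_splits, test_size, and random_state values are assumptions made for illustration (the first two mirror the old defaults), so the resulting number will not match the one printed above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

diabetes = load_diabetes()
lr = LinearRegression()

# n_splits, test_size and random_state are illustrative choices,
# not values taken from the recipe.
shuff = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

mses = []
# In the newer API the splitter is not iterable itself; call split() on the data.
for train, test in shuff.split(diabetes.data):
    lr.fit(diabetes.data[train], diabetes.target[train])
    predictions = lr.predict(diabetes.data[test])
    mses.append(mean_squared_error(diabetes.target[test], predictions))

mean_mse = np.mean(mses)
print(mean_mse)
```

Here train and test are integer index arrays, so the held-out rows are selected with test directly.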
So now that we have the regular fit, let's check it after we eliminate any features with a zero for the coefficient. Let's fit the Lasso Regression:
>>> from sklearn import feature_selection
>>> from sklearn import cross_validation
>>> cv = linear_model.LassoCV()
>>> cv.fit(diabetes.data, diabetes.target)
>>> cv.coef_
array([  -0.        , -226.2375274 ,  526.85738059,  314.44026013,
       -196.92164002,    1.48742026, -151.78054083,  106.52846989,
        530.58541123,   64.50588257])
The first coefficient is zero, so we'll remove the first feature. We'll use a NumPy array to represent the columns that are to be included in the model:
>>> import numpy as np
>>> columns = np.arange(diabetes.data.shape[1])[cv.coef_ != 0]
>>> columns
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
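The feature_selection module imported earlier offers another way to express this step. The following is a minimal sketch, assuming a recent scikit-learn release, that wraps LassoCV in SelectFromModel and recovers the same kind of column index array. The threshold of 1e-5 is an assumed cutoff for treating a coefficient as zero, so the result is not guaranteed to match the columns array above exactly.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X, y = load_diabetes(return_X_y=True)

# threshold=1e-5 is an assumed cutoff: features whose absolute coefficient
# falls below it are treated as zero and dropped.
selector = SelectFromModel(LassoCV(), threshold=1e-5)
selector.fit(X, y)

support = selector.get_support()    # boolean mask of the kept features
columns = np.flatnonzero(support)   # integer column indices of the kept features
X_selected = selector.transform(X)  # data restricted to the kept columns

print(columns)
print(X_selected.shape)
```

One reason to prefer this form is that the selector is itself an estimator, so it can be combined with other scikit-learn tools; the manual indexing in the recipe is just as valid for a one-off analysis.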
Okay, so now we'll fit the model with the specific features (see the columns in the following code block):
>>> l1mses = []
>>> for train, test in shuff:
...     train_X = diabetes.data[train][:, columns]
...     train_y = diabetes.target[train]
...     # Use the held-out indices in `test`; `~train` would be a bitwise NOT
...     # of the integer indices, not the complement of the training set.
...     test_X = diabetes.data[test][:, columns]
...     test_y = diabetes.target[test]
...     lr.fit(train_X, train_y)
...     l1mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
...
>>> np.mean(l1mses)
2861.0763924492171
>>> np.mean(l1mses) - np.mean(mses)
4.7097662510191185
As we can see, even though we removed an uninformative feature, the model fits slightly worse. This isn't always the case. In the next section, we'll compare the fits of models where there are many uninformative features.
First, we're going to create a regression dataset with many uninformative features:
>>> X, y = ds.make_regression(noise=5)
Let's fit a normal regression:
>>> mses = []
>>> shuff = cross_validation.ShuffleSplit(y.size)
>>> for train, test in shuff:
...     train_X = X[train]
...     train_y = y[train]
...     # Use the held-out indices in `test` rather than `~train`.
...     test_X = X[test]
...     test_y = y[test]
...     lr.fit(train_X, train_y)
...     mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
...
>>> np.mean(mses)
879.75447864034209
Now, we can walk through the same process for Lasso regression:
>>> cv.fit(X, y)
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
        max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
        precompute='auto', tol=0.0001, verbose=False)
We'll create the columns again. This is a nice pattern that will allow us to specify the features we want to include:
>>> import numpy as np
>>> columns = np.arange(X.shape[1])[cv.coef_ != 0]
>>> columns[:5]
array([11, 15, 17, 20, 21])
>>> mses = []
>>> shuff = cross_validation.ShuffleSplit(y.size)
>>> for train, test in shuff:
...     train_X = X[train][:, columns]
...     train_y = y[train]
...     # Use the held-out indices in `test` rather than `~train`.
...     test_X = X[test][:, columns]
...     test_y = y[test]
...     lr.fit(train_X, train_y)
...     mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
...
>>> np.mean(mses)
15.755403220117708
As we can see, we get an extreme improvement in the fit of the model. This exemplifies that we need to be cognizant that not all the features need to be, or should be, thrown into the model.
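For readers on a recent scikit-learn release, the whole comparison can be sketched in one repeatable script. This is only an illustration under stated assumptions: the random_state values, the split settings, and the mean_cv_mse helper are hypothetical and not part of the recipe, so the printed numbers will differ from those above. As in the text, the Lasso selects columns on the full dataset before cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# random_state values are illustrative; they only make the run repeatable.
X, y = make_regression(noise=5, random_state=0)
shuff = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)


def mean_cv_mse(features, target):
    """Average held-out MSE of a plain linear regression over the splits."""
    scores = cross_val_score(
        LinearRegression(), features, target,
        cv=shuff, scoring="neg_mean_squared_error",
    )
    # scikit-learn reports negated MSE so that larger is always better.
    return -scores.mean()


# Keep only the columns to which the Lasso assigns a nonzero coefficient.
columns = np.flatnonzero(LassoCV().fit(X, y).coef_ != 0)

full_mse = mean_cv_mse(X, y)
selected_mse = mean_cv_mse(X[:, columns], y)
print(f"all {X.shape[1]} features: {full_mse:.1f}")
print(f"{columns.size} selected features: {selected_mse:.1f}")
```

By default make_regression generates 100 samples with 100 features, of which only 10 are informative, which is why dropping the zero-coefficient columns helps so much here.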