Feature selection on L1 norms

We're going to work with some ideas similar to those we saw in the recipe on Lasso Regression. In that recipe, we looked at the number of features that had zero coefficients.

Now we're going to take this a step further and use the sparseness associated with L1 norms to preprocess the features.
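To see why an L1 penalty produces exact zeros, it helps to look at the soft-thresholding operator, which is the closed-form lasso solution for a single coefficient when the features are orthonormal. The following is a minimal sketch in plain Python, not scikit-learn's implementation; the function name soft_threshold and the sample coefficients are made up for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within penalty of zero become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

# Hypothetical least-squares coefficients for four features.
ols_coefs = [3.0, -0.4, 0.1, -2.5]

# Apply an L1 penalty of 0.5 to each coefficient.
lasso_coefs = [soft_threshold(c, 0.5) for c in ols_coefs]
print(lasso_coefs)  # [2.5, 0.0, 0.0, -2.0]
```

The two small coefficients are set to exactly zero rather than merely shrunk, and those are the features this recipe drops.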

Getting ready

We'll use the diabetes dataset to fit a regression. First, we'll fit a basic LinearRegression model with ShuffleSplit cross-validation. After we do that, we'll use Lasso regression (via LassoCV) to find the coefficients that are 0 under an L1 penalty. Removing those features will hopefully help us avoid overfitting, which means that the model is too specific to the data it was trained on. To put this another way, an overfit model does not generalize well to outside data.

We're going to perform the following steps:

  1. Load the dataset.
  2. Fit a basic linear regression model.
  3. Use feature selection to remove uninformative features.
  4. Refit the linear regression and check to see how well it fits compared with the fully featured model.

How to do it...

First, let's get the dataset:

>>> import sklearn.datasets as ds
>>> diabetes = ds.load_diabetes()

Let's create the LinearRegression object:

>>> from sklearn import linear_model
>>> lr = linear_model.LinearRegression()

Let's also import the metrics module for the mean_squared_error function and the cross_validation module for the ShuffleSplit cross-validation scheme. We'll need NumPy as well to average the errors:

>>> import numpy as np
>>> from sklearn import metrics
>>> from sklearn import cross_validation

>>> shuff = cross_validation.ShuffleSplit(diabetes.target.size)

Now, let's fit the model, and we'll keep track of the mean squared error for each iteration of ShuffleSplit:

>>> mses = []
>>> for train, test in shuff:
       train_X = diabetes.data[train]
       train_y = diabetes.target[train]

       # score on the held-out rows of this split
       test_X = diabetes.data[test]
       test_y = diabetes.target[test]

       lr.fit(train_X, train_y)

       mses.append(metrics.mean_squared_error(test_y,
                   lr.predict(test_X)))
    
>>> np.mean(mses)

2856.366626198198
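The cross_validation module used above was deprecated in scikit-learn 0.18 and removed in 0.20. On a current release, the same baseline can be sketched with sklearn.model_selection. The n_splits and test_size values below are chosen to mirror the old ShuffleSplit defaults, and random_state is an assumption added for repeatability, so the score will not match the figure printed above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

diabetes = load_diabetes()

# Ten random 90/10 splits; random_state is an assumption for repeatability.
shuff = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# scikit-learn reports negated MSE so that larger is always better.
neg_mses = cross_val_score(LinearRegression(), diabetes.data, diabetes.target,
                           scoring="neg_mean_squared_error", cv=shuff)
baseline_mse = -np.mean(neg_mses)
print(baseline_mse)
```

Here cross_val_score replaces the explicit loop: it fits on the training indices of each split and scores on the held-out indices.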

Now that we have the regular fit (your exact value will differ, because ShuffleSplit draws random splits), let's see how it changes after we eliminate any features with a zero coefficient. Let's fit the Lasso regression:

>>> # LassoCV chooses the strength of the L1 penalty by cross-validation
>>> cv = linear_model.LassoCV()
>>> cv.fit(diabetes.data, diabetes.target)
>>> cv.coef_

array([ -0. , -226.2375274 ,  526.85738059,  314.44026013,
        -196.92164002, 1.48742026, -151.78054083, 106.52846989,
        530.58541123, 64.50588257])

The first feature has a zero coefficient, so we'll remove it. We'll use a NumPy array to represent the columns that are to be included in the model:

>>> import numpy as np
>>> columns = np.arange(diabetes.data.shape[1])[cv.coef_ != 0]
>>> columns
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
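
The boolean-mask pattern above can also be written with np.flatnonzero, which returns the positions of the nonzero entries directly. Here is a small sketch with a made-up coefficient vector (the values in coef are hypothetical, not the diabetes coefficients):

```python
import numpy as np

# Hypothetical coefficient vector; the first and fourth entries are zero.
coef = np.array([0.0, -2.1, 5.3, 0.0, 1.7])

# Pattern from the recipe: mask a range of column indices.
columns_mask = np.arange(coef.size)[coef != 0]

# Equivalent shortcut: indices of the nonzero entries.
columns_flat = np.flatnonzero(coef)

print(columns_mask)  # [1 2 4]
print(columns_flat)  # [1 2 4]
```

Either form yields an integer index array that can be used to slice columns, as in data[:, columns].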

Okay, so now we'll fit the model with the specific features (see the columns in the following code block):

>>> l1mses = []

>>> for train, test in shuff:
       # keep only the columns with nonzero Lasso coefficients
       train_X = diabetes.data[train][:, columns]
       train_y = diabetes.target[train]

       test_X = diabetes.data[test][:, columns]
       test_y = diabetes.target[test]

       lr.fit(train_X, train_y)

       l1mses.append(metrics.mean_squared_error(test_y,
                     lr.predict(test_X)))

>>> np.mean(l1mses)
2861.0763924492171
>>> np.mean(l1mses) - np.mean(mses)
4.7097662510191185

As we can see, even though we removed an uninformative feature, the model fits slightly worse in this run. This isn't always the case. In the next section, we'll compare the fits of two models on data with many uninformative features.

How it works...

First, we're going to create a regression dataset with many uninformative features:

>>> X, y = ds.make_regression(noise=5)
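
With its defaults, make_regression generates 100 samples and 100 features, of which only 10 are informative. Passing coef=True also returns the ground-truth coefficients, which makes the sparsity easy to check. In this sketch the defaults are spelled out explicitly, the _demo names are illustrative, and random_state=0 is an assumption added for repeatability:

```python
import numpy as np
from sklearn.datasets import make_regression

# Same shape as the default call, but also return the true coefficients.
X_demo, y_demo, true_coef = make_regression(n_samples=100, n_features=100,
                                            n_informative=10, noise=5,
                                            coef=True, random_state=0)

print(X_demo.shape)                 # (100, 100)
print(np.count_nonzero(true_coef))  # 10
```

So 90 of the 100 columns contribute nothing to the target.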

Let's fit a normal regression:

>>> mses = []

>>> shuff = cross_validation.ShuffleSplit(y.size)

>>> for train, test in shuff:
       train_X = X[train]
       train_y = y[train]

       # score on the held-out rows of this split
       test_X = X[test]
       test_y = y[test]

       lr.fit(train_X, train_y)

       mses.append(metrics.mean_squared_error(test_y,
                   lr.predict(test_X)))
>>> np.mean(mses)

879.75447864034209

Now, we can walk through the same process for Lasso regression:

>>> cv.fit(X, y)

LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, 
        fit_intercept=True, max_iter=1000, n_alphas=100, 
        n_jobs=1, normalize=False, positive=False, precompute='auto', 
        tol=0.0001, verbose=False)

We'll create the columns again. This is a nice pattern that will allow us to specify the features we want to include:

>>> import numpy as np
>>> columns = np.arange(X.shape[1])[cv.coef_ != 0]
>>> columns[:5]
array([11, 15, 17, 20, 21])

>>> mses = []
>>> shuff = cross_validation.ShuffleSplit(y.size)

>>> for train, test in shuff:
       # keep only the columns with nonzero Lasso coefficients
       train_X = X[train][:, columns]
       train_y = y[train]

       test_X = X[test][:, columns]
       test_y = y[test]

       lr.fit(train_X, train_y)

       mses.append(metrics.mean_squared_error(test_y,
                   lr.predict(test_X)))
    
>>> np.mean(mses)

15.755403220117708

As we can see, we get an extreme improvement in the fit of the model. This shows that not every feature needs to be, or should be, thrown into the model.
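
One caveat with the approach above is that the Lasso sees every row, including the rows later used for testing, when it picks the columns. A common alternative is to wrap the selection and the regression in a pipeline, so that the columns are chosen from the training rows of each split only. The following is a sketch of that idea with current scikit-learn, not the recipe's own method; the use of SelectFromModel, the threshold, and the random_state values are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=100, n_features=100, n_informative=10,
                       noise=5, random_state=0)
shuff = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

def mean_mse(model):
    """Average held-out mean squared error over the shuffle splits."""
    scores = cross_val_score(model, X, y, cv=shuff,
                             scoring="neg_mean_squared_error")
    return -np.mean(scores)

# Keep only features whose Lasso coefficient is (numerically) nonzero,
# then fit ordinary least squares on the surviving columns.
selector = SelectFromModel(LassoCV(), threshold=1e-5)
selected = make_pipeline(selector, LinearRegression())

full_mse = mean_mse(LinearRegression())
selected_mse = mean_mse(selected)
print(full_mse, selected_mse)
```

Because the selector is refit inside every split, the held-out rows never influence which columns are kept, which gives a more honest estimate of the benefit.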
