Learning regression with L1 shrinkage – LASSO

The least absolute shrinkage and selection operator (LASSO) is another shrinkage method popularly used with regression problems. Compared with ridge, LASSO leads to sparse solutions: a solution is called sparse if most of the coefficients are reduced to zero, and LASSO does exactly that to many of them. In the case of correlated variables, LASSO selects only one of them, whereas ridge assigns equal weights to the coefficients of both variables. This property of LASSO can therefore be leveraged for variable selection, and that is what we will do in this recipe.

Let's look at the cost function of LASSO regression. If you followed through the previous two recipes, you can quickly identify the difference:

$$\text{Cost}(W) = \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij} w_j\Big)^2 + \alpha \sum_{j=1}^{p} \lvert w_j \rvert$$

The coefficients are penalized by the sum of their absolute values, that is, the L1 norm. Once again, alpha controls the level of penalization. Let's try to understand the intuition behind why L1 shrinkage leads to a sparse solution.

We can rewrite the preceding equation as an ordinary least squares cost function minimized subject to a constraint (a budget) on the coefficients, as follows:

Minimize:

$$\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij} w_j\Big)^2$$

Subject to the constraint:

$$\sum_{j=1}^{p} \lvert w_j \rvert \le \eta$$

With this equation in mind, let's plot the cost function values in the coefficient space for two coefficients, w0 and w1:

[Figure: Contours of the least squares cost function and the diamond-shaped L1 constraint region in the (w0, w1) coefficient space]

The blue lines represent the contours of the cost function (without the constraint) for different values of w0 and w1. The green region represents the constraint region, whose size is dictated by the eta value. The optimum, where the two regions meet, lies at a corner of the constraint region where w0 is set to 0. So even in this two-dimensional space the solution is made sparse, with w0 set to 0. In a higher-dimensional space, the green region becomes a rhomboid, and LASSO gives a sparse solution by driving many of the coefficients to exactly zero.
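
To get a feel for this picture, the following short sketch (not part of the recipe's code; the toy data, the grid, and the eta value of 1.0 are made up purely for illustration) plots the contours of a two-coefficient least squares cost together with the diamond-shaped L1 constraint region:

# Illustration only: least squares cost contours versus the
# diamond-shaped L1 constraint region for two coefficients.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
X = rng.randn(50, 2)
y = X.dot(np.array([2.0, 0.5])) + 0.1 * rng.randn(50)

# Evaluate the residual sum of squares on a grid of (w0, w1) values
w0, w1 = np.meshgrid(np.linspace(-1, 3, 200), np.linspace(-2, 2, 200))
W = np.column_stack([w0.ravel(), w1.ravel()])
rss = ((y[:, None] - X.dot(W.T)) ** 2).sum(axis=0).reshape(w0.shape)

plt.contour(w0, w1, rss, 20)                    # cost function contours
eta = 1.0                                       # made-up constraint budget
corners = np.array([[eta, 0], [0, eta], [-eta, 0], [0, -eta], [eta, 0]])
plt.plot(corners[:, 0], corners[:, 1], 'g-')    # |w0| + |w1| <= eta
plt.xlabel('w0')
plt.ylabel('w1')
plt.title('L1 constraint region meeting the cost contours at a corner')
plt.show()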

Getting ready

Once again, we will use the Boston dataset to demonstrate LASSO regression. The Boston dataset has 13 attributes and 506 instances. The target variable is a real number: the median value of owner-occupied homes, in thousands of dollars.

Refer to the following UCI link for more information on the Boston dataset:

https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names.

We will see how we can use LASSO for variable selection.
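
If you would like a quick look at the data before diving into the recipe, a minimal check such as the following (assuming a scikit-learn version in which load_boston is still available, as used throughout this recipe) prints the shape and the attribute names:

# Quick inspection of the Boston dataset (illustration only)
from sklearn.datasets import load_boston

data = load_boston()
print data['data'].shape       # (506, 13)
print data['feature_names']    # attribute names, for example CRIM, ZN, ...
print data['target'][:5]       # median house values in thousands of dollars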

How to do it…

We will start by loading all the necessary libraries. We will follow it up by defining our first function, get_data(). In this function, we will read the Boston dataset and return it as a predictor x and response variable y:

# Load libraries
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np

def get_data():
    """
    Return boston dataset
    as x - predictor and
    y - response variable
    """
    data = load_boston()
    x    = data['data']
    y    = data['target']
    return x,y    

In our next function, build_models, we will construct LASSO regression models for a range of alpha values on the given data. The following two functions, view_model and model_worth, are used to introspect the models that we build:

def build_models(x,y):
    """
    Build Lasso regression models
    for a range of alpha values
    """
    # Alpha values uniformly
    # spaced between 0 and 0.5
    alpha_range = np.linspace(0,0.5,200)
    model = Lasso(normalize=True)
    coefficients = []
    # Fit a model for each alpha value
    for alpha in alpha_range:
        model.set_params(alpha=alpha)
        model.fit(x,y)
        # Track the coefficients for the plot
        coefficients.append(model.coef_)
    # Plot coefficient weights vs alpha value
    coeff_path(alpha_range,coefficients)
    # View coefficient values
    #view_model(model)

def view_model(model):
    """
    Look at model coefficients
    """
    print "\nModel coefficients"
    print "======================\n"
    for i,coef in enumerate(model.coef_):
        print "\tCoefficient %d  %0.3f"%(i+1,coef)

    print "\n\tIntercept %0.3f"%(model.intercept_)

def model_worth(true_y,predicted_y):
    """
    Evaluate the model
    """
    print "\tMean squared error = %0.2f\n"%(mean_squared_error(true_y,predicted_y))

We will define two functions, coeff_path and get_coeff, to inspect our model coefficients. The coeff_path function is invoked from build_models to plot the coefficient weights for the different alpha values. The get_coeff function is invoked from the main function:

def coeff_path(alpha_range,coefficients):
    """
    Plot coefficient weights
    against alpha values
    """
    plt.close('all')
    plt.cla()

    plt.figure(1)
    plt.xlabel("Alpha Values")
    plt.ylabel("Coefficient Weight")
    plt.title("Coefficient weights for different alpha values")
    plt.plot(alpha_range,coefficients)
    plt.axis('tight')

    plt.show()

def get_coeff(x,y,alpha):
    """
    Fit a Lasso model with the given alpha
    value and return the indices of the
    non-zero coefficients
    """
    model = Lasso(normalize=True,alpha=alpha)
    model.fit(x,y)
    coefs = model.coef_
    indices = [i for i,coef in enumerate(coefs) if abs(coef) > 0.0]
    return indices

Finally, we will write our main function, which is used to invoke all the preceding functions:

if __name__ == "__main__":

    x,y = get_data()
    # Build multiple models for different alpha values
    # and plot them
    build_models(x,y)
    print "\nPredicting using all the variables"
    full_model = LinearRegression(normalize=True)
    full_model.fit(x,y)
    predicted_y = full_model.predict(x)
    model_worth(y,predicted_y)

    print "\nModels at different alpha values\n"
    alpha_values = [0.22,0.08,0.01]
    for alpha in alpha_values:

        indices = get_coeff(x,y,alpha)
        print "\talpha = %0.2f Number of variables selected = %d"%(alpha,len(indices))
        print "\tAttributes included:", indices
        x_new = x[:,indices]
        model = LinearRegression(normalize=True)
        model.fit(x_new,y)
        predicted_y = model.predict(x_new)
        model_worth(y,predicted_y)

How it works…

Let's start with the main module and follow the code. We will load the predictor x and response variable y using the get_data function. The function invokes scikit-learn's convenient load_boston() function to retrieve the Boston house pricing dataset as NumPy arrays.

We will proceed by calling build_models. In build_models, we will construct multiple models for the different values of alpha:

alpha_range = np.linspace(0,0.5,200)
model = Lasso(normalize=True)
coefficients = []
# Fit a model for each alpha value
for alpha in alpha_range:
    model.set_params(alpha=alpha)
    model.fit(x,y)
    # Track the coefficients for the plot
    coefficients.append(model.coef_)

As you can see, in the for loop, we also store the coefficient values for different values of alpha in a list.

Let's plot the coefficient values for different alpha values by calling the coeff_path function:

plt.close('all')
plt.cla()

plt.figure(1)
plt.xlabel("Alpha Values")
plt.ylabel("Coefficient Weight")
plt.title("Coefficient weights for different alpha values")
plt.plot(alpha_range,coefficients)
plt.axis('tight')
plt.show()

On the x axis we have the alpha values, and on the y axis we plot all the coefficient weights for a given alpha value. Let's see the output plot:

[Figure: Coefficient weights plotted against alpha values]

The different colored lines represent different coefficients. As you can see, as the value of alpha increases, the coefficient weights converge towards zero. From this plot, we can select a value of alpha to use.
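
Picking alpha visually from the plot works, but if you prefer a data-driven choice, scikit-learn also provides LassoCV, which selects alpha by cross-validation. The following is a minimal sketch of that alternative; it is not part of this recipe's listing:

# Illustration only: let cross-validation choose alpha instead of
# reading it off the coefficient-path plot.
from sklearn.linear_model import LassoCV

cv_model = LassoCV(normalize=True, cv=5)
cv_model.fit(x, y)
print "\tAlpha chosen by cross-validation = %0.4f"%(cv_model.alpha_)
print "\tNumber of non-zero coefficients = %d"%(np.sum(cv_model.coef_ != 0))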

For our reference, let's fit a simple linear regression model:

print "
Predicting using all the variables"
full_model = LinearRegression(normalize=True)
full_model.fit(x,y)
predicted_y = full_model.predict(x)
model_worth(y,predicted_y) 

Let's look at the mean squared error when we predict using this model built with all the variables:

[Output: mean squared error of the linear regression model built with all the variables]

Let's proceed to select variables based on LASSO:

print "
Models at different alpha values
"
alpa_values = [0.22,0.08,0.01]
for alpha in alpa_values:
indices = get_coeff(x,y,alpha) 

Based on our preceding graph, we selected 0.22, 0.08, and 0.01 as the alpha values. In the loop, we call the get_coeff method. This method fits a LASSO model with the given alpha value and returns the indices of the non-zero coefficients:

model = Lasso(normalize=True,alpha=alpha)
model.fit(x,y)
coefs = model.coef_

indices = [i for i,coef in enumerate(coefs) if abs(coef) > 0.0]

Essentially, we are selecting only those attributes that have a non-zero coefficient value; this is feature selection. Let's get back to our for loop, where we fit a linear regression model using only the selected attributes:

print "	 alpah =%0.2f Number of variables selected = %d "%(alpha,len(indices))
print "	 attributes include ", indices
x_new = x[:,indices]
model = LinearRegression(normalize=True)
model.fit(x_new,y)
predicted_y = model.predict(x_new)
model_worth(y,predicted_y)

What we want to know is how well our models predict with the reduced set of attributes, compared with the model that we built initially using all of them:

[Output: the variables selected and the mean squared error of the reduced model for each alpha value]

Look at the first pass, where the alpha value is 0.22. Only two coefficients, 5 and 12, have non-zero values. The mean squared error is 30.51, which is only about 9 more than that of the model fitted with all the variables.

Similarly, for an alpha value of 0.08, there are three non-zero coefficients, and we can see some improvement in the mean squared error. Finally, with an alpha value of 0.01, 9 out of the 13 attributes are selected and the mean squared error is very close to that of the model built with all the attributes.

As you can see, we did not have to fit the final models with all the attributes; LASSO allowed us to choose a subset of them automatically. Thus, we have seen how LASSO can be used for variable selection.
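
As a side note, scikit-learn also offers a ready-made wrapper for this pattern: the SelectFromModel meta-transformer keeps only the features whose LASSO coefficients are non-zero. A minimal sketch of that alternative (assuming a scikit-learn version that ships SelectFromModel; it is not the approach used in this recipe) looks like this:

# Illustration only: an alternative to the manual get_coeff approach,
# using SelectFromModel to keep features with non-zero Lasso coefficients.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

selector = SelectFromModel(Lasso(normalize=True, alpha=0.08), threshold=1e-10)
x_reduced = selector.fit_transform(x, y)
print "\tNumber of variables selected =", x_reduced.shape[1]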

There's more…

By keeping only the most important variables, LASSO helps avoid overfitting. However, as you can see, the mean squared error values are not as good: there is some loss in predictive power because of LASSO.

As said before, in the case of correlated variables, LASSO selects only one of them, whereas ridge assigns equal weights to the coefficients of both variables. Hence, ridge has higher predictive power compared with LASSO. However, LASSO can perform variable selection, which ridge cannot do.
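
To see this behaviour in isolation, the following minimal sketch (synthetic data, made up purely for illustration) fits LASSO and ridge on two almost identical predictors; LASSO typically drives one of the two coefficients to zero, while ridge shares the weight between them:

# Illustration only: Lasso tends to keep one of two highly correlated
# predictors, while Ridge spreads the weight across both.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
x1 = rng.randn(100)
x2 = x1 + 0.01 * rng.randn(100)          # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.randn(100)

print "Lasso coefficients:", Lasso(alpha=0.1).fit(X, y).coef_
print "Ridge coefficients:", Ridge(alpha=0.1).fit(X, y).coef_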

Note

Refer to the book Statistical Learning with Sparsity: The Lasso and Generalizations by Trevor Hastie et al. for more information about the LASSO and ridge methodologies.

See also

  • Scaling the data recipe in Chapter 3, Data Analysis – Explore and Wrangle
  • Standardizing the data recipe in Chapter 3, Data Analysis – Explore and Wrangle
  • Preparing data for model building recipe in Chapter 6, Machine Learning I
  • Regression with L2 Shrinkage – Ridge recipe in Chapter 7, Machine Learning II