Least absolute shrinkage and selection operator (LASSO) is another shrinkage method popularly used with regression problems. LASSO leads to sparse solutions compared with ridge. A solution is called sparse if most of the coefficients are reduced to zero. In LASSO, a lot of the coefficients are made zero. In the case of correlated variables, LASSO selects only one of them, whereas ridge assigns equal weights to the coefficients of both the variables. This attribute of LASSO can hence be leveraged for variable selection. In this recipe, let's see how we can leverage LASSO for variable selection.
Let's look at the cost function of LASSO regression. If you followed through the previous two recipes, you can quickly identify the difference:
The coefficients are penalized by the sum of the absolute value of the coefficients. Once again, the alpha controls the level of penalization. Let's try to understand the intuition behind why L1 shrinkage leads to a sparse solution.
We can rewrite the preceding equation as an unconstrained cost function and a constraint, as follows:
Minimize:
Subject to the constraint:
With this equation in mind, let's plot the cost function values in the coefficient space for two coefficients, w0
and w1
:
The blue lines represent the contours of the cost function (without constraint) values for the different values of w0
and w1
. The green region represents the constraint shape dictated by the eta value. The optimized value where both the regions meet is when w0
is set to 0. We depicted a two-dimensional space where our solution is made sparse with w0
set to 0. In a multidimensional space, we will have a rhomboid in the green region, and LASSO will give a sparse solution by reducing many of the coefficients to zero.
Once again, we will use the Boston dataset to demonstrate LASSO regression. The Boston data has 13 attributes and 506 instances. The target variable is a real number and the median value of the houses is in the thousands.
Refer to the following UCI link for more information on the Boston dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names.
We will see how we can use LASSO for variable selection.
We will start by loading all the necessary libraries. We will follow it up by defining our first function, get_data()
. In this function, we will read the Boston dataset and return it as a predictor x
and response variable y
:
# Load libraries from sklearn.datasets import load_boston from sklearn.cross_validation import train_test_split from sklearn.linear_model import Lasso, LinearRegression from sklearn.metrics import mean_squared_error import matplotlib.pyplot as plt from sklearn.preprocessing import PolynomialFeatures import numpy as np def get_data(): """ Return boston dataset as x - predictor and y - response variable """ data = load_boston() x = data['data'] y = data['target'] return x,y
In our next build_model
function, we will construct our LASSO regression model with the given data. The following two functions, view_model
and model_worth
, are used to introspect the model that we built:
def build_models(x,y): """ Build a Lasso regression model """ # Alpha values uniformly # spaced between 0.01 and 0.02 alpha_range = np.linspace(0,0.5,200) model = Lasso(normalize=True) coeffiecients = [] # Fit a model for each alpha value for alpha in alpha_range: model.set_params(alpha=alpha) model.fit(x,y) # Track the coeffiecients for plot coeffiecients.append(model.coef_) # Plot coeffients weight decay vs alpha value # Plot model RMSE vs alpha value coeff_path(alpha_range,coeffiecients) # View coeffiecient value #view_model(model) def view_model(model): """ Look at model coeffiecients """ print " Model coeffiecients" print "====================== " for i,coef in enumerate(model.coef_): print " Coefficient %d %0.3f"%(i+1,coef) print " Intercept %0.3f"%(model.intercept_) def model_worth(true_y,predicted_y): """ Evaluate the model """ print " Mean squared error = %0.2f "%(mean_squared_error(true_y,predicted_y))
We will define two functions, coeff_path
and get_coeff
, to inspect our model coefficients. The coeff_path
function is invoked from the build_model
function to plot the weights of the coefficients for different alpha values. The get_coeff
function is invoked from the main function:
def coeff_path(alpha_range,coeffiecients): """ Plot residuals """ plt.close('all') plt.cla() plt.figure(1) plt.xlabel("Alpha Values") plt.ylabel("Coeffiecient Weight") plt.title("Coeffiecient weights for different alpha values") plt.plot(alpha_range,coeffiecients) plt.axis('tight') plt.show() def get_coeff(x,y,alpha): model = Lasso(normalize=True,alpha=alpha) model.fit(x,y) coefs = model.coef_ indices = [i for i,coef in enumerate(coefs) if abs(coef) > 0.0] return indices
Finally, we will write our main
function, which is used to invoke all the preceding functions:
if __name__ == "__main__": x,y = get_data() # Build multiple models for different alpha values # and plot them build_models(x,y) print " Predicting using all the variables" full_model = LinearRegression(normalize=True) full_model.fit(x,y) predicted_y = full_model.predict(x) model_worth(y,predicted_y) print " Models at different alpha values " alpa_values = [0.22,0.08,0.01] for alpha in alpa_values: indices = get_coeff(x,y,alpha) print " alpah =%0.2f Number of variables selected = %d "%(alpha,len(indices)) print " attributes include ", indices x_new = x[:,indices] model = LinearRegression(normalize=True) model.fit(x_new,y) predicted_y = model.predict(x_new) model_worth(y,predicted_y)
Let's start with the main module and follow the code. We will load the predictor x
and response variable y
using the get_data
function. The function invokes scikit-learn's convenient load_boston()
function to retrieve the Boston house pricing dataset as NumPy arrays.
We will proceed by calling build_models
. In build_models
, we will construct multiple models for the different values of alpha
:
alpha_range = np.linspace(0,0.5,200) model = Lasso(normalize=True) coeffiecients = [] # Fit a model for each alpha value for alpha in alpha_range: model.set_params(alpha=alpha) model.fit(x,y) # Track the coeffiecients for plot coeffiecients.append(model.coef_)
As you can see, in the for loop, we also store the coefficient values for different values of alpha in a list.
Let's plot the coefficient values for different alpha values by calling the coeff_path
function:
plt.close('all') plt.cla() plt.figure(1) plt.xlabel("Alpha Values") plt.ylabel("Coeffiecient Weight") plt.title("Coeffiecient weights for different alpha values") plt.plot(alpha_range,coeffiecients) plt.axis('tight') plt.show()
In the x
axis, you can see that we have the alpha values, and in the y
axis, we will plot all the coefficients for a given alpha value. Let's see the output plot:
The different colored lines represent different coefficient values. As you can see, as the value of alpha increases, the coefficient weights merge towards zero. From this plot, we can select the value of alpha.
For our reference, let's fit a simple linear regression model:
print " Predicting using all the variables" full_model = LinearRegression(normalize=True) full_model.fit(x,y) predicted_y = full_model.predict(x) model_worth(y,predicted_y)
Let's look at the mean square error when we try to predict using our newly built model:
Let's proceed to select the coefficients based on LASSO:
print " Models at different alpha values " alpa_values = [0.22,0.08,0.01] for alpha in alpa_values: indices = get_coeff(x,y,alpha)
Based on our preceding graph, we selected 0.22
, 0.08
, and 0.01
as the alpha values. In the loop, we will call the get_coeff
method. This method fits a LASSO model with the given alpha values and returns only the non-zero coefficients' indices:
model = Lasso(normalize=True,alpha=alpha) model.fit(x,y) coefs = model.coef_ indices = [i for i,coef in enumerate(coefs) if abs(coef) > 0.0]
Essentially, we are selecting only those attributes that have a non-zero coefficient value—feature selection. Let's get back to our for
loop where we will fit a linear regression model with the reduced coefficients:
print " alpah =%0.2f Number of variables selected = %d "%(alpha,len(indices)) print " attributes include ", indices x_new = x[:,indices] model = LinearRegression(normalize=True) model.fit(x_new,y) predicted_y = model.predict(x_new) model_worth(y,predicted_y)
What we want to know is how good our models would be if we predicted them with the reduced set of attributes, compared with the model that we built initially using the whole dataset:
Look at the first pass where our alpha value is 0.22
. There are only two coefficients with non-zero values, 5
and 12
. The mean squared error is 30.51
, which is only 9
more than the model fitted with all the variables.
Similarly, for the alpha value of 0.08
, there are three non-zero coefficients. We can see some improvement in the mean squared error. Finally, with 0.01
alpha value, 9 out of 13 attributes are selected and the mean square error is very close to the model built with all the attributes.
As you can see, we didn't fit the model with all the attributes. We are able to choose a subset of the attributes automatically using LASSO. Thus, we have seen how LASSO can be used for variable selection.
By keeping only the most important variables, LASSO avoids overfitting. However, as you can see, the mean squared error values are not that good. We can see that there is a loss in the predictive power because of LASSO.
As said before, in the case of the correlated variables, LASSO selects only one of them, whereas ridge assigns equal weights to the coefficients of both the variables. Hence, ridge has a higher predictive power compared with LASSO. However, LASSO can do variable selection, which Ridge is not capable of performing.