Learning regression with L2 shrinkage – ridge

Let's extend the regression technique discussed before to include regularization. While training a linear regression model, some of the coefficients can take very high values, leading to an unstable model. Regularization, or shrinkage, is a way of controlling the coefficient weights so that they don't take very large values. Let's look at the linear regression cost function once again to understand what issues are inherent in plain regression, and what we mean by controlling the coefficient weights:

$$\sum_{i=1}^{n}\left(y_i - w_0 - \sum_{j=1}^{m} w_j x_{ij}\right)^2$$

Linear regression tries to find the coefficients w0…wm that minimize the preceding equation. There are a few issues with linear regression.

If the dataset contains many correlated predictors, very small changes in the data can lead to an unstable model. Additionally, we will face a problem with interpreting the model results. For example, if we have two variables that are negatively correlated, they will have opposite effects on the response variable. We can manually inspect these correlations, remove one of the offending variables, and then proceed with the model building. However, it would be helpful if we could handle such scenarios automatically.

We introduced a method called recursive feature elimination in the previous recipe to keep the most informative attributes and discard the rest. However, in that approach, we either keep a variable or drop it; our decisions are binary. In this recipe, we will see a way to control the weights associated with the variables such that unnecessary variables are penalized heavily and receive extremely low weights.

We will change the cost function of linear regression to include the coefficients. As you know, the value of the cost function should be at a minimum for the best model. By including the coefficients in the cost function, we can heavily penalize the coefficients that take a very high value. In general, these techniques are known as shrinkage methods, as they try to shrink the value of the coefficients. In this recipe, we will see L2 shrinkage, most commonly called ridge regression. Let's look at the cost function for ridge regression:

$$\sum_{i=1}^{n}\left(y_i - w_0 - \sum_{j=1}^{m} w_j x_{ij}\right)^2 \;+\; \alpha \sum_{j=1}^{m} w_j^2$$

As you can see, the sum of the squares of the coefficients is added to the cost function. This way, when the optimization routine tries to minimize the preceding function, it must also keep the coefficient values small to attain its objective. The alpha parameter decides the amount of shrinkage: the greater the alpha value, the greater the shrinkage, and the more the coefficient values are driven toward zero (unlike L1 shrinkage, ridge rarely sets them exactly to zero).
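To make the effect of alpha concrete, the following short sketch (not part of the original recipe; the synthetic data and alpha values are chosen purely for illustration) fits ridge models with increasing alpha values on a tiny dataset and prints the coefficients, showing them shrinking toward zero:

# Illustrative sketch: how increasing alpha shrinks the ridge coefficients toward zero
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)       # x2 is strongly correlated with x1
X = np.column_stack((x1, x2))
y = 3 * x1 + 2 * x2 + rng.normal(scale=0.5, size=100)

for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    print("alpha = %6.2f  coefficients = %s" % (alpha, np.round(model.coef_, 3)))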

With this little math background, let's jump into our recipe to see ridge regression in action.

Getting ready

Once again, we will use the Boston dataset to demonstrate ridge regression. The Boston data has 13 attributes and 506 instances. The target variable is a real number: the median value of owner-occupied homes in thousands of dollars. Refer to the following UCI link for more information about the Boston dataset:

https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names

We intend to generate the polynomial features of degree two and consider only the interaction effects. At the end of this recipe, we will see how much the coefficients are penalized.

How to do it…

We will start by loading all the necessary libraries. We will follow it up by defining our first function, get_data(). In this function, we will read the Boston dataset and return it as predictor x and response variable y:

# Load libraries
import numpy as np
from sklearn.datasets import load_boston
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures


def get_data():
    """
    Return boston dataset
    as x - predictor and
    y - response variable
    """
    data = load_boston()
    x    = data['data']
    y    = data['target']
    x    = x - np.mean(x,axis=0)
    
    return x,y    

In our next build_model function, we will construct our ridge regression model with the given data. The following two functions, view_model and model_worth, are used to introspect the model that we built:

def build_model(x,y):
    """
    Build a Ridge regression model
    """
    model = Ridge(normalize=True,alpha=0.015)
    model.fit(x,y)
    # Track the scores- Mean squared residual for plot
    return model    

def view_model(model):
    """
    Look at model coefficients
    """
    print "\n Model coefficients"
    print "======================\n"
    for i,coef in enumerate(model.coef_):
        print "\tCoefficient %d  %0.3f"%(i+1,coef)

    print "\n\tIntercept %0.3f"%(model.intercept_)

def model_worth(true_y,predicted_y):
    """
    Evaluate the model
    """
    mse = mean_squared_error(true_y,predicted_y)
    print "\tMean squared error = %0.2f"%(mse)
    return mse

Finally, we will write our main function, which is used to invoke all the preceding functions:

if __name__ == "__main__":

    x,y = get_data()

    # Divide the data into train, dev, and test sets
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size=0.3,random_state=9)
    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)

    # Prepare some polynomial features
    poly_features = PolynomialFeatures(interaction_only=True)
    poly_features.fit(x_train)
    x_train_poly = poly_features.transform(x_train)
    x_dev_poly   = poly_features.transform(x_dev)
    x_test_poly  = poly_features.transform(x_test)

    # Build the ridge regression model on the training data
    choosen_model = build_model(x_train_poly,y_train)

    predicted_y = choosen_model.predict(x_train_poly)
    print "\n Model Performance in Training set (Polynomial features)\n"
    mse = model_worth(y_train,predicted_y)
    view_model(choosen_model)

    # Apply the model on the dev set
    predicted_y = choosen_model.predict(x_dev_poly)
    print "\n Model Performance in Dev set (Polynomial features)\n"
    model_worth(y_dev,predicted_y)

    # Apply the model on the test set
    predicted_y = choosen_model.predict(x_test_poly)
    print "\n Model Performance in Test set (Polynomial features)\n"
    model_worth(y_test,predicted_y)

How it works…

Let's start with the main module and follow the code. We load the predictor x and response variable y using the get_data function. This function invokes scikit-learn's convenient load_boston() function to retrieve the Boston house pricing dataset as NumPy arrays.

We then divide the data into train and test sets using the train_test_split function from the scikit-learn library, reserving 30 percent of the dataset for testing. In the next line, we split this held-out portion once more: 70 percent of it becomes the dev set and the remaining 30 percent (about 9 percent of the original data) becomes the final test set.

We will then build the polynomial features:

poly_features = PolynomialFeatures(interaction_only=True)
poly_features.fit(x_train)

As you can see, we set interaction_only to True. With two attributes, x1 and x2, only the x1*x2 interaction term is created; the squares of x1 and x2 are not, given a degree of two. The default degree is two:

x_train_poly = poly_features.transform(x_train)
x_dev_poly = poly_features.transform(x_dev)
x_test_poly = poly_features.transform(x_test)

Using the transform function, we will transform our train, dev, and test datasets to include the polynomial features.
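To see concretely what interaction_only produces, here is a tiny illustrative sketch (not part of the recipe) on a single row with two attributes:

# Illustrative sketch: interaction-only polynomial features on two attributes
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

demo = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, interaction_only=True)
print(poly.fit_transform(demo))    # [[1. 2. 3. 6.]] -> bias, x1, x2, x1*x2; no squared terms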

In the next line, we will build our ridge regression model using the training dataset by calling the build_model method:

model = Ridge(normalize=True,alpha=0.015)
model.fit(x,y)

By setting the normalize parameter to True, the attributes in the dataset are centered by subtracting their mean and scaled by dividing by their l2-norm before the regression is fit. Alpha controls the amount of shrinkage; its value is set to 0.015 here. We didn't arrive at this number magically, but by running the model several times. Later in this chapter, we will see how to arrive at the right value for this parameter empirically. We also fit the intercept for this model via the fit_intercept parameter; however, fit_intercept is set to True by default, and hence we don't specify it explicitly.

Let's now see how the model has performed in the training set. We will call the model_worth method to get the mean square error. This method takes the predicted response variable and the actual response variable to return the mean square error:

predicted_y = choosen_model.predict(x_train_poly)
print "\n Model Performance in Training set (Polynomial features)\n"
mse = model_worth(y_train,predicted_y)

The output shows the mean squared error on the training set.


Before we apply our model to the test set, let's look at the coefficient weights. We will call the view_model function to view them:

view_model(choosen_model)

We have not shown all the coefficients. There are a total of 92: a bias term, the 13 original attributes, and 78 pairwise interaction terms (1 + 13 + 78 = 92). However, looking at some of them, the shrinkage effect should be visible. For instance, Coefficient 1 is almost 0 (remember that it is a very small value and only the first three decimal places are shown).

Let's proceed to see how our model has performed in the dev set:

predicted_y = choosen_model.predict(x_dev_poly)
print "\n Model Performance in Dev set (Polynomial features)\n"
model_worth(y_dev,predicted_y)

Not bad; we have reached a mean squared error lower than our training error. Finally, let's look at how our model performs on the test set.


Compared with our linear regression model in the previous recipe, we performed better on our test set.

There's more…

We mentioned earlier that linear regression models are very sensitive to even small changes in the dataset. Let's see a small example that will demonstrate this:

# Load libraries
import numpy as np
from sklearn.datasets import load_boston
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures


def get_data():
    """
    Return boston dataset
    as x - predictor and
    y - response variable
    """
    data = load_boston()
    x    = data['data']
    y    = data['target']
    x    = x - np.mean(x,axis=0)
    
    return x,y    

In this code, we will fit both the linear regression and ridge regression models on the original data using the build_model function:

lin_model,ridg_model = build_model(x,y)
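The build_model variant used in this example is not listed in full; a minimal sketch, assuming it simply fits a plain linear regression model and a ridge model on the same data and returns both (the alpha value below is an assumption), could look like this:

def build_model(x,y):
    """
    Fit a plain linear regression model and a ridge regression
    model on the same data and return both.
    """
    # LinearRegression is imported from sklearn.linear_model alongside Ridge
    lin_model = LinearRegression(normalize=True,fit_intercept=True)
    lin_model.fit(x,y)

    ridg_model = Ridge(normalize=True,fit_intercept=True,alpha=0.015)
    ridg_model.fit(x,y)

    return lin_model,ridg_model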

We will introduce a small amount of noise into our original data, as follows:

# Add some noise to the dataset
noise = np.random.normal(0,1,(x.shape))
x = x + noise

Once again, we will fit the models on the noisy dataset. Finally, we will compare the weights assigned by the models fit before and after adding the noise.

After adding this small amount of noise, the weights assigned by the newly fit linear regression model are very different from the weights assigned by the model fit on the original data. Now, let's see how ridge regression performs.

The ridge weights have not varied starkly between the first and second models, which demonstrates the stability of ridge regression under noisy data conditions.
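The weight comparison itself is read off the printed coefficients; as a rough sketch (assuming the two-model build_model shown above, with variable names chosen here purely for illustration), the before-and-after weights could also be compared programmatically like this:

# Sketch: quantify how much the weights move when noise is added
x, y = get_data()
lin_model, ridg_model = build_model(x, y)

noise = np.random.normal(0, 1, x.shape)
lin_model_noisy, ridg_model_noisy = build_model(x + noise, y)

print("Max change in linear regression weights = %0.3f" % np.abs(lin_model.coef_ - lin_model_noisy.coef_).max())
print("Max change in ridge regression weights  = %0.3f" % np.abs(ridg_model.coef_ - ridg_model_noisy.coef_).max())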

It is always tricky to choose the appropriate alpha value. A brute-force approach is to run the model with multiple alpha values and trace the path of the coefficients; from the path, choose the alpha value where the weights don't vary dramatically. We will plot the coefficient weights using the coeff_path function.

Let's look at the coeff_path function. It first generates a list of the alpha values:

alpha_range = np.linspace(10,100.2,300)

In this case, we generated 300 uniformly spaced numbers between 10 and 100.2. For each of these alpha values, we will build a model and save its coefficients:

coeffs = []
for alpha in alpha_range:
    model = Ridge(normalize=True,alpha=alpha)
    model.fit(x,y)
    coeffs.append(model.coef_)

Finally, we will plot these coefficient weights against the alpha values.

In the plot, you can see that the coefficient values stabilize around an alpha value of 100. You can further zoom into a range close to 100 and look for an ideal value.
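Putting the fragments together, a minimal sketch of the complete coeff_path function (only the fragments above come from the recipe; the use of matplotlib and the plot labels are assumptions) might look like this:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

def coeff_path(x,y):
    """
    Fit ridge models over a range of alpha values and
    plot the coefficient weights against alpha.
    """
    alpha_range = np.linspace(10,100.2,300)
    coeffs = []
    for alpha in alpha_range:
        model = Ridge(normalize=True,alpha=alpha)
        model.fit(x,y)
        coeffs.append(model.coef_)

    coeffs = np.asarray(coeffs)
    plt.plot(alpha_range,coeffs)
    plt.xlabel("alpha")
    plt.ylabel("coefficient weight")
    plt.title("Coefficient weights versus alpha")
    plt.show()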

See also

  • Predicting real valued numbers using regression recipe in Chapter 7, Machine Learning II
  • Scaling the data recipe in Chapter 3, Data Analysis – Explore and Wrangle
  • Standardizing the data recipe in Chapter 3, Data Analysis – Explore and Wrangle
  • Preparing data for model building recipe in Chapter 6, Machine Learning I