Using cross-validation iterators with L1 and L2 shrinkage

In the previous chapter, we saw methods to divide the data into train and test sets. In the subsequent recipes, we again performed a split on the test dataset to arrive at a dev dataset. The idea was to keep the test set away from the model-building cycle. However, as we need to improve our model continuously, we used the dev set to test the model's accuracy in each iteration. Though this is a good approach, it is difficult to apply if we don't have a large dataset: we want to provide as much data as possible to train our model, yet we still need to hold some of the data back for evaluation and final testing. In many real-world scenarios, it is rare to get a very large dataset.

In this recipe, we will see a method called cross-validation that helps us address this issue. This approach is typically called k-fold cross-validation: the training set is divided into k folds, the model is trained on k-1 folds, and the left-out fold is used for testing. This way, we don't need a separate dev dataset.

Let's look at some of the iterators provided by the scikit-learn library to perform k-fold cross-validation effectively. Equipped with this knowledge of cross-validation, we will then see how we can leverage it to select the alpha values in shrinkage methods.

Getting ready

We will use the Iris dataset to demonstrate the various cross-validation iterators. We will then return to our Boston housing dataset to demonstrate how cross-validation can be used to find the ideal alpha values in shrinkage methods.

How to do it…

Let's look at how to use the cross-validation iterators:

from sklearn.datasets import load_iris
from sklearn.cross_validation import KFold,StratifiedKFold

def get_data():
    data = load_iris()
    x = data['data']
    y = data['target']
    return x,y

def class_distribution(y):
    class_dist = {}
    total = 0
    for entry in y:
        try:
            class_dist[entry]+=1
        except KeyError:
            class_dist[entry]=1
        total+=1

    for k,v in class_dist.items():
        print "\tclass %d percentage = %0.2f"%(k,v/(1.0*total))

if __name__ == "__main__":
    x,y = get_data()

    # K-fold with 3 folds
    kfolds = KFold(n=y.shape[0],n_folds=3)
    fold_count = 1
    print
    for train,test in kfolds:
        print "Fold %d x train shape"%(fold_count),x[train].shape, \
              " x test shape",x[test].shape
        fold_count+=1
    print

    # Stratified K-fold
    skfolds = StratifiedKFold(y,n_folds=3)
    fold_count = 1
    for train,test in skfolds:
        print "\nFold %d x train shape"%(fold_count),x[train].shape, \
              " x test shape",x[test].shape
        y_train = y[train]
        y_test  = y[test]
        print "Train Class Distribution"
        class_distribution(y_train)
        print "Test Class Distribution"
        class_distribution(y_test)

        fold_count+=1

    print

In our main function, we will call the get_data function to load the Iris dataset. We will then proceed to demonstrate a simple k-fold and stratified k-folds.

With this newfound knowledge of k-fold cross-validation, let's write a recipe that leverages it to enhance ridge regression:

# Load libraries
from sklearn.datasets import load_boston
from sklearn.cross_validation import KFold,train_test_split
from sklearn.linear_model import Ridge
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
import numpy as np


def get_data():
    """
    Return boston dataset
    as x - predictor and
    y - response variable
    """
    data = load_boston()
    x    = data['data']
    y    = data['target']
    return x,y    

We will start by loading all the necessary libraries. We will follow it up by defining our first function, get_data(). In this function, we will read the Boston dataset and return it as a predictor x and response variable y.

In our next function, build_model, we will construct our ridge regression model with the given data, leveraging k-fold cross-validation.

The following two functions, view_model and model_worth, are used to introspect the model that we built.

Finally, we will write the display_param_results function to view the average cross-validation error for each of the parameter values that were evaluated:

def build_model(x,y):
    """
    Build a Ridge regression model
    """
    kfold = KFold(y.shape[0],5)
    model = Ridge(normalize=True)

    alpha_range = np.linspace(0.0015,0.0017,30)
    grid_param = {"alpha":alpha_range}
    grid = GridSearchCV(estimator=model,param_grid=grid_param,cv=kfold,scoring='mean_squared_error')
    grid.fit(x,y)
    display_param_results(grid.grid_scores_)
    print grid.best_params_
    # Track the scores - mean squared residual for plot
    return grid.best_estimator_

def view_model(model):
    """
    Look at model coefficients
    """
    print "\nModel coefficients"
    print "======================\n"
    for i,coef in enumerate(model.coef_):
        print "\tCoefficient %d  %0.3f"%(i+1,coef)

    print "\n\tIntercept %0.3f"%(model.intercept_)

def model_worth(true_y,predicted_y):
    """
    Evaluate the model
    """
    print "\tMean squared error = %0.2f"%(mean_squared_error(true_y,predicted_y))
    return mean_squared_error(true_y,predicted_y)

def display_param_results(param_results):
    # Each entry in grid_scores_ corresponds to one candidate alpha value:
    # (parameter setting, mean validation score, per-fold scores)
    param_count = 1
    for param_result in param_results:
        print "Parameter set %d Mean squared error %0.2f"%(param_count,abs(param_result[1])),param_result[0]
        param_count+=1

Finally, we will write our main function, which is used to invoke all the preceding functions:

if __name__ == "__main__":
    
    x,y = get_data()
    
    # Divide the data into Train and test    
    x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state=9)

    #Prepare some polynomial features
    poly_features = PolynomialFeatures(interaction_only=True)
    poly_features.fit(x_train)
    x_train_poly = poly_features.transform(x_train)
    x_test_poly  = poly_features.transform(x_test)
    
    chosen_model = build_model(x_train_poly,y_train)
    predicted_y = chosen_model.predict(x_train_poly)
    model_worth(y_train,predicted_y)

    view_model(chosen_model)

    predicted_y = chosen_model.predict(x_test_poly)
    model_worth(y_test,predicted_y)

How it works…

Let's start with our main function, beginning with the KFold class. This iterator class is instantiated with the number of instances in our dataset and the number of folds that we require:

kfolds = KFold(n=y.shape[0],n_folds=3)

Now, we can iterate through the folds, as follows:

fold_count = 1
print
for train,test in kfolds:
    print "Fold %d x train shape"%(fold_count),x[train].shape, \
          " x test shape",x[test].shape
    fold_count+=1

The print statements produce the train and test shapes for each fold.

We can see that the data is split into three parts, each with 100 instances to train and 50 instances to test.

Next, we will move on to StratifiedKFold. Recall our discussion from the previous chapter on having a uniform class distribution in the train and test split. StratifiedKFold achieves a uniform class distribution across the three folds.

It is invoked as follows:

skfolds = StratifiedKFold(y,n_folds=3)

As it needs to know the distribution of the class label in the dataset, this iterator object takes the response variable y as one of its parameters. The other parameter is the number of folds requested.
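
As a quick sanity check that is not part of the recipe, you can also count the labels per fold with NumPy's bincount instead of the class_distribution helper. The following is a minimal sketch, assuming the same Iris x and y and the deprecated sklearn.cross_validation module used throughout this recipe:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cross_validation import StratifiedKFold

data = load_iris()
x,y = data['data'],data['target']

# Each stratified fold should hold roughly the same number of samples per class
for fold,(train,test) in enumerate(StratifiedKFold(y,n_folds=3),start=1):
    print "Fold %d train label counts"%(fold),np.bincount(y[train])
    print "Fold %d test label counts "%(fold),np.bincount(y[test])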

Let's print the shape of our train and test sets in these three folds, along with their class distribution. We will use the class_distribution function to print the distribution of the classes in each of the folds:

fold_count = 1
for train,test in skfolds:
    print "\nFold %d x train shape"%(fold_count),x[train].shape, \
          " x test shape",x[test].shape
    y_train = y[train]
    y_test = y[test]
    print "Train Class Distribution"
    class_distribution(y_train)
    print "Test Class Distribution"
    class_distribution(y_test)

    fold_count+=1

You can see that the classes are distributed uniformly.

Let's assume that you build a five-fold dataset, you fit five different models, and you have five different accuracy scores. You can now take the mean of these scores to evaluate how good your model has turned out to be. If you are not satisfied, you can go ahead and start rebuilding your model with a different set of parameters and again run it on the five-fold data and see the mean accuracy score. This way, you can continuously improve the model by finding the right parameter values only using the training dataset.
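
If you just need the averaged score, scikit-learn can do the fold-by-fold fitting and averaging for you through the cross_val_score helper in the same (now deprecated) sklearn.cross_validation module used in this recipe. The following is a minimal sketch; the KNeighborsClassifier is purely an illustrative choice and not part of this recipe:

from sklearn.datasets import load_iris
from sklearn.cross_validation import StratifiedKFold,cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
x,y = data['data'],data['target']

# Five stratified folds; cross_val_score fits one model per fold and
# returns the five held-out accuracy scores
model = KNeighborsClassifier()
scores = cross_val_score(model,x,y,cv=StratifiedKFold(y,n_folds=5))
print "Fold accuracies",scores
print "Mean accuracy = %0.3f"%(scores.mean())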

Armed with this knowledge, let's revisit our old ridge regression problem.

Let's start with the main module and follow the code. We will load the predictor x and response variable y using the get_data function. This function invokes scikit-learn's convenient load_boston() function to retrieve the Boston house pricing dataset as NumPy arrays.

We will proceed to divide the data into train and test sets using the train_test_split function from the scikit-learn library. We will reserve 30 percent of our dataset to test.

We will then build the polynomial features:

poly_features = PolynomialFeatures(interaction_only=True)
poly_features.fit(x_train)

As you can see, we set interaction_only to True. With interaction_only set to True and attributes x1 and x2, only the interaction term x1*x2 is created; the squared terms x1^2 and x2^2 are not. The default degree is two.
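
To make the effect of interaction_only concrete, here is a small illustrative sketch that is not part of the recipe; it uses a hypothetical one-row, two-feature array. With interaction_only=True you get the bias term, x1, x2, and x1*x2, whereas the default setting also adds x1^2 and x2^2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

toy = np.array([[2.0,3.0]])   # one row with x1=2 and x2=3

interactions = PolynomialFeatures(degree=2,interaction_only=True)
print interactions.fit_transform(toy)   # [[ 1. 2. 3. 6.]] -> 1, x1, x2, x1*x2

full_poly = PolynomialFeatures(degree=2)
print full_poly.fit_transform(toy)      # [[ 1. 2. 3. 4. 6. 9.]] -> adds x1^2 and x2^2

Back in the recipe, the fitted transformer is applied to both splits: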

x_train_poly = poly_features.transform(x_train)
x_test_poly = poly_features.transform(x_test)

Using the transform function, we will transform our train and test dataset to include the polynomial features. Let's call the build_model function. The first thing that we notice in the build_model function is the k-fold declaration. We will apply our knowledge of cross-validation here and create a five-fold dataset:

kfold = KFold(y.shape[0],5)

We will then create our ridge object:

model = Ridge(normalize=True)

Let's now see how we can leverage our k-folds to figure out the ideal alpha value for our ridge regression. In the next line, we will create an object out of GridSearchCV:

grid = GridSearchCV(estimator=model,param_grid=grid_param,cv=kfold,scoring='mean_squared_error')

GridSearchCV is a convenient utility from scikit-learn that helps us train our models over a range of parameters. In this case, we want to find the ideal alpha value, and hence we would like to train our models with different alpha values. Let's look at the parameters passed to GridSearchCV:

estimator: This is the type of model that should be run with the given parameters and data. In our case, we want to run ridge regression. Hence, we will create a ridge object and pass it to GridSearchCV.

param_grid: This is a dictionary of parameters that we want to evaluate our model on. Let's work through this in detail. We will first declare the range of alpha values that we want to build our model on:

alpha_range = np.linspace(0.0015,0.0017,30)

This gives us a NumPy array of 30 uniformly spaced elements starting from 0.0015 and ending at 0.0017. We want to build a model for each of these values. We will create a dictionary object called grid_param and make an entry under the alpha key with the generated NumPy array of alpha values:

grid_param = {"alpha":alpha_range}

We will pass this dictionary as a parameter to GridSearchCV. Look at the entry, param_grid=grid_param.

cv: This defines the kind of cross-validation that we are interested in. We will pass the k-fold (five-fold) iterator that we created before as the cv parameter.

Finally, we need to define a scoring function. In our case, we are interested in the mean squared error; this is the metric with which we will evaluate our model. Note that scikit-learn's scorers treat greater values as better, so the 'mean_squared_error' scoring option reports negated values; this is why display_param_results takes the absolute value of the mean validation score.

So, internally, GridSearchCV will build five models for each of our parameter values and return the mean score obtained on the left-out folds. In our case, we have five folds, so the average of the scores across these five held-out folds is returned.
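
To see roughly what the grid search does under the hood, here is an illustrative sketch that is not the recipe's code: it loops over the candidate alpha values, averages the held-out mean squared error by hand, and picks the smallest. To stay self-contained, it loads the Boston data directly, without the polynomial features:

import numpy as np
from sklearn.datasets import load_boston
from sklearn.cross_validation import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

data = load_boston()
x,y = data['data'],data['target']

def mean_cv_mse(alpha,n_folds=5):
    # Average held-out mean squared error over the folds for one alpha value
    fold_errors = []
    for train,test in KFold(y.shape[0],n_folds=n_folds):
        model = Ridge(normalize=True,alpha=alpha)
        model.fit(x[train],y[train])
        fold_errors.append(mean_squared_error(y[test],model.predict(x[test])))
    return np.mean(fold_errors)

# Score every candidate alpha and keep the one with the lowest average error
alpha_range = np.linspace(0.0015,0.0017,30)
cv_errors = [mean_cv_mse(alpha) for alpha in alpha_range]
print "Best alpha = %0.6f"%(alpha_range[np.argmin(cv_errors)])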

With this explained, we will then fit our model, that is, start our grid search activity.

Finally, we want to see the output at the various parameter settings. We will use the display_param_results function to display the mean squared error for each candidate alpha value, averaged across the folds.

Each line in the output displays the alpha value and the average mean squared error from the test folds. We can see that as we move deeper into the 0.0016 range, the mean squared error increases. Hence, we decide to stop at 0.0015. We can query the grid object to get the best parameter and estimator:

print grid.best_params_
return grid.best_estimator_

This was not the first set of alpha values that we tested. Our initial alpha values were as follows:

alpha_range = np.linspace(0.01,1.0,30)

With this range, the mean squared error shot up once the alpha values rose above 0.01. Hence, we tried a new range:

alpha_range = np.linspace(0.001,0.1,30)

In this way, we iteratively arrived at the range starting at 0.0015 and ending at 0.0017.

We will then get the best estimator from our grid search and apply it to our train and test data:

chosen_model = build_model(x_train_poly,y_train)
predicted_y = chosen_model.predict(x_train_poly)
model_worth(y_train,predicted_y)

Our model_worth function prints the mean squared error on our training dataset.

Let's view our coefficient weights using the view_model function.

We have not reproduced all of the coefficients here, but when you run the code, you can view all the values.

Finally, let's apply the model to our test dataset.

Thus, we used cross-validation and grid search to arrive at an alpha value for our ridge regression successfully. Our model has resulted in a lower mean squared error compared with the value in the ridge regression recipe.

There's more…

There are other cross-validation iterators available with scikit-learn. Of particular interest in this case is the leave-one-out iterator. You can read more about this at http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-loo.

In this method, each iteration leaves exactly one record out for testing and uses all the remaining records for training. For example, if your input data has 100 instances, we get 100 folds, each with 99 instances to train and one instance to test.
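
The following is a minimal sketch of the leave-one-out iterator, again using the deprecated sklearn.cross_validation module that the rest of this recipe relies on; the class takes the number of instances rather than a number of folds:

from sklearn.datasets import load_iris
from sklearn.cross_validation import LeaveOneOut

data = load_iris()
x,y = data['data'],data['target']

# One fold per instance: 150 folds for the Iris dataset
loo = LeaveOneOut(y.shape[0])
fold_count = 0
for train,test in loo:
    fold_count+=1
print "Number of folds =",fold_count
print "Train size =",x[train].shape[0]," Test size =",x[test].shape[0]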

In the grid search method that we used before, if we don't provide a custom iterator to the cross-validation (cv) parameter, GridSearchCV falls back to its default, which is three-fold cross-validation (stratified when the estimator is a classifier), rather than leave-one-out:

grid = GridSearchCV(estimator=model,param_grid=grid_param,cv=None,scoring='mean_squared_error')

See also

  • Scaling the data recipe in Chapter 3, Data Analysis – Explore and Wrangle
  • Standardizing the data recipe in Chapter 3, Data Analysis – Explore and Wrangle
  • Preparing data for model building recipe in Chapter 6, Machine Learning I
  • Regression with L2 Shrinkage – Ridge recipe in Chapter 7, Machine Learning II
  • Regression with L2 Shrinkage – Lasso recipe in Chapter 7, Machine Learning II