Using stochastic gradient descent for classification

A classification problem setup is very similar to a regression setup, except for the response variable: in classification, the response is a categorical variable. Because of this, we need a different loss function to measure the cost of wrong predictions. For our discussion and recipe, let's assume a binary classifier whose target variable, Y, can take the values {0, 1}.

We will use the derivative of this loss function in our weight update rule to arrive at our weight vectors.

The SGDClassifier class from scikit-learn provides a variety of loss functions. However, in this recipe, we will use log loss, which gives us logistic regression.

Logistic regression fits a linear model to the data of the following form:

$z_i = w^T x_i = w_0 + w_1 x_{i1} + \dots + w_m x_{im}$

We have given a generalized notation here; the intercept is assumed to be the first dimension of our weight vector (so the first component of each x_i is a constant 1). For a binary classification problem, the logistic (sigmoid) function is applied to this linear combination to get a prediction, as follows:

$P(y_i = 1 \mid x_i) = \sigma(w^T x_i) = \dfrac{1}{1 + e^{-w^T x_i}}$

The preceding function is also called the sigmoid function. For very large positive values of $w^T x_i$, it returns a value close to one; for very large negative values, it returns a value close to zero. With this, we can define our log loss function as follows:

$L(w) = -\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right], \quad \text{where } \hat{p}_i = \sigma(w^T x_i)$

With the preceding loss function plugged into the gradient descent weight update rule, we can arrive at the appropriate weight vector.
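To make this concrete, here is a minimal NumPy sketch, not part of the recipe itself, that computes the sigmoid, the per-instance log loss, and a single stochastic gradient descent weight update; the values of x_i, y_i, w, and eta are purely illustrative:

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: a three-dimensional instance with the intercept
# folded in as the first dimension (x_i[0] = 1), as described previously
x_i = np.array([1.0, 0.5, -1.2])
y_i = 1                  # true label
w = np.zeros(3)          # current weight vector
eta = 0.0001             # constant learning rate

p_i = sigmoid(np.dot(w, x_i))    # predicted probability of class 1
log_loss_i = -(y_i * np.log(p_i) + (1 - y_i) * np.log(1 - p_i))

# The gradient of the log loss with respect to w for this instance
# is (p_i - y_i) * x_i, so one stochastic update of the weights is:
w = w - eta * (p_i - y_i) * x_i
print("Per-instance log loss = %0.4f" % log_loss_i)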

For the log loss function defined in scikit-learn, refer to the following URL:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html.

With this knowledge, let's jump into our recipe for stochastic gradient descent-based classification.

Getting ready

We will leverage scikit-learn's implementation of the stochastic gradient descent classifier. As we did in some of the previous recipes, we will use the make_classification function from scikit-learn to generate data for our recipe in order to demonstrate the stochastic gradient descent classification.

How to do it…

Let's start with a very simple example demonstrating how to build a stochastic gradient descent classifier.

We will first load the required libraries. We will then write a function to generate the predictors and response variables:

from sklearn.datasets import make_classification
from sklearn.metrics import  accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import SGDClassifier

import numpy as np

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features 
            ,n_repeated = repeated_features,random_state=7)
    return x,y

We will proceed to write functions that will help us build and validate our model:

def build_model(x,y,x_dev,y_dev):
    estimator = SGDClassifier(n_iter=50,shuffle=True,loss="log", 
                learning_rate = "constant",eta0=0.0001,fit_intercept=True, penalty="none")
    estimator.fit(x,y)
    train_predicted = estimator.predict(x)
    train_score = accuracy_score(y,train_predicted)
    dev_predicted = estimator.predict(x_dev)
    dev_score = accuracy_score(y_dev,dev_predicted)
    
    print 
    print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score)

Finally, we will write our main function to invoke all the preceding functions:

if __name__ == "__main__":
    x,y = get_data()    

    # Divide the data into Train, dev and test    
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
    
    build_model(x_train,y_train,x_dev,y_dev)

How it works…

Let's start with our main function. We will invoke get_data to get our x predictor attributes and y response attributes. In get_data, we will leverage the make_classification function in order to generate the training data for our SGD classification method:

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
            n_informative = informative_features, n_redundant = redundant_features 
            ,n_repeated = repeated_features,random_state=7)
    return x,y

Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we need 500 instances. The second parameter is the number of attributes required per instance; we say that we need 30. The third parameter, flip_y, randomly flips the labels of 3 percent of the instances. This is done to introduce some noise into our data. The next parameter specifies how many of those 30 features should be informative enough to be used in our classification; we specified that 60 percent of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, the repeated features are duplicate features that are drawn randomly from both the informative and the redundant features; the remaining features are simply filled with random noise, as the quick check below shows.
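As a quick sanity check, not part of the recipe, the feature breakdown implied by these percentages can be computed directly; with 30 features, this gives 18 informative, 3 redundant, and 3 repeated features, leaving 6 noise features:

no_features = 30
informative_features = int(0.6 * no_features)    # 18 informative features
redundant_features = int(0.1 * no_features)      # 3 linear combinations of the informative ones
repeated_features = int(0.1 * no_features)       # 3 duplicates of informative/redundant features
noise_features = no_features - (informative_features +
                                redundant_features + repeated_features)
print("informative=%d redundant=%d repeated=%d noise=%d" %
      (informative_features, redundant_features, repeated_features, noise_features))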

Let's split the data into training and testing sets using train_test_split. We will reserve 30 percent of our data to test:

    # Divide the data into Train, dev and test    
    x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)

Once again, we will leverage train_test_split to split our test data into dev and test sets:

    x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)

With the data divided to build, evaluate, and test the model, we will proceed to build our model:

build_model(x_train,y_train,x_dev,y_dev)

In build_model, we will leverage scikit-learn's SGDClassifier class to build our stochastic gradient descent classifier:

    estimator = SGDClassifier(n_iter=50,shuffle=True,loss="log", 
                learning_rate = "constant",eta0=0.0001,fit_intercept=True, penalty="none")

Let's look at the parameters that we used. The first parameter, n_iter, is the number of times we want to go through our dataset to update the weights; here, we ask for 50 iterations. As with the perceptron, after going through all the records once, we need to shuffle our input records when we start the next iteration; the shuffle parameter is used for this. Its default value is True, and we have included it here only for explanation purposes. Our loss function is log loss: we want to perform logistic regression, and we specify this using the loss parameter. We keep our learning rate, eta, constant, which we declare with the learning_rate parameter; the actual value of the learning rate is provided through the eta0 parameter. We then say that we need to fit the intercept, as we have not centered our data by its mean. Finally, the penalty parameter controls the type of shrinkage required; in our case, we say that we don't need any shrinkage using the none string.
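Since this paragraph talks about the weight vector and the separately fitted intercept, it can be instructive to peek at a fitted model's attributes. The following is a small self-contained sketch, separate from the recipe's build_model function; coef_ and intercept_ are standard SGDClassifier attributes, and the constructor arguments mirror the ones used in this recipe (note that newer scikit-learn releases rename n_iter to max_iter and the "log" loss to "log_loss"):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# A small throwaway dataset, used only to inspect the fitted attributes
x_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=7)
demo_estimator = SGDClassifier(n_iter=50, shuffle=True, loss="log",
                               learning_rate="constant", eta0=0.0001,
                               fit_intercept=True, penalty="none")
demo_estimator.fit(x_demo, y_demo)

# coef_ holds one weight per feature; intercept_ holds the bias term,
# which is fitted separately because fit_intercept=True
print("Weight vector shape = %s" % str(demo_estimator.coef_.shape))   # (1, 10)
print("Intercept = %s" % str(demo_estimator.intercept_))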

We will proceed to build our model by invoking the fit function with our predictor and response variable, and evaluate our model with our training and dev dataset:

    estimator.fit(x,y)
    train_predicted = estimator.predict(x)
    train_score = accuracy_score(y,train_predicted)
    dev_predicted = estimator.predict(x_dev)
    dev_score = accuracy_score(y_dev,dev_predicted)
    
    print 
    print "Training Accuracy = %0.2f Dev Accuracy = %0.2f"%(train_score,dev_score)

Running the main function prints the training and dev set accuracy scores computed in build_model.

There's more…

Regularization (L1, L2, or elastic net) can be applied to SGD classification. The procedure is the same as that for regression, and hence we will not repeat it here; refer to the previous recipe for the details. A quick sketch of the relevant parameters is shown below.
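As a reminder, this is a minimal sketch of how the penalty-related arguments could look for this classifier; alpha and l1_ratio are standard SGDClassifier parameters, and the values used here are illustrative rather than tuned:

# L2 (ridge-like) shrinkage
estimator_l2 = SGDClassifier(n_iter=50, shuffle=True, loss="log",
                             learning_rate="constant", eta0=0.0001,
                             fit_intercept=True, penalty="l2", alpha=0.0001)

# Elastic net: a mix of L1 and L2 shrinkage, controlled by l1_ratio
estimator_en = SGDClassifier(n_iter=50, shuffle=True, loss="log",
                             learning_rate="constant", eta0=0.0001,
                             fit_intercept=True, penalty="elasticnet",
                             alpha=0.0001, l1_ratio=0.15)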

The learning rate, eta, was constant in our example. This need not be the case: the eta value can be reduced with every iteration. The learning_rate parameter can be set to the string 'optimal' or 'invscaling' for this. Refer to the following scikit-learn documentation:

http://scikit-learn.org/stable/modules/sgd.html.

The parameter is specified as follows:

estimator = SGDClassifier(n_iter=50,shuffle=True,loss="log", 
learning_rate = "invscaling",eta0=0.001,fit_intercept=True, penalty="none")

We used the fit method to build our model. As mentioned previously, in large-scale machine learning, not all the data will be available to us at once. When we receive the data in batches, we need to use the partial_fit method instead of fit. Calling the fit method on a new batch would reinitialize the weights, and we would lose all the training information from the previous batch of data. Refer to the following link for more information on partial_fit:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit.
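As an illustration of this batch-wise workflow, here is a minimal sketch, assuming the x_train and y_train arrays from this recipe, that feeds the data to the classifier in chunks using partial_fit; note that the first call needs the full list of classes, because a given batch may not contain every label:

import numpy as np
from sklearn.linear_model import SGDClassifier

batch_size = 100
classes = np.unique(y_train)
estimator = SGDClassifier(shuffle=True, loss="log", learning_rate="constant",
                          eta0=0.0001, fit_intercept=True, penalty="none")

for start in range(0, x_train.shape[0], batch_size):
    x_batch = x_train[start:start + batch_size]
    y_batch = y_train[start:start + batch_size]
    # partial_fit updates the existing weights instead of reinitializing them
    estimator.partial_fit(x_batch, y_batch, classes=classes)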

See also

  • Shrinkage using Ridge Regression recipe in Chapter 7, Machine Learning II
  • Using stochastic gradient descent for regression recipe in Chapter 9, Machine Learning III