A classification problem setup is very similar to a regression setup except for the response variable. In a classification setup, the response is a categorical variable. Due to its nature, we have a different loss function to measure the cost of the wrong predictions. Let's assume a binary classifier for our discussion and recipe, and our target variable, Y, can take the values {0, 1}.
We will use the derivative of this loss function in our weight update rule to arrive at our weight vectors.
The SGD classifier class from scikit-learn provides us with a variety of loss functions. However, in this recipe, we will see log loss, which will give us logistic regression.
Logistic regression fits a linear model to data of the following form:
We have given a generalized notation. The intercept is assumed to be the first dimension of our weight vector. For a binary classification problem, the logistic function (the inverse of the logit) is applied to the linear output to get a prediction, as follows:
The preceding function is also called the sigmoid function. For very large positive inputs, it returns a value close to one; for very large negative inputs, it returns a value close to zero. With this, we can define our log loss function as follows:
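In standard notation, and assuming the conventions described above (the intercept folded into the weight vector w, and y_i taking values in {0, 1}), the linear model, the sigmoid, and the log loss are usually written as shown here. The symbols z_i, p_i, and n are introduced for illustration and may differ from the notation used elsewhere in this book.

```latex
\begin{align}
z_i &= w^{T} x_i \\
p_i &= \sigma(z_i) = \frac{1}{1 + e^{-z_i}} \\
L(w) &= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
\end{align}
```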
With the preceding loss function fitted into the weight update rule of the gradient descent, we can arrive at the appropriate weight vectors.
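To make the weight update concrete, here is a minimal NumPy sketch of the sigmoid, the log loss, and a single stochastic update. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names (sigmoid, log_loss, sgd_step), the toy data, and the learning rate of 0.1 are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(w, x, y):
    """Mean log loss for labels y in {0, 1}; x already holds the intercept column."""
    p = sigmoid(x @ w)
    eps = 1e-12  # keeps log() away from zero
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))


def sgd_step(w, x_i, y_i, eta):
    """One stochastic update: the log loss gradient for one record is (p - y) * x."""
    p_i = sigmoid(x_i @ w)
    return w - eta * (p_i - y_i) * x_i


# Toy data (hypothetical): the first column of ones plays the role of the intercept
rng = np.random.default_rng(0)
x = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = (x[:, 1] + x[:, 2] > 0).astype(float)

w = np.zeros(x.shape[1])
loss_before = log_loss(w, x, y)
for epoch in range(20):
    for i in rng.permutation(len(y)):  # shuffle the records on every pass
        w = sgd_step(w, x[i], y[i], eta=0.1)
loss_after = log_loss(w, x, y)

print("log loss before = %0.3f, after = %0.3f" % (loss_before, loss_after))
```

Each update moves the weights against the gradient of the loss for one record, which is why the loss on this toy data drops after a few passes.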
For the log loss function defined in scikit-learn, refer to the following URL:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html.
With this knowledge, let's jump into our recipe for stochastic gradient descent-based classification.
We will leverage scikit-learn's implementation of the stochastic gradient descent classifier. As we did in some of the previous recipes, we will use the make_classification function from scikit-learn to generate data for our recipe in order to demonstrate the stochastic gradient descent classification.
Let's start with a very simple example demonstrating how to build a stochastic gradient descent classifier.
We will first load the required libraries. We will then write a function to generate the predictors and response variables:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
# train_test_split lives in sklearn.model_selection; older releases
# exposed it in sklearn.cross_validation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
import numpy as np


def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1 * no_features)
    informative_features = int(0.6 * no_features)
    repeated_features = int(0.1 * no_features)
    x, y = make_classification(n_samples=1000, n_features=no_features, flip_y=0.03,
                               n_informative=informative_features,
                               n_redundant=redundant_features,
                               n_repeated=repeated_features, random_state=7)
    return x, y
We will proceed to write functions that will help us build and validate our model:
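def build_model(x, y, x_dev, y_dev):
    # max_iter, loss="log_loss" and penalty=None are the current names for what
    # older scikit-learn releases called n_iter, loss="log" and penalty="none";
    # tol=None runs all 50 passes, as n_iter did
    estimator = SGDClassifier(max_iter=50, tol=None, shuffle=True, loss="log_loss",
                              learning_rate="constant", eta0=0.0001,
                              fit_intercept=True, penalty=None)
    estimator.fit(x, y)

    train_predicted = estimator.predict(x)
    train_score = accuracy_score(y, train_predicted)

    dev_predicted = estimator.predict(x_dev)
    dev_score = accuracy_score(y_dev, dev_predicted)

    print()
    print("Training Accuracy = %0.2f Dev Accuracy = %0.2f" % (train_score, dev_score))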
Finally, we will write our main function to invoke all the preceding functions:
if __name__ == "__main__":
    x, y = get_data()

    # Divide the data into Train, dev and test
    x_train, x_test_all, y_train, y_test_all = train_test_split(x, y, test_size=0.3, random_state=9)
    x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all, test_size=0.3, random_state=9)

    build_model(x_train, y_train, x_dev, y_dev)
Let's start with our main function. We will invoke get_data to get our x predictor attributes and y response attributes. In get_data, we will leverage the make_classification function in order to generate our training data for stochastic gradient descent classification:
def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variable x, dependent variable y
    """
    no_features = 30
    redundant_features = int(0.1 * no_features)
    informative_features = int(0.6 * no_features)
    repeated_features = int(0.1 * no_features)
    x, y = make_classification(n_samples=500, n_features=no_features, flip_y=0.03,
                               n_informative=informative_features,
                               n_redundant=redundant_features,
                               n_repeated=repeated_features, random_state=7)
    return x, y
Let's look at the parameters passed to the make_classification function. The first parameter is the number of instances required. In this case, we need 500 instances. The second parameter is the number of attributes required per instance. We say that we need 30. The third parameter, flip_y, assigns a random class label to 3 percent of the instances. This is done to introduce noise in our data. The next parameter is about how many out of those 30 features should be informative enough to be used in our classification. We specified that 60 percent of our features, that is, 18 out of 30, should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, the repeated features are duplicate features that are drawn randomly from both the informative and redundant features.
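A quick way to check what these parameters produce is to inspect the generated arrays. The following sketch calls make_classification with the same values written out as literals (18 informative, 3 redundant, and 3 repeated features out of 30) and prints the shape of the predictors and the labels in the response.

```python
import numpy as np
from sklearn.datasets import make_classification

x, y = make_classification(n_samples=500, n_features=30, flip_y=0.03,
                           n_informative=18, n_redundant=3,
                           n_repeated=3, random_state=7)

print(x.shape)         # one row per instance, one column per feature
print(np.unique(y))    # the class labels present in the response
print(np.bincount(y))  # number of instances in each class
```

The features not accounted for by the informative, redundant, and repeated counts are filled with random noise.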
Let's split the data into training and testing sets using train_test_split. We will reserve 30 percent of our data to test:
# Divide the data into Train, dev and test
x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9)
Once again, we will leverage train_test_split to split our test data into dev and test sets:
x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)
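To see how many instances end up in each set, the two splits can be run on placeholder data. This is a small sketch: the arrays below are hypothetical stand-ins with 500 instances, and the import path assumes a scikit-learn release that provides sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (hypothetical) with 500 instances
x = np.arange(1000).reshape(500, 2)
y = np.arange(500) % 2

x_train, x_test_all, y_train, y_test_all = train_test_split(x, y, test_size=0.3, random_state=9)
x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all, test_size=0.3, random_state=9)

print("train = %d, dev = %d, test = %d" % (len(x_train), len(x_dev), len(x_test)))
```

Note that test_size=0.3 in the second call means the dev set receives 70 percent of the held-out instances and the final test set receives the remaining 30 percent.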
With the data divided to build, evaluate, and test the model, we will proceed to build our models:
build_model(x_train,y_train,x_dev,y_dev)
In build_model, we will leverage scikit-learn's SGDClassifier class to build our stochastic gradient descent method:
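# max_iter, loss="log_loss" and penalty=None are the current names for what
# older scikit-learn releases called n_iter, loss="log" and penalty="none"
estimator = SGDClassifier(max_iter=50, tol=None, shuffle=True, loss="log_loss",
                          learning_rate="constant", eta0=0.0001,
                          fit_intercept=True, penalty=None)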
Let's look at the parameters that we used. The first parameter is the number of times we want to go through our dataset to update the weights. Here, we say that we want 50 iterations. As in the perceptron, after going through all the records once, we need to shuffle our input records when we start the next iteration. The shuffle parameter is used for this. The default value of shuffle is True; we have included it here for explanation purposes. Our loss function is log loss: we want to do a logistic regression, and we specify this using the loss parameter. Our learning rate, eta, is a constant, which we specify with the learning_rate parameter. We provide the value for our learning rate using the eta0 parameter. We then say that we need to fit the intercept, as we have not centered our data by its mean. Finally, the penalty parameter controls the type of shrinkage required. In our case, we say that we don't need any shrinkage by setting the penalty to none.
We will proceed to build our model by invoking the fit function with our predictor and response variable, and evaluate our model with our training and dev dataset:
estimator.fit(x, y)

train_predicted = estimator.predict(x)
train_score = accuracy_score(y, train_predicted)

dev_predicted = estimator.predict(x_dev)
dev_score = accuracy_score(y_dev, dev_predicted)

print()
print("Training Accuracy = %0.2f Dev Accuracy = %0.2f" % (train_score, dev_score))
Let's look at our accuracy scores:
Regularization (L1, L2, or elastic net) can be applied to SGD classification. The procedure is the same as that of regression, and hence, we will not repeat it here. Refer to the previous recipe for this.
The learning rate, eta, was constant in our example. This need not be the case. With every iteration, the eta value can be reduced. The learning rate parameter, learning_rate, can be set to the string optimal or invscaling. Refer to the following scikit-learn documentation:
http://scikit-learn.org/stable/modules/sgd.html.
The parameter is specified as follows:
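# max_iter, loss="log_loss" and penalty=None are the current names for what
# older scikit-learn releases called n_iter, loss="log" and penalty="none"
estimator = SGDClassifier(max_iter=50, tol=None, shuffle=True, loss="log_loss",
                          learning_rate="invscaling", eta0=0.001,
                          fit_intercept=True, penalty=None)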
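The scikit-learn documentation describes the invscaling schedule as eta = eta0 / pow(t, power_t), where t counts the weight updates made so far. The following sketch tabulates that formula for a few steps; the helper name invscaling_eta is hypothetical, and the exponent of 0.5 is an assumption, so check the power_t default of your installed release.

```python
def invscaling_eta(eta0, t, power_t=0.5):
    """Learning rate at update step t (t >= 1) under an inverse scaling schedule."""
    return eta0 / (t ** power_t)


for t in (1, 10, 100, 1000):
    print("t = %4d  eta = %0.6f" % (t, invscaling_eta(0.001, t)))
```

The rate starts at eta0 and shrinks as more updates are made, so early records move the weights more than later ones.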
We used the fit method to build our model. As mentioned previously, in large-scale machine learning, not all of the data will be available to us at once. When we receive the data in batches, we need to use the partial_fit method instead of fit. Using the fit method will reinitialize the weights, and we will lose all the training information from the previous batch of data. Refer to the following link for more information on partial_fit: