Fitting the model

We can use either SGD or a second-order method to fit the logistic regression model to our data. Let us compare the results, starting with SGD; we fit the model using the following command:

>>> log_model_sgd = linear_model.SGDClassifier(alpha=10, loss='log', penalty='l2', n_iter=1000, fit_intercept=False).fit(census_features_train, census_income_train)

Here, the value log for the loss parameter specifies that we are training a logistic regression, n_iter specifies the number of times we iterate over the training data to perform SGD, and alpha represents the weight on the regularization term. We also specify that we do not want to fit the intercept, which makes comparison with other methods more straightforward (since the method of fitting the intercept could differ between optimizers). The penalty argument specifies the regularization penalty, which we already saw in Chapter 4, Connecting the Dots with Models – Regression Methods, for ridge regression. As l2 is the only penalty we can use with second-order methods, we choose l2 here as well to allow comparison between the methods. We can examine the resulting model coefficients by referencing the coef_ property of the model object:

>>> log_model_sgd.coef_
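
Since there is one weight per feature column, a quick sanity check is to look at the shape of the coefficient array and at which columns receive the largest weights. The following is a minimal sketch (it is not part of the original listing, and assumes NumPy is available as np):

>>> import numpy as np
>>> log_model_sgd.coef_.shape   # (1, number of feature columns) for a binary classifier
>>> # indices of the five largest coefficients by absolute value
>>> np.argsort(np.abs(log_model_sgd.coef_.ravel()))[::-1][:5]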

Compare these coefficients to the second-order fit we obtain using the following command:

>>> log_model_newton = linear_model.LogisticRegression(penalty='l2', solver='lbfgs', fit_intercept=False).fit(census_features_train, census_income_train)

As with the SGD model, we remove the intercept fit to allow the most direct comparison of the coefficients produced by the two methods. We find that the coefficients are not identical, with the output of the SGD model containing several larger coefficients. Thus, we see in practice that even with similar models and a convex objective function, different optimization methods can give different parameter results. However, we can see that the results are highly correlated based on a pairwise scatterplot of the coefficients:

>>> plt.scatter(log_model_newton.coef_,log_model_sgd.coef_)
>>> plt.xlim(-0.08,0.08)
>>> plt.ylim(-0.08,0.08)
>>> plt.xlabel('Newton Coefficient')
>>> plt.ylabel('SGD Coefficient')
[Scatterplot comparing the Newton and SGD coefficients]
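
Beyond the visual check, the agreement between the two solutions can be quantified directly. The following is a minimal sketch (not in the original listing, and assuming NumPy is available as np) that computes the Pearson correlation between the two flattened coefficient vectors; a value close to 1 indicates the two fits are closely related even though they are not identical:

>>> import numpy as np
>>> # flatten the (1, n_features) coefficient arrays and correlate them
>>> np.corrcoef(log_model_newton.coef_.ravel(), log_model_sgd.coef_.ravel())[0, 1]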

The fact that the SGD model has larger coefficients gives us a hint as to what might be causing the difference: perhaps SGD is more sensitive to differences in scale between the features? Let us evaluate this hypothesis by using the StandardScaler, introduced in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, in the context of K-means clustering, to normalize the features before running the SGD model, using the following commands:

>>> from sklearn.preprocessing import StandardScaler
>>> census_features_train_sc = StandardScaler().fit_transform(X=census_features_train.todense())

Recall that we need to convert the feature matrix to a dense format, since StandardScaler does not accept a sparse matrix as input. Now, if we retrain the SGD model using the same arguments on the scaled features and plot the resulting coefficients against those of the Newton method, we find they are much closer:
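
As a sketch of that retraining step (the variable name log_model_sgd_sc is ours, not from the original listing), the commands might look like the following:

>>> # refit SGD with the same hyperparameters, now on the standardized features
>>> log_model_sgd_sc = linear_model.SGDClassifier(alpha=10, loss='log', penalty='l2', n_iter=1000, fit_intercept=False).fit(census_features_train_sc, census_income_train)
>>> plt.scatter(log_model_newton.coef_, log_model_sgd_sc.coef_)
>>> plt.xlabel('Newton Coefficient')
>>> plt.ylabel('SGD Coefficient (scaled features)')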

[Scatterplot comparing the Newton and SGD coefficients after feature scaling]

This example should underscore the fact that the optimizer is sometimes as important as the actual algorithm, and may determine what data normalization steps we need to take.
