Using logistic regression

Contrary to its name, logistic regression is a classification method, and is very powerful when it comes to text-based classification. It achieves this by first performing regression on a logistic function, hence the name.

A bit of math with a small example

To get an initial understanding of the way logistic regression works, let us first take a look at the following example, where we have artificial feature values on the x axis plotted against the corresponding class, either 0 or 1. As we can see, the data is so noisy that the classes overlap in the feature value range between 1 and 6. Therefore, it is better not to model the discrete classes directly, but rather the probability that a feature value belongs to class 1, P(X). Once we possess such a model, we could then predict class 1 if P(X) > 0.5, and class 0 otherwise:

[Figure: artificial feature values (x axis) plotted against their class labels 0 and 1; the classes overlap between feature values 1 and 6]
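The exact artificial data used for this figure is not included here, but a comparably noisy, one-dimensional dataset can be generated along the following lines. This is purely an illustrative sketch; the numbers printed later in this section come from the book's own data:

import numpy as np

rng = np.random.RandomState(3)

# One artificial feature: class 0 tends towards smaller values, class 1 towards
# larger ones, with enough noise that the classes overlap roughly between 1 and 6
num = 100
x0 = rng.normal(loc=2.5, scale=1.5, size=num)   # feature values of class 0
x1 = rng.normal(loc=4.5, scale=1.5, size=num)   # feature values of class 1

X = np.concatenate([x0, x1]).reshape(-1, 1)     # shape (200, 1) for scikit-learn
y = np.concatenate([np.zeros(num), np.ones(num)])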

Mathematically, it is always difficult to model something that has a finite range, as is the case here with our discrete labels 0 and 1. We can, however, tweak the probabilities a bit so that they always stay between 0 and 1. For this, we will need the odds ratio and its logarithm.

Let's say a feature value has a probability of 0.9 of belonging to class 1, that is, P(y=1) = 0.9. The odds ratio is then P(y=1)/P(y=0) = 0.9/0.1 = 9. We could say that the chance is 9:1 that this feature value maps to class 1. If P(y=1) = 0.5, we would consequently have a 1:1 chance that the instance is of class 1. The odds ratio is bounded below by 0 but grows to infinity (the left graph in the following screenshot). If we now take its logarithm, we can map all probabilities between 0 and 1 to the full range from negative to positive infinity (the right graph in the following screenshot). The best part is that we still maintain the relationship that a higher probability leads to a higher log of odds; it is just no longer confined to the range between 0 and 1:

[Figure: left, the odds ratio P(y=1)/P(y=0) over probabilities between 0 and 1; right, its logarithm, which maps those probabilities to the full range from negative to positive infinity]
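To make this concrete, the following small sketch (the probability values are arbitrary examples) computes the odds ratio and its logarithm for a few probabilities:

import numpy as np

p = np.array([0.1, 0.5, 0.9, 0.99])   # probabilities of belonging to class 1

odds = p / (1 - p)        # odds ratio: bounded below by 0, unbounded above
log_odds = np.log(odds)   # log odds: spans the whole real line, 0 at p = 0.5

for pi, oi, li in zip(p, odds, log_odds):
    print("P(y=1)=%.2f  odds=%6.2f  log(odds)=%6.2f" % (pi, oi, li))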

This means that we can now fit linear combinations of our features (OK, we have only one feature and a constant, but that will change soon) to the log-odds values, $\log\frac{p}{1-p}$. Let's start from the linear equation from Chapter 1, Getting Started with Python Machine Learning:

$y_i = c_0 + c_1 x_i$

In our case, we replace y with the log odds of p, which gives the following equation:

$\log\left(\frac{p_i}{1 - p_i}\right) = c_0 + c_1 x_i$

We can solve this equation for $p_i$, as shown in the following formula:

$p_i = \frac{1}{1 + e^{-(c_0 + c_1 x_i)}}$
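For completeness, the intermediate step from the log-odds equation to this formula is a short rearrangement:

$\frac{p_i}{1 - p_i} = e^{c_0 + c_1 x_i} \;\Rightarrow\; p_i = \frac{e^{c_0 + c_1 x_i}}{1 + e^{c_0 + c_1 x_i}} = \frac{1}{1 + e^{-(c_0 + c_1 x_i)}}$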

We simply have to find the right coefficients, such that the formula gives the lowest error over all our pairs (x_i, p_i) in the dataset; this is exactly what scikit-learn will do for us.

After fitting the data to the class labels, the formula gives the probability that every new data point, x, belongs to class 1. Refer to the following code:

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> clf = LogisticRegression()
>>> print(clf)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', tol=0.0001)
>>> # X holds the artificial feature values, y their 0/1 class labels
>>> clf.fit(X, y)
>>> print(np.exp(clf.intercept_), np.exp(clf.coef_.ravel()))
[ 0.09437188] [ 1.80094112]
>>> def lr_model(clf, X):
...     return 1 / (1 + np.exp(-(clf.intercept_ + clf.coef_ * X)))
...
>>> print("P(x=-1)=%.2f\tP(x=7)=%.2f" % (lr_model(clf, -1), lr_model(clf, 7)))
P(x=-1)=0.05    P(x=7)=0.85

You might have noticed that Scikit-learn exposes the first coefficient through the special field intercept_.
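As a quick cross-check (not part of the original listing), the probabilities produced by our manual lr_model should agree with what scikit-learn's own predict_proba returns for the fitted classifier; column 1 of its output is P(y=1):

# continuing the session above: clf is the fitted LogisticRegression instance
print(clf.predict_proba(np.array([[-1.0], [7.0]]))[:, 1])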

If we plot the fitted model, we see that it makes perfect sense given the data:

[Figure: the logistic regression model fitted to the artificial data, showing the S-shaped probability curve for class 1]
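A plot of this kind can be recreated from the previous listings (clf, X, y, and lr_model as defined above); this is just a sketch using matplotlib:

import matplotlib.pyplot as plt

xs = np.linspace(-1, 8, 200)                   # range of feature values to plot
plt.scatter(X.ravel(), y, alpha=0.3)           # the noisy artificial data points
plt.plot(xs, lr_model(clf, xs).ravel(), "k")   # fitted probability curve P(y=1)
plt.xlabel("feature value")
plt.ylabel("P(y=1)")
plt.show()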

Applying logistic regression to our post classification problem

Admittedly, the example in the previous section was created to show the beauty of logistic regression. How does it perform on the extremely noisy data?

Comparing it to the best nearest neighbor classifier (k = 90) as a baseline, we see that it performs a bit better, but it also won't change the situation a whole lot:

Method             mean(scores)   stddev(scores)
LogReg C=0.1       0.6310         0.02791
LogReg C=100.00    0.6300         0.03170
LogReg C=10.00     0.6300         0.03170
LogReg C=0.01      0.6295         0.02752
LogReg C=1.00      0.6290         0.03270
90NN               0.6280         0.02777

We have seen the accuracy for different values of the regularization parameter C. With it, we can control the model complexity, similar to the parameter k for the nearest neighbor method. Smaller values of C result in a stronger penalty on the model complexity, that is, they make the model simpler.
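The accuracy grid in the preceding table can be reproduced along these lines. This is only a sketch: the real X and y are the post features and labels built in the earlier sections, so a synthetic stand-in dataset is used here to keep the snippet self-contained:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the real post features and labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(C=C)
    scores = cross_val_score(clf, X, y, cv=10)
    print("LogReg C=%.2f\tmean=%.4f\tstddev=%.5f"
          % (C, scores.mean(), scores.std()))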

A quick look at the bias-variance chart for our best candidate, C = 0.1, shows that our model has high bias: the test and train error curves approach each other closely, but they remain at unacceptably high values. This indicates that logistic regression with the current feature space is under-fitting and cannot learn a model that captures the data correctly.

[Figure: bias-variance chart for logistic regression with C = 0.1; the train and test error curves converge, but at a high error level]
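A chart of this kind can be drawn with scikit-learn's learning_curve helper. Again, this is only a sketch with a synthetic stand-in for the real post data; it plots the train and test error against the training set size:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Stand-in data; the book's chart uses the real post features and labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(C=0.1), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10))

# Convert accuracies into errors so the plot reads like the bias-variance chart
train_err = 1.0 - train_scores.mean(axis=1)
test_err = 1.0 - test_scores.mean(axis=1)

plt.plot(train_sizes, train_err, label="train error")
plt.plot(train_sizes, test_err, label="test error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend()
plt.show()

If both curves flatten out close to each other at a high error, adding more data will not help; richer features or a different model are needed.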

So what now? We switched the model and tuned it as much as we could with our current state of knowledge, but we still have no acceptable classifier.

It seems more and more that either the data is too noisy for this task, or that our set of features is still not appropriate to discriminate between the classes well enough.
