Logistic regression

We applied our least squares model to solve a minimization problem. We can also use a variation of this idea to solve classification problems. Consider what happens when we apply linear regression to a classification problem. Let's take the simple case of binary classification with one feature. We can plot our feature on the x axis against the class labels on the y axis. Our feature variable is continuous, but our target variable on the y axis is discrete. For binary classification, we usually represent the negative class with a 0 and the positive class with a 1. We construct a regression line through the data and use a threshold on the y axis to estimate the decision boundary. Here we use a threshold of 0.5.

(Figure: linear regression applied to a binary classification problem with a decision threshold of 0.5. Left: well-separated classes; right: the same data with an outlier.)

In the figure on the left-hand side, where the variance is small and our positive and negative cases are well separated, we get an acceptable result: the algorithm correctly classifies the training set. In the figure on the right-hand side, we have a single outlier in the data. This makes our regression line flatter and shifts our cutoff to the right. The outlier, which clearly belongs to class 1, should not make any difference to the model's predictions; however, with the same cutoff point, the model now misclassifies the first instance of class 1 as class 0.
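To make this failure mode concrete, here is a minimal sketch (my own code with made-up data, not an example from this book) that fits an ordinary least squares line to one-dimensional binary labels and classifies by thresholding the fitted value at 0.5:

    import numpy as np

    def ols_cutoff(x, y, threshold=0.5):
        # Fit y = w1*x + w0 by ordinary least squares, then solve
        # w1*x + w0 = threshold for the x value of the decision cutoff.
        w1, w0 = np.polyfit(x, y, deg=1)
        return (threshold - w0) / w1

    # Well-separated classes: the cutoff falls between the two groups.
    x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
    y = np.array([0, 0, 0, 1, 1, 1])
    print(ols_cutoff(x, y))            # roughly 4.5

    # One extreme positive outlier flattens the line and pushes the cutoff
    # to the right of x = 6, so the first positive sample is misclassified.
    x_out = np.append(x, 100.0)
    y_out = np.append(y, 1)
    print(ols_cutoff(x_out, y_out))    # roughly 6.5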

One way to approach this problem is to formulate a different hypothesis representation. For logistic regression, we are going to use the linear function as an input to another function, g.

h_w(x) = g(w^T x)
g(z) = 1 / (1 + e^{-z})

The term g is called the sigmoid, or logistic, function. You will notice from its graph that, on the y axis, it has asymptotes at zero and one, and that it crosses the y axis at 0.5.

(Figure: the sigmoid (logistic) function, with horizontal asymptotes at 0 and 1, crossing the y axis at 0.5.)
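As a quick illustration (a sketch of my own, not code from the book), the sigmoid is straightforward to implement with NumPy, and evaluating it at a few points confirms the properties just described:

    import numpy as np

    def sigmoid(z):
        # The logistic function g(z) = 1 / (1 + e^{-z}).
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0))                      # 0.5, where the curve crosses the y axis
    print(sigmoid(np.array([-10.0, 10.0])))  # close to the asymptotes at 0 and 1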

Now, if we replace z with w^T x, we can rewrite our hypothesis function like this:

h_w(x) = 1 / (1 + e^{-w^T x})

As with linear regression, we need to fit the parameters, w, to our training data to give us a function that can make predictions. Before we try to fit the model, let's look at how we can interpret the output of our hypothesis function. Since it returns a number between zero and one, the most natural interpretation is as the probability of the positive class. Since we know, or assume, that each sample can belong to only one of the two classes, the probability of the positive class plus the probability of the negative class must equal one. Therefore, if we can estimate the probability of the positive class, we can also estimate the probability of the negative class. Since we are ultimately trying to predict the class of a particular sample, we interpret the output of the hypothesis function as positive if it returns a value greater than or equal to 0.5, and as negative otherwise. Now, given the characteristics of the sigmoid function, we can write the following:

h_w(x) = g(w^T x) ≥ 0.5 whenever w^T x ≥ 0
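Here is a small sketch of the hypothesis function and this decision rule (my own code; it assumes the bias term w_0 is handled by prepending a constant 1 to every input vector x):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def hypothesis(w, x):
        # h_w(x) = g(w^T x), interpreted as the probability of the positive class.
        return sigmoid(np.dot(w, x))

    def predict(w, x):
        # Predict 1 when h_w(x) >= 0.5, which happens exactly when w^T x >= 0.
        return 1 if np.dot(w, x) >= 0 else 0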

In other words, whenever w^T x, evaluated on a particular training sample, is greater than or equal to zero, we can predict the positive class. Let's look at a simple example. We have not yet fitted the parameters of this model, and we will do so shortly, but for the sake of this example, let's assume that we have a parameter vector as follows:

w = (w_0, w_1, w_2)^T = (-3, 1, 1)^T

Our hypothesis function, therefore, looks like this:

h_w(x) = g(-3 + x_1 + x_2)

We can predict y = 1 if the following condition is met:

-3 + x_1 + x_2 ≥ 0

Equivalently:

x_1 + x_2 ≥ 3

This can be sketched with the following graph:

(Figure: the decision boundary x_1 + x_2 = 3, a straight line crossing each axis at 3, dividing the plane into a region where we predict y = 1 and a region where we predict y = 0.)
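As a quick numerical check of this example (using the parameter vector assumed above), a point such as (2, 2), which satisfies x_1 + x_2 ≥ 3, is predicted as class 1, while (1, 1) is predicted as class 0:

    import numpy as np

    w = np.array([-3.0, 1.0, 1.0])    # (w_0, w_1, w_2) from the example above

    def predict(w, x1, x2):
        # Predict y = 1 when w_0 + w_1*x_1 + w_2*x_2 >= 0, i.e. when x_1 + x_2 >= 3.
        return 1 if np.dot(w, [1.0, x1, x2]) >= 0 else 0

    print(predict(w, 2, 2))    # 1: 2 + 2 >= 3
    print(predict(w, 1, 1))    # 0: 1 + 1 < 3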

The decision boundary is simply a straight line that crosses the x_1 axis at 3 and the x_2 axis at 3. It creates two regions, in which we predict either y = 0 or y = 1. What happens when the decision boundary is not a straight line? In the same way that we added polynomial terms to the hypothesis function in linear regression, we can also do this with logistic regression. Let's write a new hypothesis function with some higher-order terms to see how we can fit it to the data:

h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2)

Here we have added two squared terms to our function. We will see how to fit the parameters shortly, but for now, let's set our parameter vector to the following:

w = (w_0, w_1, w_2, w_3, w_4)^T = (-1, 0, 0, 1, 1)^T

So, we can now write the following:

Predict y = 1 if -1 + x_1^2 + x_2^2 ≥ 0

Or alternatively, we can write this:

x_1^2 + x_2^2 ≥ 1

This, you may recognize, is the equation for a circle centered around the origin, and we can use this as our decision boundary. We can create more complex decision boundaries by adding higher order polynomial terms.
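Here is a short sketch of such a nonlinear boundary (my own code, using the parameter values assumed above): the squared terms are simply added as extra features, and the same linear decision rule then carves out a circular region:

    import numpy as np

    w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])    # boundary x_1^2 + x_2^2 = 1

    def predict_circle(w, x1, x2):
        # Feature vector with the two squared terms: (1, x_1, x_2, x_1^2, x_2^2).
        features = np.array([1.0, x1, x2, x1**2, x2**2])
        return 1 if np.dot(w, features) >= 0 else 0

    print(predict_circle(w, 0.5, 0.5))    # 0: inside the circle
    print(predict_circle(w, 1.0, 1.0))    # 1: outside the circle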

The Cost function for logistic regression

Now, we need to look at the important task of fitting the parameters to the data. If we rewrite the cost function we used for linear regression more simply, we can see that the cost is one half of the squared error:

Cost(h_w(x), y) = (1/2) (h_w(x) - y)^2

The interpretation is that it simply calculates the cost we want the model to incur for a given prediction, h_w(x), and a training label, y.

This will work to a certain extent with logistic regression; however, there is a problem. With logistic regression, our hypothesis function depends on the nonlinear sigmoid function, and when we plot the resulting cost against our parameters, it will usually produce a function that is not convex. This means that, when we try to apply an algorithm such as gradient descent to the cost function, it will not necessarily converge to the global minimum. A solution is to define a cost function that is convex, and it turns out that the following two functions, one for each class, are suitable for our purposes:

Cost(h_w(x), y) = -log(h_w(x))       if y = 1
Cost(h_w(x), y) = -log(1 - h_w(x))   if y = 0

This gives us the following graphs:

(Figure: the cost -log(h_w(x)) for y = 1 on the left, and -log(1 - h_w(x)) for y = 0 on the right.)

Intuitively, we can see that this does what we need it to do. Consider a single training sample in the positive class, that is, y = 1. If our hypothesis function, h_w(x), correctly predicts 1, then the cost, as you would expect, is 0. As the output of the hypothesis function approaches 0, which is completely incorrect, the cost approaches infinity. When y is in the negative class, our cost function is the graph on the right. Here, the cost is zero when h_w(x) is 0 and rises to infinity as h_w(x) approaches 1. We can write this in a more compact way, remembering that y is either 0 or 1:

Cost(h_w(x), y) = -y log(h_w(x)) - (1 - y) log(1 - h_w(x))

We can see that, for each of the possibilities, y=1 or y=0, the irrelevant term is multiplied by 0, leaving the correct term for each particular case. So, now we can write our cost function as follows:

J(w) = -(1/m) Σ_{i=1}^{m} [y^{(i)} log(h_w(x^{(i)})) + (1 - y^{(i)}) log(1 - h_w(x^{(i)}))]
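The cost is easy to express with NumPy. The following is a sketch of my own (not the book's code), assuming a design matrix X whose first column is all ones for the bias term and a vector y of 0/1 labels:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(w, X, y):
        # J(w) = -(1/m) * sum of y*log(h) + (1 - y)*log(1 - h) over the samples.
        m = len(y)
        h = sigmoid(X @ w)    # h_w(x) for every training sample
        return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))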

So, if we are given a new, unlabeled value of x, how do we make a prediction? To do this, we first need to learn the parameters from the training data, and, as with linear regression, our aim is to minimize the cost function, J(w). We can use the same update rule that we used for linear regression, that is, using the partial derivative to find the slope, and when we write out the derivative, we get the following:

w_j := w_j - (α/m) Σ_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)}) x_j^{(i)}
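Putting the pieces together, here is a minimal batch gradient descent sketch (my own code, with a made-up data set; the learning rate and number of iterations are arbitrary choices for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent(X, y, alpha=0.1, n_iter=5000):
        # X: (m, n) design matrix with a first column of ones for the bias term.
        # y: length-m vector of 0/1 labels.
        m, n = X.shape
        w = np.zeros(n)
        for _ in range(n_iter):
            h = sigmoid(X @ w)                  # h_w(x) for every sample
            gradient = (X.T @ (h - y)) / m      # partial derivatives of J(w)
            w -= alpha * gradient               # update every w_j simultaneously
        return w

    # One feature; the classes are separated around x = 3.
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    w = gradient_descent(X, y)
    print(sigmoid(X @ w))    # below 0.5 for the first two samples, above 0.5 for the last two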