Classification and logistic regression

In the previous section, we learned how to predict continuous quantities (for example, the impact of TV advertising on company sales) as linear functions of input values (for example, TV, radio, and newspaper advertisements). For other tasks, however, the output is not a continuous quantity. For example, predicting whether someone is diseased or not is a classification problem, and we need a different learning algorithm to handle it. In this section, we are going to dig deeper into the mathematical analysis of logistic regression, which is a learning algorithm for classification tasks.

In linear regression, we tried to predict the value of the output variable y(i) for the ith sample x(i) in the dataset using the linear model function y = hθ(x) = θᵀx. This is not a great solution for classification tasks such as predicting binary labels (y(i) ∈ {0, 1}).

Logistic regression is one of the many learning algorithms that we can use for classification tasks. It uses a different hypothesis class, with which we try to predict the probability that a specific sample belongs to the one class and the probability that it belongs to the zero class. So, in logistic regression, we will try to learn the following functions:

P(y = 1|x) = hθ(x) = 1 / (1 + e^(−θᵀx))
P(y = 0|x) = 1 − P(y = 1|x) = 1 − hθ(x)

This function is often called the sigmoid or logistic function; it squashes the value of θᵀx into the fixed range [0, 1], as shown in the following graph. Because the value is squashed into [0, 1], we can interpret hθ(x) as a probability.

Our goal is to search for a value of the parameters θ so that the probability P(y = 1|x) = hθ(x) is large when the input sample x belongs to the one class and small when x belongs to the zero class:

Figure 6: Shape of the sigmoid function
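As a quick illustration (a minimal NumPy sketch, not code from this book's examples), the following sigmoid helper shows how values of θᵀx far from zero are pushed toward 0 or 1:

import numpy as np

def sigmoid(z):
    # Squash a real-valued input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Values of theta^T x far below 0 map close to 0, far above 0 map close to 1.
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(z))  # approximately [0.0025, 0.119, 0.5, 0.881, 0.9975]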

So, suppose we have a set of training samples with their corresponding binary labels {(x(i), y(i)): i = 1,...,m}. We will need to minimize the following cost function, which measures how well a given hθ does:

J(θ) = −Σ_{i=1..m} ( y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) )

Note that only one of the two terms in the equation's summation is non-zero for each training sample (depending on whether the value of the label y(i) is 0 or 1). When y(i) = 1, minimizing the cost function means we need to make hθ(x(i)) large, and when y(i) = 0, we want to make 1 − hθ(x(i)) large.
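Here is a minimal sketch of this cost function in NumPy; the names theta, X, and y are hypothetical placeholders for the parameter vector, the sample matrix, and the label vector, not variables from the book's code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta): cross-entropy summed over m samples with binary labels y in {0, 1}.
    h = sigmoid(X @ theta)  # h_theta(x(i)) = P(y = 1 | x(i)) for every sample
    # For each sample, only one of the two terms is non-zero, depending on y(i).
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))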

Now, we have a cost function that measures how well a given hypothesis hθ fits our training samples. We can learn to classify our training samples by using an optimization technique to minimize J(θ) and find the best choice of parameters θ. Once we have done this, we can use these parameters to classify a new test sample as 1 or 0 by checking which of the two class labels is more probable. If P(y = 1|x) < P(y = 0|x), we output 0; otherwise we output 1. This is the same as defining a threshold of 0.5 between our classes and checking whether hθ(x) > 0.5.
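A small sketch of this decision rule, again with hypothetical theta and X:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    # Output 1 where h_theta(x) = P(y = 1 | x) exceeds 0.5, otherwise 0.
    return (sigmoid(X @ theta) > threshold).astype(int)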

To minimize the cost function J(θ), we can use an optimization technique that finds the value of θ that minimizes it. A calculus tool called the gradient points in the direction of the greatest rate of increase of the cost function, so we can take the opposite direction to move toward the minimum of the function. The gradient of J(θ) is denoted by ∇θJ(θ), which means taking the gradient of the cost function with respect to the model parameters. Thus, we need to provide a function that computes J(θ) and ∇θJ(θ) for any requested choice of θ. If we derive the gradient, or derivative, of the cost function J(θ) with respect to θj, we get the following result:

∂J(θ)/∂θj = Σ_{i=1..m} x_j(i) (hθ(x(i)) − y(i))

This can be written in vector form as:

∇θJ(θ) = Σ_{i=1..m} x(i) (hθ(x(i)) − y(i))
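Putting the gradient and the update direction together, a plain gradient-descent sketch (with made-up toy data, not the optimizer used later in this book) could look like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    # Vector form: sum over samples of (h_theta(x(i)) - y(i)) * x(i).
    return X.T @ (sigmoid(X @ theta) - y)

# Toy data: 4 samples with 2 features, and a plain gradient-descent loop.
X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5], [4.0, 1.5]])
y = np.array([0, 0, 1, 1])
theta = np.zeros(2)
learning_rate = 0.1
for _ in range(1000):
    # Step against the gradient, the direction of steepest increase of J(theta).
    theta -= learning_rate * gradient(theta, X, y)
print(theta)  # parameters after descending the gradient of J(theta)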

Now that we have a mathematical understanding of logistic regression, let's go ahead and use this new learning method to solve a classification task.
