Chapter 3. Logistic Regression and Discriminant Analysis

 

"The true logic of this world is the calculus of probabilities."

 
 --James Clerk Maxwell, Scottish physicist

In the previous chapter, we looked at using Ordinary Least Squares (OLS) to predict a quantitative outcome, in other words, linear regression. It is now time to shift gears and examine how we can develop algorithms to predict qualitative outcomes. Such outcomes could be binary (male versus female, purchases versus does not purchase, benign versus malignant tumor) or multinomial (education level or eye color). Regardless of whether the outcome of interest is binary or multinomial, the analyst's task is to predict the probability that an observation belongs to a particular category of the outcome variable. In other words, we develop an algorithm in order to classify the observations.

To begin exploring classification problems, we will discuss why applying OLS linear regression is not the correct technique and how the algorithms introduced in this chapter can solve these issues. We will then look at the problem of predicting whether a biopsied tumor mass is benign or malignant. The dataset is the well-known and widely available Wisconsin Breast Cancer Data. To tackle this problem, we will begin by building and interpreting logistic regression models. We will also begin examining methods to select both the features and the most appropriate model. Next, we will cover linear and quadratic discriminant analysis, comparing and contrasting them with logistic regression, and then build predictive models on the breast cancer data. Finally, we will wrap it all up by looking at ways to select the best overall algorithm for the question at hand. These methods (creating train/test datasets and cross-validation) will set the stage for the more advanced machine learning methods of the subsequent chapters.

Classification methods and linear regression

So, why can't we just use the least squares regression method that we learned in the previous chapter for a qualitative outcome? Well, as it turns out, you can, but at your own risk. Let's assume for a second that you have an outcome you are trying to predict with three different classes: mild, moderate, and severe. You and your colleagues also assume that the difference between mild and moderate and between moderate and severe is an equivalent measure and that the relationship is linear. You can then create a dummy variable where zero is equal to mild, one is equal to moderate, and two is equal to severe. If you have reason to believe this, then linear regression might be an acceptable solution. However, qualitative assessments such as these can lend themselves to a high level of measurement error that will bias the OLS estimates, and in most business problems there is no scientifically acceptable way to convert a qualitative response to a quantitative one.

What if you have a response with only two outcomes, say, fail and pass? Again, using the dummy variable approach, we could code the fail outcome as 0 and the pass outcome as 1. Using linear regression, we could then build a model where the predicted value is the probability that an observation passes or fails. However, the estimates of Y in such a model will most likely exceed the probability constraints of [0,1] and thus be a bit difficult to interpret.
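As a quick illustration of that [0,1] problem, consider the following sketch. It is written in Python with NumPy purely for demonstration (the data are simulated and every variable name here is hypothetical, not from the text): an OLS line fitted to a 0/1 pass/fail outcome produces predictions that leak outside the probability range, while a logistic fit of the same linear score stays bounded.

```python
import numpy as np

# Simulated data: one predictor x and a binary pass/fail outcome y.
rng = np.random.default_rng(42)
x = np.linspace(-4, 4, 200)
y = (x + rng.normal(0.0, 1.0, x.size) > 0).astype(float)

# OLS "linear probability model": y ~ b0 + b1 * x
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
ols_pred = X @ b

# At the extremes of x, the straight line escapes the [0, 1] range.
print(ols_pred.min() < 0 or ols_pred.max() > 1)   # True

# Logistic regression passes the same linear score through the sigmoid,
# fitted here with plain gradient descent on the log-loss.
w = np.zeros(2)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * (X.T @ (p - y)) / y.size
logit_pred = 1.0 / (1.0 + np.exp(-(X @ w)))

# The sigmoid guarantees predictions inside [0, 1].
print(logit_pred.min() >= 0.0 and logit_pred.max() <= 1.0)   # True
```

The point of the comparison is not the fitting method (any logistic solver would do) but the link function: the sigmoid maps any linear score into a valid probability, which is exactly what the OLS line cannot promise.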
