Linear regression limitations for categorical variables

Let's try to face a problem involving categorical variables with the only model you know—linear regression.

Let's imagine, for instance that you are dealing with a dataset showing; for a group of people, some explanatory variables sided with the obtained level of education articulated as high school, university, and doctor of philosophy:

Degree Family income Parents' highest degree
High school 93922 University
High school 63019 High school
University 51787 University
University 31954 High school
University 42681 High school
High school 50378 Doctor of philosophy
Doctor of philosophy 66107 University

 

How would you apply linear regression to this model? You would probably move the degree of education to a numerical scale. This is reasonable, but how to do that? You would probably resort to a scale from 1 to 3, having 1 as high school and 3 as doctor of philosophy:

n degree family_income parents_highest_degree
1 High school 93922 University
1 High school 63019 High school
2 University 51787 University
2 University 31954 High school
2 University 42681 High school
1 High school 50378 Doctor of philosophy
3 Doctor of philosophy 66107 University

 

Now, let's try to create a similar dataset in R and fit our regression model to this data:

education <- data.frame(n = c(1,1,2,2,2,1,3),
degree = c("high school","high school", "university",
"university", "university","high school","doctor of philosophy"),
family_income = c(93922,63019,51787,31954,42681,50378,66107),
parents_highest_degree = c("university","high school", "university" , "high school" , "high school" , "doctor of philosophy", "university"))

lm_model <- lm(n~ family_income + parents_highest_degree, data = education)

This actually results in a linear regression model. Even if you call summary() on it, you find out that it shows not astonishing performances, with an adjusted R-squared (you do remember why we should look at the adjusted one, don't you?) roughly equal to 30%.

But this is not the only problem with this model.

Imagine now that a new record is made available, regarding a student with a family income of 140,000 and university as the highest degree among the student's parents. If you try to predict the expected degree of education for that student, you will get a surprising -0.2029431. We can easily do this in R using the predict function:

predict(lm_model,newdata = data.frame(family_income = 140000, parents_highest_degree = "university"))

1
-0.2029431

What degree of education is that? This is the first kind of problem when trying to employ linear regression with categorical problems. But we also have another problem that our first set of data didn't show. Let's look at this dataset:

preferred_movie_type annual_income occupation
Action 66322 Doctor
Action 43873 Student
Thriller 2000 Student
Comedy 20360 Musician
Thriller 0 Housewife

 

Let's say we want to predict the preferred movie type based on the annual income and main occupation, employing our beloved linear regression. How would you proceed? How would you assign a numerical value to preferred_movie_type? Every possible choice would involve a high level of arbitrary decisions. There is no clear ranking within movie types, and introducing them just for the need of fitting our model would imply heavily manipulating our data and its original structure, with serious implications on the level of significance of our model's results.

You now have at least two reasons for which it is not advisable to employ linear regression for classification problems, and why we need specific models to perform classification tasks. One really small note before moving on—this holds true also for non-linear regression models, which I am not going to show you, but basically involve a non-linear regression between the response variable and explanatory variables. 

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset