Let's try to face a problem involving categorical variables with the only model you know so far: linear regression.
Imagine, for instance, that you are dealing with a dataset showing, for a group of people, some explanatory variables alongside the obtained level of education, articulated as high school, university, and doctor of philosophy:
| Degree | Family income | Parents' highest degree |
|---|---|---|
| High school | 93922 | University |
| High school | 63019 | High school |
| University | 51787 | University |
| University | 31954 | High school |
| University | 42681 | High school |
| High school | 50378 | Doctor of philosophy |
| Doctor of philosophy | 66107 | University |
How would you apply linear regression to this data? You would probably map the degree of education onto a numerical scale. This is reasonable, but how should you do that? You would probably resort to a scale from 1 to 3, with 1 as high school and 3 as doctor of philosophy:
| n | degree | family_income | parents_highest_degree |
|---|---|---|---|
| 1 | High school | 93922 | University |
| 1 | High school | 63019 | High school |
| 2 | University | 51787 | University |
| 2 | University | 31954 | High school |
| 2 | University | 42681 | High school |
| 1 | High school | 50378 | Doctor of philosophy |
| 3 | Doctor of philosophy | 66107 | University |
Now, let's try to create a similar dataset in R and fit our regression model to this data:
education <- data.frame(
  n = c(1, 1, 2, 2, 2, 1, 3),
  degree = c("high school", "high school", "university", "university",
             "university", "high school", "doctor of philosophy"),
  family_income = c(93922, 63019, 51787, 31954, 42681, 50378, 66107),
  parents_highest_degree = c("university", "high school", "university",
                             "high school", "high school",
                             "doctor of philosophy", "university"))

lm_model <- lm(n ~ family_income + parents_highest_degree, data = education)
This actually results in a linear regression model. But if you call summary() on it, you find that its performance is not astonishing, with an adjusted R-squared (you do remember why we should look at the adjusted one, don't you?) roughly equal to 30%.
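If you want that number on its own rather than buried in the full summary printout, you can extract it directly from the summary object (a small sketch, assuming lm_model has been fitted as above):

```r
# Pull the adjusted R-squared out of the model summary.
# On the seven records above it is roughly 0.30, as noted in the text.
adj_r2 <- summary(lm_model)$adj.r.squared
adj_r2
```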
But this is not the only problem with this model.
Imagine now that a new record becomes available, for a student with a family income of 140,000 and university as the highest degree among the student's parents. If you try to predict the expected degree of education for that student, you get a surprising -0.2029431. We can easily do this in R using the predict() function:
predict(lm_model, newdata = data.frame(family_income = 140000, parents_highest_degree = "university"))
1
-0.2029431
What degree of education is that? This is the first kind of problem you run into when trying to employ linear regression with categorical variables. But there is another problem that our first dataset didn't show. Let's look at this dataset:
| preferred_movie_type | annual_income | occupation |
|---|---|---|
| Action | 66322 | Doctor |
| Action | 43873 | Student |
| Thriller | 2000 | Student |
| Comedy | 20360 | Musician |
| Thriller | 0 | Housewife |
Let's say we want to predict the preferred movie type based on the annual income and main occupation, employing our beloved linear regression. How would you proceed? How would you assign a numerical value to preferred_movie_type? Every possible choice would involve a high degree of arbitrariness. There is no clear ranking among movie types, and introducing one just for the sake of fitting our model would mean heavily manipulating our data and its original structure, with serious implications for the significance of our model's results.
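To see that arbitrariness concretely, here is a small sketch (the data frame simply reproduces the table above, and the two encodings are both made up, since no ordering is given) showing that two equally "reasonable" numeric encodings of preferred_movie_type lead to different fitted models:

```r
movies <- data.frame(
  preferred_movie_type = c("action", "action", "thriller", "comedy", "thriller"),
  annual_income = c(66322, 43873, 2000, 20360, 0),
  occupation = c("doctor", "student", "student", "musician", "housewife"),
  stringsAsFactors = FALSE)

# Two arbitrary numeric encodings of the same three categories
encoding_a <- c(action = 1, thriller = 2, comedy = 3)
encoding_b <- c(action = 2, thriller = 3, comedy = 1)

fit_a <- lm(encoding_a[movies$preferred_movie_type] ~ annual_income, data = movies)
fit_b <- lm(encoding_b[movies$preferred_movie_type] ~ annual_income, data = movies)

# The estimated effect of annual_income depends entirely on which
# encoding we happened to pick
coef(fit_a)["annual_income"]
coef(fit_b)["annual_income"]
```

Since the response is the same information either way, getting different slopes purely from our labeling choice shows how fragile the resulting model is.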
You now have at least two reasons why it is not advisable to employ linear regression for classification problems, and why we need specific models to perform classification tasks. One really small note before moving on: this also holds for non-linear regression models, which I am not going to show you, but which basically involve a non-linear relationship between the response variable and the explanatory variables.
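To convince yourself of that last point, here is a quick sketch (reusing the education data frame from above; the quadratic term is just one arbitrary non-linear choice) showing that a polynomial fit still produces an unconstrained real number, not one of our three degrees:

```r
# A quadratic regression on the same data: the model is still numeric,
# so predictions remain unbounded real values rather than categories.
poly_model <- lm(n ~ poly(family_income, 2), data = education)
prediction <- predict(poly_model, newdata = data.frame(family_income = 140000))
prediction
```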