Let's try to face a problem involving categorical variables with the only model you know so far: linear regression.
Imagine, for instance, that you are dealing with a dataset showing, for a group of people, some explanatory variables alongside the obtained level of education, articulated as high school, university, and doctor of philosophy:
| Degree | Family income | Parents' highest degree |
|---|---|---|
| High school | 93922 | University |
| High school | 63019 | High school |
| University | 51787 | University |
| University | 31954 | High school |
| University | 42681 | High school |
| High school | 50378 | Doctor of philosophy |
| Doctor of philosophy | 66107 | University |
How would you apply linear regression to this data? You would probably map the degree of education onto a numerical scale. This is reasonable, but how should you do that? You would probably resort to a scale from 1 to 3, with 1 as high school and 3 as doctor of philosophy:
| n | degree | family_income | parents_highest_degree |
|---|---|---|---|
| 1 | High school | 93922 | University |
| 1 | High school | 63019 | High school |
| 2 | University | 51787 | University |
| 2 | University | 31954 | High school |
| 2 | University | 42681 | High school |
| 1 | High school | 50378 | Doctor of philosophy |
| 3 | Doctor of philosophy | 66107 | University |
Now, let's try to create a similar dataset in R and fit our regression model to this data:
education <- data.frame(
  n = c(1, 1, 2, 2, 2, 1, 3),
  degree = c("high school", "high school", "university", "university",
             "university", "high school", "doctor of philosophy"),
  family_income = c(93922, 63019, 51787, 31954, 42681, 50378, 66107),
  parents_highest_degree = c("university", "high school", "university",
                             "high school", "high school",
                             "doctor of philosophy", "university"))

lm_model <- lm(n ~ family_income + parents_highest_degree, data = education)
This actually results in a linear regression model. But if you call summary() on it, you find that its performance is not astonishing, with an adjusted R-squared (you do remember why we should look at the adjusted one, don't you?) roughly equal to 30%.
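If you want that number on its own rather than buried in the full summary printout, you can extract it directly from the summary object (a small sketch, assuming lm_model has been fitted as above):

```r
# Pull the adjusted R-squared out of the model summary.
# On the seven records above it is roughly 0.30, as noted in the text.
adj_r2 <- summary(lm_model)$adj.r.squared
adj_r2
```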
But this is not the only problem with this model.
Imagine now that a new record becomes available, for a student with a family income of 140,000 and university as the highest degree among the student's parents. If you try to predict the expected degree of education for that student, you get a surprising -0.2029431. We can easily do this in R using the predict() function:
predict(lm_model, newdata = data.frame(family_income = 140000, parents_highest_degree = "university"))
1
-0.2029431
What degree of education is that? This is the first kind of problem you run into when trying to employ linear regression with categorical variables. But there is another problem that our first dataset didn't show. Let's look at this dataset:
| preferred_movie_type | annual_income | occupation |
|---|---|---|
| Action | 66322 | Doctor |
| Action | 43873 | Student |
| Thriller | 2000 | Student |
| Comedy | 20360 | Musician |
| Thriller | 0 | Housewife |
Let's say we want to predict the preferred movie type based on the annual income and main occupation, employing our beloved linear regression. How would you proceed? How would you assign a numerical value to preferred_movie_type? Every possible choice would involve a high degree of arbitrariness. There is no clear ranking among movie types, and introducing one just for the sake of fitting our model would mean heavily manipulating our data and its original structure, with serious implications for the significance of our model's results.
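To see that arbitrariness concretely, here is a small sketch (the data frame simply reproduces the table above, and the two encodings are both made up, since no ordering is given) showing that two equally "reasonable" numeric encodings of preferred_movie_type lead to different fitted models:

```r
movies <- data.frame(
  preferred_movie_type = c("action", "action", "thriller", "comedy", "thriller"),
  annual_income = c(66322, 43873, 2000, 20360, 0),
  occupation = c("doctor", "student", "student", "musician", "housewife"),
  stringsAsFactors = FALSE)

# Two arbitrary numeric encodings of the same three categories
encoding_a <- c(action = 1, thriller = 2, comedy = 3)
encoding_b <- c(action = 2, thriller = 3, comedy = 1)

fit_a <- lm(encoding_a[movies$preferred_movie_type] ~ annual_income, data = movies)
fit_b <- lm(encoding_b[movies$preferred_movie_type] ~ annual_income, data = movies)

# The estimated effect of annual_income depends entirely on which
# encoding we happened to pick
coef(fit_a)["annual_income"]
coef(fit_b)["annual_income"]
```

Since the response is the same information either way, getting different slopes purely from our labeling choice shows how fragile the resulting model is.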
You now have at least two reasons why it is not advisable to employ linear regression for classification problems, and why we need specific models to perform classification tasks. One really small note before moving on: this also holds for non-linear regression models, which I am not going to show you, but which basically involve a non-linear relationship between the response variable and the explanatory variables.
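To convince yourself of that last point, here is a quick sketch (reusing the education data frame from above; the quadratic term is just one arbitrary non-linear choice) showing that a polynomial fit still produces an unconstrained real number, not one of our three degrees:

```r
# A quadratic regression on the same data: the model is still numeric,
# so predictions remain unbounded real values rather than categories.
poly_model <- lm(n ~ poly(family_income, 2), data = education)
prediction <- predict(poly_model, newdata = data.frame(family_income = 140000))
prediction
```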