A linear regression model aims at explaining the behavior of one variable with one or several others, under the assumption that the relationship between the variables is linear. More precisely, the expectation of the target variable, the one to be explained, is an affine transform of several other variables.
Linear models are presumably the most used statistical models, mainly because of their simplicity and the fact that they have been studied for decades, leading to all the extensions and analyses one can imagine. Basically all statistical packages, languages, or software implement linear regression models.
The idea of the model is really simple: a variable y is to be explained by several other variables xi by assuming a linear combination of x's—that is, a weighted sum of x's.
This model appeared in the 18th century in the work of Roger Joseph Boscovich. His method was later taken up by Pierre-Simon de Laplace, Adrien-Marie Legendre, and Carl Friedrich Gauss. It seems that Francis Galton, in the 19th century, coined the term "regression".
The model is generally written as a linear combination of variables as follows:
y = β0 + β1x1 + β2x2 + … + βnxn + ϵ.
Here, y is the variable to explain, the x's are the explaining variables, and ϵ is a random noise term that cannot be explained by the x's. It is generally a Gaussian-distributed random variable with mean 0 and variance σ2.
What does this mean in practice? The intuition behind the model is that each x, after being rescaled, contributes a little bit to y. In other words, y is made of a sum of little pieces, each being a rescaled x.
There are many ways to estimate the value of the parameters from a dataset, and obviously in many situations the value of each parameter is of the utmost importance and needs to be carefully studied. The most used method is the least squares method, in which one tries to minimize the difference between the real y and its approximation by a sum of x's. Indeed, representing y as a sum of other variables is, as with many models, just an approximation of reality. Many mathematical tools and algorithms have been developed for linear regression to answer the question of the quality of the model and its parameters.
In this chapter, we will quickly cover the basics of linear regression. A full-scale study would be beyond the scope of this book and we assume the reader has been exposed to such models before.
The aim of this chapter is to go further and give a Bayesian flavor to linear regression. In fact, in the standard model, one only focuses on the expectation of y and the parameters. But as soon as each of these components is considered as a random variable, it is possible to express linear regression in a Bayesian way and open oneself to many new techniques and benefits of dealing with full probability distributions instead of just their expectations.
We will review the following elements in this chapter:
We start by looking at the simplest and most used model in statistics, which consists of fitting a straight line to a data set. We assume we have a data set of pairs (xi, yi) that are i.i.d. and we want to fit a model such that:
y = βx +β0 + ϵ
Here, ϵ is a Gaussian noise. If we assume that x ∈ ℝn, then the expected value can also be written as:

ŷ = β0 + β1x1 + β2x2 + … + βnxn
Or, in matrix notation, we can also include the intercept β0 into the vector of parameters and add a column of 1s in X, such that X = (1, x1, …, xn), to finally obtain:
ŷ = XTβ
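To make this notation concrete, here is a minimal sketch (the data and parameter values are ours, purely illustrative) that builds a design matrix with a column of 1s and computes the predictions:

```r
# Two illustrative explanatory variables for five observations
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(0, 1, 0, 1, 0)

# Design matrix: a column of 1s for the intercept, then the variables
X <- cbind(1, x1, x2)

# Arbitrary parameters: intercept 4, slopes 1.2 and -0.5
beta <- c(4, 1.2, -0.5)

# Each prediction is the scalar product of a row of X with beta
y_hat <- X %*% beta
```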
The following figure shows an example (in one dimension) of a data set with its corresponding regression line:
In R, fitting a linear model is an easy task, as we will see now. Here, we produce a small artificial data set in order to reproduce the previous figure. In R, the function to fit a linear model is lm() and it is the workhorse of this language in many situations. Of course, later in this chapter we will see more advanced algorithms:
N = 30
x = runif(N, 0, 20)
y = 1.2 * x + 4 + rnorm(N, 0, 4)
plot(x, y)
m = lm(y ~ x)
xx = seq(0, 20, 1)
lines(xx, xx * m$coefficients[2] + m$coefficients[1], col = 2, lwd = 2)
In this example, we generate 30 random points between 0 and 20. Then we compute y on a straight line of slope 1.2 and intercept 4, adding a Gaussian noise with zero mean and standard deviation 4 (note that the third argument of rnorm() is the standard deviation, not the variance). We fit the model in m with the function lm(). Finally, we plot the result.
Printing the variable m gives the following result:
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
      3.919        1.238
Here we see that the intercept is 3.919, close to the true value of 4, and the slope is 1.238, also close to 1.2. As we added an important noise (of standard deviation 4), it is not surprising to see a difference between the theoretical model and the fitted model.
In order to estimate the parameters, we need a discrepancy measure, or in other words, a function that measures some sort of difference between the model and the data set. Of course, the goal is to minimize this difference. Here the word difference has a very broad meaning and many functions could work. However, it has been found that one of the most useful and practical is the mean squared error, defined by:

MSE = (1/N) Σi (yi − ŷi)2
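In R, this quantity can be computed directly from the residuals of a fitted model. A short sketch on simulated data (the variable names are ours):

```r
set.seed(42)
x <- runif(30, 0, 20)
y <- 1.2 * x + 4 + rnorm(30, 0, 4)
m <- lm(y ~ x)

# Mean squared error: the average of the squared residuals
mse <- mean(residuals(m)^2)

# The same quantity written out explicitly
mse2 <- mean((y - fitted(m))^2)
```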
To estimate the parameters of the model, we resort to a method we saw earlier, called maximum likelihood. The likelihood is, in this case, the probability of observing the data set given some parameters. We assume, as usual, that the data are identically and independently distributed. This allows us to write the maximum likelihood estimator as follows:

θ̂ = argmaxθ p(y | X, θ)
Here, θ is the set of all parameters θ = {β1, …, βp}.
Basically, we want to find the parameters that maximize this probability. As we assume i.i.d. data, we can write:

p(y | X, θ) = ∏i p(yi | xi, θ)
Then, to simplify the computations, we take the log-likelihood and obtain a sum instead of a big product:

log p(y | X, θ) = Σi log p(yi | xi, θ)
Next, we need an analytical form for this probability. In linear regression, we saw that the target data is Gaussian distributed as follows:
p(y | x, θ) = N(y | xTθ, σ2)
If, in the log-likelihood, we replace the probability with the density function of a Gaussian, we obtain:

log p(y | X, θ) = −(N/2) log(2πσ2) − (1/(2σ2)) Σi (yi − xiTθ)2
This development leads to two terms in the last equation; one is constant and the other depends on the parameters. As our goal is to maximize this expression, we can get rid of the constant term without loss of generality. And because it is easier (in most numerical packages) to minimize a function than to maximize it, we take the negative of the remaining term:

NLL(θ) = (1/(2σ2)) Σi (yi − xiTθ)2
And we find the famous results again: the maximum likelihood estimator for a linear regression model is nothing but minimizing the square error!
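We can check this equivalence numerically: minimizing the sum of squared errors with a generic optimizer should recover, up to numerical tolerance, the coefficients found by lm(). A sketch on simulated data (the names are ours):

```r
set.seed(1)
x <- runif(50, 0, 20)
y <- 1.2 * x + 4 + rnorm(50, 0, 4)

# The negative log-likelihood is, up to constants, the sum of squared errors
sse <- function(beta) sum((y - beta[1] - beta[2] * x)^2)

# Minimize it numerically and compare with lm()
fit <- optim(c(0, 0), sse, method = "BFGS")
m <- lm(y ~ x)
```

The two estimates, fit$par and coef(m), should agree to several decimal places.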
In order to find the solution, we need to do a bit more linear algebra. First, we write this expression in matrix form, with β the vector of parameters, to make things simpler:

NLL(β) ∝ (y − Xβ)T(y − Xβ)
Here, X is the design matrix—that is, the matrix of the whole data set. Each row i of the matrix is a vector xiT. This form is very convenient because the sum over the data points is absorbed into the matrix product.
After expanding this expression, we obtain:

yTy − yTXβ − βTXTy + βTXTXβ
This again can be simplified. Indeed, we are only interested in the terms that depend on the parameters, so the rest can be discarded. Moreover, we know that y ∈ ℝN and the same for Xβ, so that yTXβ is a scalar and yTXβ = βTXTy. This helps us to simplify the main expression again, to finally obtain:

βTXTXβ − 2βTXTy
The minimum of this convex function is reached when its first derivative (in fact, the Jacobian matrix) is equal to 0. So we differentiate it and obtain:

2XTXβ − 2XTy
Setting this expression equal to zero, the solution is finally:

β̂ = (XTX)−1XTy
This is what the function lm() computes in R, as do most numerical languages or packages. However, we advise you not to implement it directly, especially when the data set is big. Indeed, inverting such a matrix can lead to numerical instabilities that are hard to control.
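For a small, well-conditioned problem, the closed-form solution can still be verified against lm(). The sketch below (simulated data, names are ours) uses solve() on the normal equations rather than forming an explicit inverse, which is slightly safer numerically:

```r
set.seed(7)
x <- runif(30, 0, 20)
y <- 1.2 * x + 4 + rnorm(30, 0, 4)

# Design matrix with a column of 1s for the intercept
X <- cbind(1, x)

# Solve X'X beta = X'y instead of inverting X'X explicitly
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Reference fit for comparison
m <- lm(y ~ x)
```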
The main problem when doing linear regression arises when the parameters are not stable and a little change in the data can have massive effects on them. This could be due to collinearity between variables, parameters canceling each other, or many other reasons.
One technique to solve this problem is known as shrinkage, and its goal is to constrain the parameters so that they don't grow too far away. This usually gives better models with better predictive power. One simple approach to shrinkage is to put a Gaussian prior on the parameters. This technique is called ridge regression or L2 penalization.
Practically speaking, we assume the following priors on the parameters:

βj ∼ N(0, τ2), for each j = 1, …, p
Here, τ controls the amount of shrinkage one wants to apply to the model and the Gaussian distributions are zero-centered. The last point means we want the parameters not to go too far away from 0.
So, the optimization problem to solve becomes the following:

β̂ = argminβ Σi (yi − xiTβ)2 + λ Σj βj2
To be brief, we give the solution directly, because the calculus is similar to the previous development. The negative log-likelihood is therefore:

NLL(β) = (1/(2σ2)) (y − Xβ)T(y − Xβ) + (1/(2τ2)) βTβ
The only difference, really, is the last term: λ, with λ = σ2/τ2, controls the amount of penalization. The higher this term, the more the parameters will be penalized when they grow too much. A good amount of penalization can drive some of the parameters close to zero. In that case, it can be a good idea to fit the model again without those parameters; in a sense, it is a method for variable selection. A large τ will reduce λ, which follows the intuition that, if the prior on the parameters has a large variance, then a broad range of values can be reached. If, on the contrary, the variance is small, then only a smaller range of values will have a high probability.
The solution to this optimization problem is:

β̂ = (XTX + λI)−1XTy
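The ridge estimate can be sketched the same way as before. For simplicity, this toy version (simulated data, names are ours) penalizes the intercept too, which one would normally avoid; its only purpose is to show that the penalized solution has a smaller norm than the ordinary least squares one:

```r
set.seed(3)
x <- runif(30, 0, 20)
y <- 1.2 * x + 4 + rnorm(30, 0, 4)
X <- cbind(1, x)
lambda <- 10

# Ridge estimate: solve (X'X + lambda I) beta = X'y
beta_ridge <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)

# Ordinary least squares estimate, for comparison
beta_ols <- solve(t(X) %*% X, t(X) %*% y)
```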
Again, we advise you not to compute this directly in R but rather to rely on packages that have been specifically implemented to be numerically stable. We recommend two packages:

lm.ridge(), from the MASS package, which is similar to lm()
glmnet, with its function glmnet()
The second package implements several algorithms. You can use it for ridge regression and for L1 penalization (the Lasso). In L1 penalization, instead of using a Gaussian prior on the parameters, we use a Laplace prior. The Laplace distribution is very peaky at the center, and this has a particular effect: collinear or useless parameters can reach the value 0 exactly. In that case, they are simply eliminated from the model. It's a very powerful variable selection method.
However, the problem doesn't have an analytical solution and needs a specific optimization algorithm to find the solution.
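A minimal glmnet sketch (assuming the glmnet package is installed; the simulated variables are ours). The alpha argument selects the penalty: alpha = 1 gives the Lasso (L1), and alpha = 0 gives ridge (L2):

```r
library(glmnet)

set.seed(5)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, 0, 0.01)  # almost collinear with x1
x3 <- rnorm(n)
X <- cbind(x1, x2, x3)
y <- 2 * x1 + 0.5 * x3 + rnorm(n)

# alpha = 1: Lasso (L1); alpha = 0: ridge (L2)
fit <- glmnet(X, y, alpha = 1)

# Coefficients at a given amount of penalization
coef(fit, s = 0.1)
```

With enough penalization, the coefficient of one of the two collinear variables is driven exactly to zero.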
In R, we can fit a model with lm() and glmnet() as follows. We use the mtcars data set included in R, where we want to regress the variable mpg against the other variables:
m1 <- lm(mpg ~ ., mtcars)
m1

Call:
lm(formula = mpg ~ ., data = mtcars)

Coefficients:
(Intercept)          cyl         disp           hp         drat           wt
   12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530
       qsec           vs           am         gear         carb
    0.82104      0.31776      2.52023      0.65541     -0.19942
We can plot the model and see the theoretical line with the real data set:
# mpg is column 1 of mtcars; the explanatory variables are columns 2 to 11
yhat <- ( as.matrix(mtcars[2:11]) %*% m1$coefficients[2:11] ) + m1$coefficients[1]
Also note that we use the matrix product %*%:
plot(sort(mtcars$mpg))
lines(sort(yhat), col = 2)