Chapter 6. Bayesian Modeling – Linear Models

A linear regression model aims to explain the behavior of one variable using one or several other variables, under the assumption that the relationship between the variables is linear. More precisely, the expectation of the target variable, the one we want to explain, is an affine transformation of the other variables.

Linear models are arguably the most widely used statistical models, mainly because of their simplicity and the fact that they have been studied for decades, leading to all the extensions and analyses one can imagine. Virtually every statistical package, language, or software implements linear regression models.

The idea of the model is really simple: a variable y is to be explained by several other variables xi by assuming a linear combination of x's—that is, a weighted sum of x's.

This model first appeared in the 18th century in the work of Roger Joseph Boscovich. His method was later used by Pierre-Simon de Laplace, Adrien-Marie Legendre, and Carl Friedrich Gauss. It seems that Francis Galton, a 19th-century polymath, coined the term "regression".

The model is generally written as a linear combination of variables as follows:

y = β0 + β1x1 + β2x2 + … + βnxn + ϵ.

Here, y is the variable to explain, the x's are the explanatory variables, and ϵ is random noise, the part of y that cannot be explained by the x's. It is generally a Gaussian-distributed random variable with mean 0 and variance σ².

What does this mean in practice? The intuition behind the model is that each x, once rescaled by its coefficient, contributes a little bit to y. In other words, y is made up of a sum of small pieces, each one being a rescaled x.

There are many ways to estimate the values of the parameters from a data set, and in many situations the value of each parameter is of the utmost importance and needs to be carefully studied. The most widely used method is least squares, in which one tries to minimize the difference between the real y and its approximation by a weighted sum of x's. Indeed, representing y as a sum of other variables is, as with many models, just an approximation of reality. Many mathematical tools and algorithms have been developed for linear regression to assess the quality of the model and of its parameters.

Note

The word difference is used loosely here. The precise quantity being minimized is, of course, the mean squared error.

In this chapter, we will quickly cover the basics of linear regression. A full-scale study would be beyond the scope of this book and we assume the reader has been exposed to such models before.

The aim of this chapter is to go further and give a Bayesian flavor to linear regression. In the standard model, one only focuses on the expectation of y and point estimates of the parameters. But as soon as each of these components is considered a random variable, it becomes possible to interpret linear regression in a Bayesian way and to benefit from the many techniques that come with dealing with full probability distributions instead of just their expectations.

We will review the following elements in this chapter:

  • What a linear regression is and how to use it in R
  • The main hypotheses behind a linear regression and what to do when they break
  • How to compute the parameters by hand and in R
  • How to interpret a linear regression as a probabilistic graphical model
  • How to estimate the parameters in a Bayesian way and the benefits of doing so
  • A review of R packages for Bayesian linear regression
  • What over-fitting is, why it is so important to avoid it, and what the Bayesian solution to it is

Linear regression

We start by looking at the simplest and most widely used model in statistics, which consists of fitting a straight line to a data set. We assume we have a data set of pairs (xi, yi) that are i.i.d. and we want to fit the following model:

y = βx + β0 + ϵ

Here, ϵ is Gaussian noise. If we assume that xi ∈ ℝn, then the expected value can also be written as:

E(yi | xi) = β0 + β1xi1 + β2xi2 + … + βnxin

Or, in matrix notation, we can include the intercept β0 in the vector of parameters and add a constant 1 to each vector of predictors, such that xi = (1, xi1, …, xin), to finally obtain:

ŷi = xiTβ
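
As a tiny illustration (all numbers here are made up), this prediction is just a scalar product in R:

beta  <- c(4, 1.2, -0.5)                 # made-up parameter vector (beta0, beta1, beta2)
x_new <- c(1, 10, 3)                     # one observation, with a constant 1 prepended for the intercept
y_hat <- as.numeric(t(x_new) %*% beta)   # x^T beta = 4 + 1.2*10 - 0.5*3
y_hat                                    # 14.5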

The following figure shows an example (in one dimension) of a data set with its corresponding regression line:

[Figure: a one-dimensional data set with its fitted regression line]

In R, fitting a linear model is an easy task, as we will see now. Here, we produce a small data set with artificial numbers, in order to reproduce the previous figure. In R, the function to fit a linear model is lm() and it is the workhorse of the language in many situations. Of course, later in this chapter we will see more advanced algorithms:

N <- 30
x <- runif(N, 0, 20)                  # 30 random points between 0 and 20
y <- 1.2 * x + 4 + rnorm(N, 0, 4)     # true slope 1.2, intercept 4, Gaussian noise (sd = 4)
plot(x, y)
m <- lm(y ~ x)                        # fit the linear model
xx <- seq(0, 20, 1)
lines(xx, xx * m$coefficients[2] + m$coefficients[1], col = 2, lwd = 2)

In this example, we generate 30 random points between 0 and 20. Then we compute y on a straight line of slope 1.2 and intercept 4, adding random noise with zero mean and a standard deviation of 4.

We fit the model and store it in m with the function lm(). Finally, we plot the result.

Printing the variable m gives the following result:

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
      3.919        1.238  

Here we see that the intercept is 3.919, close to the true value of 4, and the slope is 1.238, also close to 1.2. As we added substantial noise (standard deviation of 4), it is not surprising to see a small difference between the theoretical model and the fitted model.
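
Beyond the point estimates, the standard lm() tooling also reports the uncertainty about the coefficients; for example (the exact numbers will vary with the random draw above):

summary(m)     # coefficients with standard errors, t-values, and p-values, plus R-squared
confint(m)     # 95% confidence intervals for the intercept and the slope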

Estimating the parameters

In order to estimate the parameters, we need a discrepancy measure, or in other words, a function that measures some sort of difference between the model and the data set. Of course, the goal is to minimize this difference. Here the word difference has a very broad meaning and many functions could work. However, one of the most useful and practical choices is the mean squared error, defined by:

MSE = (1/N) Σi=1..N (yi − ŷi)²
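
As a quick sketch, this quantity can be computed in R from the simulated data and the model m fitted earlier:

y_hat <- fitted(m)             # predictions of the model fitted earlier with lm()
mse   <- mean((y - y_hat)^2)   # average of the squared residuals
mse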

To estimate the parameters of the model, we resort to a method we saw earlier, called maximum likelihood. The likelihood is the probability of observing the data set given some parameters. As usual, we assume the data points are independently and identically distributed (i.i.d.). The likelihood is written as follows:

L(θ) = p(y | X, θ)

Here, θ is the set of all parameters, θ = {β1, …, βp}.

Basically we want to find the parameters that maximize this probability. As we assume i.i.d data, we can write:

L(θ) = ∏i=1..N p(yi | xi, θ)

Then, to simplify the computations we take the log-likelihood and have a sum instead of a big product:

log L(θ) = Σi=1..N log p(yi | xi, θ)

Next, we need an analytical form for this probability. In linear regression, we saw that the target variable is Gaussian distributed, as follows:

p(y | x, θ) = N(y | xTθ, σ²)

If in the log-likelihood we replace the probability with the density function of a Gaussian, we will obtain:

log L(θ) = Σi=1..N log N(yi | xiTθ, σ²) = −(N/2) log(2πσ²) − (1/(2σ²)) Σi=1..N (yi − xiTθ)²
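
As a numerical sketch (reusing the simulated x, y, and the model m from the first example, and plugging in the residual standard deviation as an estimate of σ), this log-likelihood can be evaluated with dnorm():

sigma_hat <- summary(m)$sigma    # residual standard deviation, used as an estimate of sigma
loglik <- sum(dnorm(y, mean = coef(m)[1] + coef(m)[2] * x, sd = sigma_hat, log = TRUE))
loglik
logLik(m)    # R's built-in value; close but not identical, as logLik() uses the ML estimate of sigma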

This development leads to two terms in the log-likelihood: one is constant and the other depends on the parameters. As our goal is to maximize this expression, we can drop the constant term without loss of generality. And because it is easier (in most numerical packages) to minimize a function than to maximize it, we will take the negative of what remains:

NLL(θ) = (1/(2σ²)) Σi=1..N (yi − xiTθ)²

And we find the famous result again: the maximum likelihood estimator for a linear regression model is nothing but the minimizer of the squared error!

In order to find the solution, we need a bit more linear algebra. Since the factor 1/(2σ²) does not change the location of the minimum, we can drop it and write the sum of squared errors in matrix form to make things simpler:

SSE(β) = (y − Xβ)T(y − Xβ)

Here, X is the design matrix, that is, the matrix containing the whole data set: each row i of X is the vector xiT. This form is very convenient because the sum over the data points is absorbed into the matrix products.

After expanding this expression we obtain:

SSE(β) = yTy − yTXβ − βTXTy + βTXTXβ

This again can be simplified. Indeed, we are only interested in the terms that depend on the parameters, so yTy can be discarded. Moreover, since y ∈ ℝN and Xβ ∈ ℝN, the quantity yTXβ is a scalar and is therefore equal to its transpose βTXTy. This helps us simplify the main expression again, to finally obtain:

βTXTXβ − 2βTXTy

The minimum of this convex function is reached when its first derivative (here, the gradient with respect to β) is equal to 0. So we differentiate it and obtain:

∇β(βTXTXβ − 2βTXTy) = 2XTXβ − 2XTy

Setting this gradient to zero, the solution is finally:

β̂ = (XTX)⁻¹XTy

This is what the function lm() computes in R, and what most numerical languages or packages provide. However, we advise you not to implement this formula directly, especially when the data set is big: inverting the matrix XTX can lead to numerical instabilities that are hard to control.
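
As a small check (reusing the simulated x, y, and the fitted model m from the first example), the closed-form estimate can be compared with a QR-based solution, which is numerically safer and is essentially what lm() relies on internally:

X <- cbind(1, x)                                 # design matrix with a column of 1s
beta_naive <- solve(t(X) %*% X) %*% t(X) %*% y   # textbook formula: avoid on large or ill-conditioned data
beta_qr <- qr.solve(X, y)                        # least squares via the QR decomposition
cbind(beta_naive, beta_qr, coef(m))              # all three estimates agree up to numerical precision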

The main problem with linear regression arises when the parameters are not stable, that is, when a small change in the data can have massive effects on the estimated parameters. This can be due to collinearity between variables, parameters canceling each other out, or many other reasons.

One technique to solve this problem is known as shrinkage and its goal is to constrain the parameters so that they don't grow too large. This usually gives better models with better predictive power. One simple approach to shrinkage is to put a Gaussian prior on the parameters. This technique is called ridge regression or L2 penalization.

Practically speaking, we assume the following priors on the parameters:

βj ~ N(0, τ²), for j = 1, …, p

Here, τ controls the amount of shrinkage one wants to apply to the model and the Gaussian distributions are zero-centered. The last point means we want the parameters not to go too far away from 0.

So, the optimization problem to solve becomes the following:

β̂ = argmaxβ p(y | X, β) p(β)

To be brief, we give the solution directly, as the derivation is similar to the previous one. The penalized negative log-likelihood (up to constants) is:

NLL(β) ∝ (y − Xβ)T(y − Xβ) + λβTβ, where λ = σ²/τ²

The only real difference is the last term, λβTβ, in which λ controls the amount of penalization. The larger λ is, the more the parameters are penalized when they grow. A good amount of penalization can drive some of the parameters close to zero; in that case, it can be a good idea to fit the model again without the corresponding variables. In this sense, it is a method for variable selection. Since λ = σ²/τ², a large τ reduces λ, which follows the intuition that, if the prior on the parameters has a large variance, then a broad range of values can be reached; if, on the contrary, the variance is small, then only a small range of values around zero has a high probability.

The solution to this optimization problem is:

β̂ridge = (XTX + λI)⁻¹XTy
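
As a sketch only, and purely to make the formula concrete (it reuses the design matrix X and the simulated y from the earlier check, with a hypothetical penalty value), the formula reads in R as follows:

lambda <- 1                                                    # hypothetical penalty value, for illustration
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
cbind(beta_ridge, coef(m))                                     # compare with the unpenalized estimates

Note that this naive version penalizes the intercept as well, whereas the packages below treat the intercept separately.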

Again, we advise you not to compute this directly in R but rather to rely on packages that have been specifically implemented to be numerically stable. We recommend two packages:

  • MASS with the function lm.ridge(), which is similar to lm()
  • glmnet with its function, glmnet()

The second package implements several algorithms. You can use it for ridge regression and for L1 penalization (the Lasso). In L1 penalization, instead of using a Gaussian prior on the parameters, we have a Laplace prior. The Laplace distribution is very peaky at the center and this has a particular effect: some coefficients can reach the value 0 exactly. In that case, the corresponding variables are simply eliminated from the model. It is a very powerful variable selection method.

However, the L1-penalized problem doesn't have an analytical solution and a specific optimization algorithm is needed to find it.

In R, we can fit such a model with lm(), and further below with glmnet(). We use the mtcars data set included in R, and we regress the variable mpg against all the other variables:

m1 <- lm(mpg ~ ., data = mtcars)
m1
Call:
lm(formula = mpg ~ ., data = mtcars)

Coefficients:
(Intercept)          cyl         disp           hp         drat           wt  
   12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
       qsec           vs           am         gear         carb  
    0.82104      0.31776      2.52023      0.65541     -0.19942  

We can plot the model and compare the fitted values with the real data:

yhat <- ( as.matrix(mtcars[, 2:11]) %*% m1$coefficients[2:11] ) + m1$coefficients[1]   # columns 2 to 11 are the predictors

Also note that we use the matrix product operator %*%:

plot(sort(mtcars$mpg)); lines(sort(yhat),col=2)
[Figure: sorted values of mpg with the corresponding fitted values overlaid]
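
To complete the comparison with glmnet() mentioned above, here is a hedged sketch (assuming the glmnet package is installed): alpha = 0 gives ridge regression, alpha = 1 gives the Lasso, and cv.glmnet() chooses the penalty by cross-validation:

library(glmnet)
Xm <- as.matrix(mtcars[, -1])             # all predictors (mpg, the target, is the first column)
ym <- mtcars$mpg
cv_ridge <- cv.glmnet(Xm, ym, alpha = 0)  # ridge regression (L2 penalty)
cv_lasso <- cv.glmnet(Xm, ym, alpha = 1)  # Lasso (L1 penalty): some coefficients become exactly 0
coef(cv_ridge, s = "lambda.min")
coef(cv_lasso, s = "lambda.min")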