So far, we have learned that simply minimizing the loss function (or, equivalently, maximizing the log-likelihood function in the case of a normal distribution) is not enough to develop a machine learning model for a given problem. One has to worry about the model overfitting the training data, which results in larger prediction errors on new datasets. The main advantage of Bayesian methods is that, in principle, one can avoid this problem without using explicit regularization or separate datasets for training and validation. This approach is called Bayesian model averaging and will be discussed here. It is one of the answers to the main question of this chapter: why use Bayesian inference for machine learning?
For this, let's do a full Bayesian treatment of the linear regression problem. Since we only want to explain how Bayesian inference avoids the overfitting problem, we will skip all the mathematical derivations and state only the important results here. For more details, interested readers can refer to the book by Christopher M. Bishop (reference 2 in the References section of this chapter).
The linear regression equation $Y = f(X) + \epsilon$, with $\epsilon$ having a normal distribution with zero mean and variance $\sigma^2$ (equivalently, precision $\beta = 1/\sigma^2$), can be cast in a probability distribution form with $Y$ having a normal distribution with mean $f(X)$ and precision $\beta$. Therefore, linear regression is equivalent to estimating the mean of the normal distribution:

$$P(Y \mid X, W, \beta) = \mathcal{N}\left(Y \mid f(X), \beta^{-1}\right)$$
Since $f(X) = W^{T} B(X)$, where the set of basis functions $B(X)$ is known and we are assuming here that the noise parameter $\beta$ is also a known constant, only $W$ needs to be taken as an uncertain variable for a fully Bayesian treatment.
The first step in Bayesian inference is to compute the posterior distribution of the parameter vector $W$. For this, we assume that the prior distribution of $W$ is an $M$-dimensional normal distribution (since $W$ has $M$ components) with mean $m_0$ and covariance matrix $S_0$. As we have seen in Chapter 3, Introducing Bayesian Inference, this corresponds to taking a conjugate distribution for the prior:

$$P(W) = \mathcal{N}\left(W \mid m_0, S_0\right)$$
The corresponding posterior distribution is given by:

$$P(W \mid Y) = \mathcal{N}\left(W \mid m_N, S_N\right)$$
Here, $m_N = S_N\left(S_0^{-1} m_0 + \beta B^{T} Y\right)$ and $S_N^{-1} = S_0^{-1} + \beta B^{T} B$.
Here, $B$ is an $N \times M$ matrix formed by stacking the basis vectors $B(x_n)^{T}$, evaluated at the $N$ training values of $X$, on top of each other, as shown here:

$$B = \begin{pmatrix} B_1(x_1) & B_2(x_1) & \cdots & B_M(x_1) \\ B_1(x_2) & B_2(x_2) & \cdots & B_M(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ B_1(x_N) & B_2(x_N) & \cdots & B_M(x_N) \end{pmatrix}$$
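To make the posterior update concrete, here is a minimal NumPy sketch. The polynomial basis, the synthetic data, and the values of $\alpha$ and $\beta$ are illustrative assumptions chosen for this example, not prescriptions from the text:

```python
import numpy as np

# Hypothetical 1-D training data; a polynomial basis is chosen purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=20)
Y = np.sin(np.pi * X) + rng.normal(0, 0.2, size=20)

M = 4        # number of basis functions (assumed)
beta = 25.0  # noise precision 1/sigma^2, assumed known
alpha = 2.0  # prior precision, assumed

def basis(x):
    """Basis vector B(x) = (1, x, x^2, ..., x^{M-1})."""
    return np.array([x**j for j in range(M)])

# Design matrix B: one row B(x_n)^T per training point, giving an N x M matrix.
B = np.vstack([basis(x) for x in X])

# Prior: zero mean, isotropic covariance S0 = (1/alpha) I, so S0^{-1} = alpha I.
S0_inv = alpha * np.eye(M)

# Conjugate posterior update:
#   S_N^{-1} = S0^{-1} + beta B^T B
#   m_N      = beta S_N B^T Y        (since m0 = 0)
SN = np.linalg.inv(S0_inv + beta * B.T @ B)
mN = beta * SN @ B.T @ Y
```

The posterior mean `mN` plays the role of a regularized least-squares estimate, while `SN` carries the remaining uncertainty about $W$ that Bayesian averaging will integrate over.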
Now that we have the posterior distribution for $W$ as a closed-form analytical expression, we can use it to predict new values of $Y$. To get an analytical closed-form expression for the predictive distribution of $Y$, we make the assumptions that $m_0 = 0$ and $S_0 = \alpha^{-1} I$. This corresponds to a prior with zero mean and an isotropic covariance matrix characterized by a single precision parameter $\alpha$. The predictive distribution, or the probability that the prediction for a new value of $X = x$ is $y$, is given by:

$$P(y \mid x) = \int P(y \mid x, W)\, P(W \mid Y)\, dW$$
This equation is the central theme of this section. In the classical or frequentist approach, one estimates a particular value $W^{*}$ for the parameter from the training dataset and finds the probability of predicting $y$ by simply using $P(y \mid x, W^{*})$. This does not address the overfitting of the model unless regularization is used. In Bayesian inference, we integrate out the parameter variable $W$ by using its posterior probability distribution learned from the data. This averaging removes the need for regularization or for keeping the parameters at an optimal level through the bias-variance tradeoff. This can be seen from the closed-form expression for $P(y \mid x)$, after we substitute the expressions for $P(y \mid x, W)$ and $P(W \mid Y)$ for the linear regression problem and carry out the integration. Since both are normal distributions, the integration can be done analytically, resulting in the following simple expression for $P(y \mid x)$:

$$P(y \mid x) = \mathcal{N}\left(y \mid m_N^{T} B(x), \sigma_N^{2}(x)\right)$$
Here, $\sigma_N^{2}(x) = \dfrac{1}{\beta} + B(x)^{T} S_N B(x)$.
This equation implies that the variance of the predictive distribution consists of two terms. The first term, $1/\beta$, comes from the inherent noise in the data; the second term, $B(x)^{T} S_N B(x)$, comes from the uncertainty associated with estimating the model parameter $W$ from data. One can show that, as the size $N$ of the training data becomes very large, the second term decreases, and in the limit it becomes zero.
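The shrinking of the second variance term with growing $N$ can be checked numerically. The sketch below (again with an assumed polynomial basis and illustrative values of $\alpha$ and $\beta$) evaluates $\sigma_N^{2}(x)$ at a fixed query point for increasing training-set sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
M, alpha, beta = 4, 2.0, 25.0  # illustrative assumptions

def basis(x):
    """Basis vector B(x) = (1, x, x^2, ..., x^{M-1})."""
    return np.array([x**j for j in range(M)])

def predictive_var(N):
    """Predictive variance sigma_N^2(x) at x = 0.5 after training on N random points."""
    X = rng.uniform(-1, 1, size=N)
    B = np.vstack([basis(x) for x in X])
    SN = np.linalg.inv(alpha * np.eye(M) + beta * B.T @ B)
    b = basis(0.5)
    # sigma_N^2(x) = 1/beta + B(x)^T S_N B(x)
    return 1.0 / beta + b @ SN @ b

# The model-uncertainty term shrinks as N grows; the 1/beta noise floor remains.
for N in (10, 100, 1000):
    print(N, predictive_var(N))
```

As $N$ increases, the printed variance approaches the noise floor $1/\beta = 0.04$ from above: the data progressively pins down $W$, but the intrinsic observation noise never goes away.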
The example shown here illustrates the power of Bayesian inference. Since uncertainty in the parameter estimation is taken care of through Bayesian averaging, one doesn't need to keep separate validation data; all the data can be used for training. So, a full Bayesian treatment of a problem avoids the overfitting issue. Another major advantage of Bayesian inference, which we will not go into in this section, is the treatment of latent variables in a machine learning model. In the next section, we will give a high-level overview of the various common machine learning tasks.