To set neural network learning in a Bayesian context, consider the error function for the regression case, $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n; \mathbf{w}) - t_n\}^2$. It can be treated as arising from a Gaussian noise term for observing the given dataset of $N$ examples, conditioned on the weights $\mathbf{w}$. This is precisely the likelihood function, which can be written as follows:

$$p(D \mid \mathbf{w}, \beta) = \frac{1}{Z_D(\beta)} \exp\big(-\beta E_D(\mathbf{w})\big), \qquad Z_D(\beta) = \left(\frac{2\pi}{\beta}\right)^{N/2}$$
Here, $\sigma^2$ is the variance of the noise term, given by $\sigma^2 = 1/\beta$, and $p(D \mid \mathbf{w}, \beta)$ represents a probabilistic model of the observed data. The regularization term can be considered as the log of the prior probability distribution over the parameters:

$$p(\mathbf{w} \mid \alpha) = \frac{1}{Z_W(\alpha)} \exp\big(-\alpha E_W(\mathbf{w})\big), \qquad E_W(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2, \qquad Z_W(\alpha) = \left(\frac{2\pi}{\alpha}\right)^{k/2}$$
Here, $1/\alpha$ is the variance of the prior distribution of weights. It can be easily shown using Bayes' theorem that the objective function $M(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w})$ then corresponds to the posterior distribution of the parameters $\mathbf{w}$:

$$p(\mathbf{w} \mid D, \alpha, \beta) = \frac{p(D \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)}{p(D \mid \alpha, \beta)} = \frac{1}{Z_M} \exp\big(-M(\mathbf{w})\big)$$
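To make the correspondence concrete, the following is a minimal NumPy sketch of $E_D$, $E_W$, and $M(\mathbf{w})$. The function names (data_error, weight_error, m_objective) and the generic predict(X, w) forward pass are illustrative choices rather than notation from the text, and a linear model stands in for the network:

```python
import numpy as np

def data_error(w, X, t, predict):
    """E_D(w): half the sum of squared residuals between model outputs and targets."""
    y = predict(X, w)
    return 0.5 * np.sum((y - t) ** 2)

def weight_error(w):
    """E_W(w): half the squared norm of the weight vector."""
    return 0.5 * np.dot(w, w)

def m_objective(w, X, t, alpha, beta, predict):
    """M(w) = beta * E_D(w) + alpha * E_W(w): the regularized error,
    equal to the negative log posterior up to an additive constant."""
    return beta * data_error(w, X, t, predict) + alpha * weight_error(w)

# Toy usage with a linear "network" y = X @ w as a stand-in for a forward pass.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=20)
w = np.zeros(3)
print(m_objective(w, X, t, alpha=0.1, beta=1.0 / 0.3**2, predict=lambda X, w: X @ w))
```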
In the neural network case, we are interested in the local maxima of $p(\mathbf{w} \mid D, \alpha, \beta)$. The posterior is then approximated as a Gaussian around each maximum $\mathbf{w}_{\mathrm{MP}}$, as follows:

$$p(\mathbf{w} \mid D, \alpha, \beta) \approx \frac{1}{Z_M^*} \exp\left(-M(\mathbf{w}_{\mathrm{MP}}) - \frac{1}{2}(\mathbf{w} - \mathbf{w}_{\mathrm{MP}})^{\mathsf{T}} \mathbf{A}\, (\mathbf{w} - \mathbf{w}_{\mathrm{MP}})\right)$$
Here, $\mathbf{A} = \nabla\nabla M(\mathbf{w})\big|_{\mathbf{w}_{\mathrm{MP}}}$ is the matrix of second derivatives of $M(\mathbf{w})$ with respect to $\mathbf{w}$, also known as the Hessian matrix, and it represents the inverse of the covariance matrix of the approximating Gaussian.
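For a full network, $\mathbf{A}$ requires second derivatives of $M(\mathbf{w})$ at $\mathbf{w}_{\mathrm{MP}}$ (often obtained via a Gauss-Newton approximation). As a tractable sketch, the snippet below uses a linear model $y = X\mathbf{w}$, for which the Hessian $\mathbf{A} = \beta X^{\mathsf{T}}X + \alpha I$ and the mode $\mathbf{w}_{\mathrm{MP}}$ are available in closed form; the function name laplace_approximation is an illustrative choice:

```python
import numpy as np

def laplace_approximation(X, t, alpha, beta):
    """Gaussian approximation of the posterior for a linear model y = X @ w.
    Returns the mode w_MP, the Hessian A of M(w), and the covariance A^-1."""
    k = X.shape[1]
    A = beta * X.T @ X + alpha * np.eye(k)      # Hessian of M(w) at w_MP
    w_mp = beta * np.linalg.solve(A, X.T @ t)   # posterior mode (maximum of the posterior)
    cov = np.linalg.inv(A)                      # covariance of the approximating Gaussian
    return w_mp, A, cov
```

The approximating Gaussian is centered at w_mp with covariance cov, which is exactly the role the text assigns to $\mathbf{A}^{-1}$.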
The values of the hyperparameters $\alpha$ and $\beta$ are found using the evidence framework. In this, the probability $p(D \mid \alpha, \beta)$ is used as the evidence to find the best values of $\alpha$ and $\beta$ from the data $D$. This is done through the following Bayes rule:

$$p(\alpha, \beta \mid D) = \frac{p(D \mid \alpha, \beta)\, p(\alpha, \beta)}{p(D)}$$
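For reference, under the Gaussian approximation the log evidence takes a closed form (this expression is the standard one from MacKay's evidence framework rather than an equation reproduced from this chapter), and maximizing it with respect to $\alpha$ and $\beta$ yields the two conditions that follow:

$$\ln p(D \mid \alpha, \beta) = -\alpha E_W(\mathbf{w}_{\mathrm{MP}}) - \beta E_D(\mathbf{w}_{\mathrm{MP}}) - \frac{1}{2}\ln|\mathbf{A}| + \frac{k}{2}\ln\alpha + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi$$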
By using the evidence framework and the Gaussian approximation of the posterior (references 2 and 5 in the References section of this chapter), one can show that the best value of $\alpha$ satisfies the following:

$$2\alpha E_W(\mathbf{w}_{\mathrm{MP}}) = \gamma$$
Also, the best value of $\beta$ satisfies the following:

$$2\beta E_D(\mathbf{w}_{\mathrm{MP}}) = N - \gamma$$
In these equations, $\gamma$ is the number of well-determined parameters, given by $\gamma = k - \alpha\,\mathrm{trace}\big(\mathbf{A}^{-1}\big)$, where $k$ is the length of $\mathbf{w}$.
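In practice, these conditions are applied iteratively: re-fit $\mathbf{w}_{\mathrm{MP}}$, recompute $\gamma$, then update $\alpha$ and $\beta$. The sketch below runs this loop for the same linear stand-in model; the function name evidence_update is illustrative, and for a real network the Hessian of $M(\mathbf{w})$ at $\mathbf{w}_{\mathrm{MP}}$ would replace $\beta X^{\mathsf{T}}X + \alpha I$:

```python
import numpy as np

def evidence_update(X, t, alpha=1.0, beta=1.0, n_iters=20):
    """Iterative evidence-framework re-estimation of alpha and beta
    for a linear model y = X @ w."""
    N, k = X.shape
    for _ in range(n_iters):
        A = beta * X.T @ X + alpha * np.eye(k)          # Hessian of M(w)
        w_mp = beta * np.linalg.solve(A, X.T @ t)       # posterior mode
        e_w = 0.5 * np.dot(w_mp, w_mp)                  # E_W(w_MP)
        e_d = 0.5 * np.sum((X @ w_mp - t) ** 2)         # E_D(w_MP)
        gamma = k - alpha * np.trace(np.linalg.inv(A))  # well-determined parameters
        alpha = gamma / (2.0 * e_w)                     # from 2 * alpha * E_W = gamma
        beta = (N - gamma) / (2.0 * e_d)                # from 2 * beta * E_D = N - gamma
    return alpha, beta, gamma, w_mp

# Synthetic data with noise scale 0.5, i.e. a true precision of 1/0.5**2 = 4.0.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
t = X @ np.array([1.0, -1.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=50)
alpha, beta, gamma, w_mp = evidence_update(X, t)
print(f"alpha={alpha:.3f}, beta={beta:.3f}, gamma={gamma:.2f} of k=4")
```

The recovered beta should land near the true noise precision, and gamma reports how many of the k weights the data actually pins down.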