The multilayer perceptron network learns on the basis of the delta rule, which is itself inspired by the gradient descent optimization method. The gradient method is broadly applied to find the minima or maxima of a given function. An example of the evolution of a gradient-based search is shown in the following figure:
The method works by "walking" in the direction in which the function's output becomes higher or lower, depending on the criterion. This concept is exploited in the delta rule.
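In its simplest form (shown here as an illustrative sketch rather than an equation from the text), a gradient descent step updates each parameter in the direction opposite to the gradient of the function being minimized:

w(n+1) = w(n) − η ∂f/∂w

where η is a small positive step size, the learning rate.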
The function that the delta rule seeks to minimize is the error between the neural network output and the target output, and the parameters to be found are the neural weights. It is an enhancement over the perceptron rule because it takes into account the derivative of the activation function, g'(h), so that the resulting gradient indicates, in mathematical terms, the direction in which the error decreases the most.
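As an illustration (the exact equation used in the text is not reproduced in this excerpt), the delta rule for a single neuron with output y = g(h) and target t typically takes the form:

Δwi = η (t − y) g'(h) xi

where xi is the ith input and η is the learning rate.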
Although the delta rule works well for neural networks having only input and output layers, the pure delta rule cannot be applied to MLP networks because of the hidden-layer neurons. To overcome this issue, in the 1980s Rumelhart et al. proposed a new algorithm, also inspired by the gradient method, called backpropagation.
This algorithm is indeed a generalization of the delta rule for MLPs. The benefits of having additional layers to abstract more data from the environment have motivated the development of a training algorithm that can properly adjust the weights of the hidden layers. On the basis of the gradient method, the error from the output is (back)propagated to the previous layers, making it possible to update the weights using the same equation as the delta rule. The algorithm runs according to the flowchart in the figure:
The second step, after the forward pass that computes the network outputs, is the backpropagation itself. It finds the weight variation according to the gradient of the error, which is the basis of the delta rule:
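A standard form of this gradient-based variation (written here as an illustration, assuming the weight wji feeds neuron i, since the original equation is not reproduced in this excerpt) is:

Δwji = −η ∂E/∂wji, with ∂E/∂wji = (∂E/∂oi)(∂oi/∂hi)(∂hi/∂wji)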
Where E is the error, wji is the weight between the neurons i and j, oi is the output of the ith neuron, and hi is the weighted sum of that neuron's inputs before it is passed to the activation function. Remember that oi = g(hi), where g is the activation function.
Updating the weights in the hidden layers is a bit more complicated, because the error must be considered as a function of all the neurons lying between the weight to be updated and the output. To facilitate this process, we compute the sensitivity, also called the backpropagation error:
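The referenced equation is not shown in this excerpt; a common definition of the sensitivity (or local gradient) of neuron i is:

δi = −∂E/∂hi

that is, how much the error changes when that neuron's weighted sum hi changes.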
Further, the weight update is as follows:
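As a standard form rather than the original figure, with δi the sensitivity of the neuron that the weight wji feeds and oj the output of the neuron at its other end, the update reads:

Δwji = η δi oj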
The calculation of the backpropagation error varies for the output and for the hidden layers as follows:
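In a typical formulation (the original equations are not reproduced here), for an output neuron i with target ti:

δi = (ti − oi) g'(hi)

and for a hidden neuron i whose output feeds the neurons k of the following layer through the weights wik:

δi = g'(hi) Σk δk wik

so the sensitivities of a layer are computed from the sensitivities of the layer after it, which is what gives backpropagation its name.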
For the sake of simplicity, we do not fully demonstrate how the backpropagation equations were derived. Readers interested in the details can consult the references [Haykin, 2008; Rumelhart et al., 1986].
This is how backpropagation works, enabling MLP networks to learn.
The backpropagation algorithm, like all gradient-based methods, usually presents slow convergence, particularly when it falls into a zig-zag situation, in which the weights return to almost the same values every two iterations. This drawback was studied in problems such as curve-fitting interpolation by Kenneth Levenberg in 1944 and later by Donald Marquardt in 1963, who developed a method for finding coefficients that combines the Gauss–Newton algorithm and gradient descent, hence the name of the Levenberg–Marquardt algorithm.
The algorithm involves some optimization concepts that are beyond the scope of this book; the reader will find good resources on them in the references section, so here we present the method in a simpler way. Let's suppose that we have a list of inputs x and target outputs t:
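For instance, the training data can be written as the pairs (x1, t1), (x2, t2), …, (xk, tk), where each x is an input (possibly a vector) and each t is the corresponding target output.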
We have seen that a neural network has the property of mapping inputs to outputs just like a nonlinear function f with coefficients W (weights and biases):
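In symbols, this mapping can be written (as a sketch of the relation described in the text) as:

Y = f(X, W)

where X are the inputs, W the weights and biases, and Y the values the network actually produces.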
The nonlinear function produces values that differ from the target outputs T, which is why we denote them by the variable Y in the equation. The Levenberg–Marquardt algorithm works with a Jacobian matrix, which is a matrix of the partial derivatives of the error with respect to each weight and bias, for each data row. So, the Jacobian matrix has the following format:
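A common layout for this matrix, consistent with the description above, is:

J = [ ∂e1/∂w1  ∂e1/∂w2  …  ∂e1/∂wp ]
    [ ∂e2/∂w1  ∂e2/∂w2  …  ∂e2/∂wp ]
    [    ⋮        ⋮             ⋮   ]
    [ ∂ek/∂w1  ∂ek/∂w2  …  ∂ek/∂wp ]

where er = tr − yr is the error of the rth data row and w1, …, wp enumerate all weights and biases.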
Where k is the total number of data points and p is the total number of weights and biases. In each row of the Jacobian matrix, the derivatives with respect to all weights and biases are stored serially. The elements of the Jacobian matrix are calculated from the gradients:
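A standard expression for these elements (again not reproduced from the original) is:

Jrc = ∂er/∂wc

that is, the derivative of the error of the rth data row with respect to the cth weight or bias.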
The partial derivative of the error E with respect to each weight is calculated by the backpropagation algorithm, so this algorithm runs the backpropagation step as well.
In every optimization problem, one wishes to minimize the total error:
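For k data points, this total error is typically written as the sum of squared differences between the targets and the network outputs (a standard definition, assumed here since the original equation is not reproduced):

E = (t1 − y1)² + (t2 − y2)² + … + (tk − yk)²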
Where W (the weights and biases, in the neural network case) are the variables to be optimized. The optimization algorithm updates W by adding ΔW. By applying some algebra, we can expand the last equation as follows:
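One standard way to expand it, assuming the squared-error definition above and a first-order approximation of the network output around the current W, is:

E(W + ΔW) ≈ (t1 − y1 − J1ΔW)² + … + (tk − yk − JkΔW)²

where Jr denotes the rth row of the Jacobian matrix.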
Converting to vector and matrix notation, we obtain:
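In that notation, with e = T − Y the vector of errors, the approximation above becomes:

E(W + ΔW) ≈ (e − J ΔW)ᵀ (e − J ΔW)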
Finally, by setting the error E to zero, we get the Levenberg–Marquardt equation after some manipulation:
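The well-known form of this equation, stated here from the standard literature since the original figure is not reproduced, is:

ΔW = (JᵀJ + λI)⁻¹ Jᵀ (T − Y)

where I is the identity matrix.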
This is the weight update rule. As can be seen, it involves matrix operations such as transposition and inversion. The Greek letter λ is the damping factor, playing a role equivalent to that of the learning rate.
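As a minimal sketch of these matrix operations (not from the book; it assumes the numpy library and uses hypothetical variable names), the following Python fragment computes one such weight update from a Jacobian J, an error vector e = T − Y, and a damping factor λ:

    import numpy as np

    def levenberg_marquardt_step(J, e, damping):
        """Return the weight update dW = (J^T J + lambda*I)^-1 J^T e.

        J       -- Jacobian matrix, shape (k, p): k data rows, p weights/biases
        e       -- error vector T - Y, shape (k,)
        damping -- the damping factor lambda
        """
        p = J.shape[1]
        A = J.T @ J + damping * np.eye(p)   # J^T J + lambda * I
        g = J.T @ e                         # J^T (T - Y)
        # Solving the linear system is equivalent to, and numerically more
        # stable than, explicitly inverting A.
        return np.linalg.solve(A, g)

    # Hypothetical usage: 5 data rows, 3 weights/biases
    J = np.random.randn(5, 3)
    e = np.random.randn(5)
    dW = levenberg_marquardt_step(J, e, damping=0.1)
    print(dW)  # the increment to be added to the current weights

In practice, solving the linear system rather than forming the explicit inverse is the usual design choice, but the result is the same update the equation above describes.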