Learning process in MLPs

The multilayer perceptron network learns on the basis of the delta rule, which is itself inspired by the gradient descent optimization method. The gradient method is widely used to find the minima or maxima of a given function. An example of the evolution of a gradient-based search is shown in the following figure:

[Figure: example of the evolution of a gradient-based search method]

This method works by "walking" in the direction in which the function's output increases or decreases, depending on the criterion. This concept is explored in the delta rule.

The function that the delta rule seeks to minimize is the error between the neural network output and the target output, and the parameters to be found are the neural weights. This is an enhanced learning algorithm compared to the perceptron rule because it takes into account the activation function derivative g'(h), which, in mathematical terms, indicates the direction in which the function decreases the most.
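
To make the delta rule concrete, here is a minimal sketch for a single sigmoid neuron trained on the logical AND function. The update used, learning rate times (t - o) times g'(h) times the input, is the common textbook form and is assumed here rather than taken from the original equation; the names train_delta_rule and predict, the learning rate, and the number of epochs are all illustrative choices.

```python
import math


def sigmoid(h):
    """Activation function g(h)."""
    return 1.0 / (1.0 + math.exp(-h))


def sigmoid_prime(h):
    """Derivative g'(h) of the sigmoid."""
    s = sigmoid(h)
    return s * (1.0 - s)


def train_delta_rule(samples, lr=0.5, epochs=5000):
    """Train one sigmoid neuron with the delta rule; returns (weights, bias)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, t in samples:
            h = sum(w * xi for w, xi in zip(weights, x)) + bias
            o = sigmoid(h)
            # Assumed delta rule step: error times g'(h), scaled by the learning rate.
            step = lr * (t - o) * sigmoid_prime(h)
            weights = [w + step * xi for w, xi in zip(weights, x)]
            bias += step  # the bias behaves like a weight on a constant input of 1
    return weights, bias


def predict(weights, bias, x):
    """Neuron output for input x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Logical AND: linearly separable, so a single neuron can learn it.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_delta_rule(AND_SAMPLES)
```

Calling predict(weights, bias, (1, 1)) afterwards should give a value above 0.5, and the other three inputs a value below it.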

Simple and very powerful learning algorithm – Backpropagation

Although the delta rule works well for neural networks that have only input and output layers, the pure delta rule cannot be applied to MLP networks because of the hidden layer neurons. To overcome this issue, in the 1980s, Rumelhart et al. proposed a new algorithm called backpropagation, which is also inspired by the gradient method.

This algorithm is indeed a generalization of the delta rule for MLPs. The benefits of having additional layers to abstract more data from the environment motivated the development of a training algorithm that can properly adjust the weights of the hidden layers. On the basis of the gradient method, the error at the output is (back)propagated to the previous layers, which makes it possible to update the weights using the same equation as the delta rule. The algorithm runs according to the flowchart in the figure:

[Figure: flowchart of the backpropagation algorithm]

The second step is the backpropagation itself. It finds the weight variation according to the gradient, which is the basis of the delta rule.
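
Written out with the chain rule, this gradient step is commonly expressed as below. This is a sketch rather than the original formula: the learning rate \alpha and the descent sign are assumptions, while E, w_{ji}, o_i and h_i are the symbols the text defines.

```latex
\Delta w_{ji} = -\alpha \, \frac{\partial E}{\partial w_{ji}}
             = -\alpha \, \frac{\partial E}{\partial o_i}
               \, \frac{\partial o_i}{\partial h_i}
               \, \frac{\partial h_i}{\partial w_{ji}}
```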

Where E is the error, w_ji is the weight between the neurons i and j, o_i is the output of the ith neuron, and h_i is the weighted sum of that neuron's inputs before passing to the activation function. Remember that o_i = f(h_i), where f is the activation function.

Updating the hidden layers is a bit more complicated, because the error must be considered as a function of all the neurons between the weight to be updated and the output. To facilitate this process, we compute the sensitivity, or backpropagation error:
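
One common definition of this quantity is sketched below. The symbol \delta_i for the backpropagation error of neuron i is an assumption, since the text does not name it; E, o_i, h_i and f are used as elsewhere in this section.

```latex
\delta_i = \frac{\partial E}{\partial h_i}
         = \frac{\partial E}{\partial o_i} \, f'(h_i)
```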

Further, the weight update is as follows:
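
In terms of the backpropagation error, the update can be sketched as follows. The sketch assumes a learning rate \alpha, the symbol \delta_i = \partial E / \partial h_i for the backpropagation error of neuron i, and h_i = \sum_j w_{ji} o_j, so that o_j is the output of the neuron j feeding neuron i (or the network input when j belongs to the input layer).

```latex
\Delta w_{ji} = -\alpha \, \delta_i \, o_j
```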

The calculation of the backpropagation error varies for the output and for the hidden layers as follows:

  • Backpropagation for the output layer
    • Where o_i is the ith output, t_i is the desired ith output, f'(h_i) is the derivative of the output activation function, and h_i is the weighted sum of the ith neuron's inputs.
  • Backpropagation for the hidden layer
    • Where l is a neuron of the layer ahead and w_il is the weight that connects the current neuron to the lth neuron of the layer immediately ahead.
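
The two cases listed above are usually written as follows. This sketch assumes the squared error E = \tfrac{1}{2} \sum_i (o_i - t_i)^2 and the symbol \delta_i for the backpropagation error; with the opposite convention (t_i - o_i), the sign of the weight update flips accordingly.

```latex
% Output layer
\delta_i = (o_i - t_i) \, f'(h_i)

% Hidden layer
\delta_i = f'(h_i) \sum_{l} w_{il} \, \delta_l
```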

For the sake of simplicity, we do not show the full derivation of the backpropagation equations. Readers interested in the details can consult [Haykin, 2008; Rumelhart et al., 1986].

This is how backpropagation works, enabling MLP networks to learn.
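
As a hedged illustration, the sketch below computes the backpropagation errors for a tiny 2-2-1 network and compares the resulting gradients with finite differences. The network size, the weight values, the sigmoid activation, and the squared error E = 0.5 (o - t)^2 are all assumptions made for the example, as are the function names.

```python
import math


def f(h):
    """Sigmoid activation (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-h))


def f_prime(h):
    s = f(h)
    return s * (1.0 - s)


def forward(x, W1, b1, W2, b2):
    """W1[j][i]: weight from input j to hidden i; W2[i]: hidden i to the output."""
    h1 = [sum(W1[j][i] * x[j] for j in range(len(x))) + b1[i]
          for i in range(len(b1))]
    o1 = [f(h) for h in h1]
    h2 = sum(W2[i] * o1[i] for i in range(len(o1))) + b2
    return h1, o1, h2, f(h2)


def error(x, t, W1, b1, W2, b2):
    """Assumed error function: E = 0.5 * (o - t)^2."""
    return 0.5 * (forward(x, W1, b1, W2, b2)[3] - t) ** 2


def backprop(x, t, W1, b1, W2, b2):
    """Gradients of E with respect to W1 and W2 via backpropagation errors."""
    h1, o1, h2, o2 = forward(x, W1, b1, W2, b2)
    delta_out = (o2 - t) * f_prime(h2)                    # output layer
    delta_hid = [f_prime(h1[i]) * W2[i] * delta_out       # hidden layer
                 for i in range(len(h1))]
    grad_W2 = [delta_out * o1[i] for i in range(len(o1))]
    grad_W1 = [[delta_hid[i] * x[j] for i in range(len(h1))]
               for j in range(len(x))]
    return grad_W1, grad_W2


def numeric_grad(x, t, W1, b1, W2, b2, eps=1e-6):
    """Central finite differences of E with respect to every weight."""
    num_W1 = [[0.0] * len(b1) for _ in x]
    for j in range(len(x)):
        for i in range(len(b1)):
            plus = [row[:] for row in W1]
            minus = [row[:] for row in W1]
            plus[j][i] += eps
            minus[j][i] -= eps
            num_W1[j][i] = (error(x, t, plus, b1, W2, b2)
                            - error(x, t, minus, b1, W2, b2)) / (2 * eps)
    num_W2 = []
    for i in range(len(W2)):
        plus, minus = W2[:], W2[:]
        plus[i] += eps
        minus[i] -= eps
        num_W2.append((error(x, t, W1, b1, plus, b2)
                       - error(x, t, W1, b1, minus, b2)) / (2 * eps))
    return num_W1, num_W2


# Arbitrary illustrative values.
W1 = [[0.1, -0.2], [0.4, 0.3]]
b1 = [0.05, -0.05]
W2 = [0.7, -0.6]
b2 = 0.1
x, t = [1.0, 0.5], 1.0

grad_W1, grad_W2 = backprop(x, t, W1, b1, W2, b2)
num_W1, num_W2 = numeric_grad(x, t, W1, b1, W2, b2)
max_diff = max(
    [abs(grad_W1[j][i] - num_W1[j][i]) for j in range(2) for i in range(2)]
    + [abs(grad_W2[i] - num_W2[i]) for i in range(2)]
)
```

A very small max_diff indicates that the backpropagated gradients agree with the numerical ones; a gradient descent step would then subtract the learning rate times each gradient from the corresponding weight.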

Elaborate and potent learning algorithm – Levenberg–Marquardt

The backpropagation algorithm, like all gradient-based methods, usually converges slowly, particularly when it falls into a zig-zag pattern in which the weights return to almost the same values every two iterations. This drawback was studied in problems such as curve fitting by Kenneth Levenberg in 1944 and later by Donald Marquardt in 1963, who developed a method for finding coefficients based on the Gauss–Newton algorithm and the gradient descent algorithm; this is where the name of the algorithm comes from.

The algorithm involves some optimization concepts that are beyond the scope of this book. The references section lists good resources for learning more about them, so we will present the method in a simplified way. Let's suppose that we have a list of inputs x and target outputs t:

X = [x_1, x_2, ..., x_k],  T = [t_1, t_2, ..., t_k]

We have seen that a neural network can map inputs to outputs just like a nonlinear function f with coefficients W (weights and biases):

Y = f(X, W)

The nonlinear function will produce values different from the target outputs T, which is why its result is denoted by a separate variable Y in the equation. The Levenberg–Marquardt algorithm works on a Jacobian matrix, which is the matrix of all partial derivatives with respect to each weight and bias for each data row. So, the Jacobian matrix has the following format:
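
A sketch of that layout, assuming the entries are written J_{ij} for data row i and parameter j (k rows and p columns):

```latex
J =
\begin{bmatrix}
J_{11} & J_{12} & \cdots & J_{1p} \\
J_{21} & J_{22} & \cdots & J_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
J_{k1} & J_{k2} & \cdots & J_{kp}
\end{bmatrix}
```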

Where k is the total number of data points and p is the total number of weights and biases. In the Jacobian matrix, all weights and biases are stored serially in a single row. The elements of the Jacobian matrix are calculated from the gradients:
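
One way to write a single entry, assuming E_i denotes the error on the ith data row and W_j the jth weight or bias (both symbols are notational assumptions):

```latex
J_{ij} = \frac{\partial E_i}{\partial W_j},
\qquad i = 1, \dots, k, \quad j = 1, \dots, p
```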

The partial derivative of the error E in relation to each weight is calculated in the backpropagation algorithm, so this algorithm is going to run the backpropagation step as well.

In every optimization problem, one wishes to minimize the total error:
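
A common choice, assumed here, is the sum of squared per-row errors, with E_i = y_i - t_i for the ith data row and y_i = f(x_i, W):

```latex
E(W) = \sum_{i=1}^{k} E_i(W)^2 = \sum_{i=1}^{k} \bigl( y_i - t_i \bigr)^2
```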

Where W (the weights and biases in the neural network case) are the variables to optimize. The optimization algorithm updates W by adding ΔW. By applying some algebra, we can expand the last equation as follows:
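
One way to carry out this step is a first-order expansion of each per-row error around the current W. The sketch assumes a sum-of-squares total error, per-row errors E_i = y_i - t_i, and Jacobian entries J_{ij} = \partial E_i / \partial W_j:

```latex
E(W + \Delta W) \approx \sum_{i=1}^{k}
  \Bigl( E_i(W) + \sum_{j=1}^{p} J_{ij} \, \Delta W_j \Bigr)^2
```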

Converting to vector and matrix notation, we obtain:
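
In that notation, assuming \mathbf{e} is the column vector of the k per-row errors and J is the k-by-p Jacobian of those errors with respect to W, the linearized total error can be written as:

```latex
E(W + \Delta W) \approx
  (\mathbf{e} + J \, \Delta W)^{T} (\mathbf{e} + J \, \Delta W)
```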

Finally, by setting the error E to zero, we get the Levenberg–Marquardt equation after some manipulation:
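
The widely used form of this update is sketched below, assuming \mathbf{e} is the vector of per-row errors (outputs minus targets), J is its Jacobian with respect to W, and I is the p-by-p identity matrix:

```latex
\Delta W = -\bigl( J^{T} J + \lambda I \bigr)^{-1} J^{T} \mathbf{e}
```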

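This is the weight update rule. As can be seen, it involves matrix operations such as transposition and inversion. The Greek letter λ is the damping factor, which plays a role comparable to that of the learning rate: larger values of λ produce smaller steps that are closer to plain gradient descent.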
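
The sketch below applies this kind of update to a deliberately tiny model, a single sigmoid neuron with one weight and one bias, so that the 2-by-2 linear system can be solved by hand without a matrix library. The model, the synthetic data, the function names, and the rule for adapting λ (divide by 10 after a successful step, multiply by 10 otherwise) are all assumptions made for illustration; the text above specifies only the update equation.

```python
import math


def model(x, W):
    """Single sigmoid neuron: y = g(w * x + b), with W = [w, b]."""
    w, b = W
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def residuals(xs, ts, W):
    """Per-row errors e_i = y_i - t_i."""
    return [model(x, W) - t for x, t in zip(xs, ts)]


def jacobian(xs, W):
    """One row per data point: [d e_i / d w, d e_i / d b]."""
    rows = []
    for x in xs:
        y = model(x, W)
        dy_dh = y * (1.0 - y)  # sigmoid derivative
        rows.append([dy_dh * x, dy_dh])
    return rows


def lm_step(J, e, lam):
    """Solve (J^T J + lam * I) dW = -J^T e for two parameters."""
    a11 = sum(r[0] * r[0] for r in J) + lam
    a12 = sum(r[0] * r[1] for r in J)
    a22 = sum(r[1] * r[1] for r in J) + lam
    g1 = sum(r[0] * ei for r, ei in zip(J, e))
    g2 = sum(r[1] * ei for r, ei in zip(J, e))
    det = a11 * a22 - a12 * a12
    return [-(a22 * g1 - a12 * g2) / det, -(a11 * g2 - a12 * g1) / det]


def fit(xs, ts, W, lam=0.01, iterations=100):
    """Levenberg-Marquardt loop with an assumed damping schedule."""
    W = list(W)
    e = residuals(xs, ts, W)
    err = sum(v * v for v in e)
    for _ in range(iterations):
        if err < 1e-20:
            break
        dW = lm_step(jacobian(xs, W), e, lam)
        W_new = [W[0] + dW[0], W[1] + dW[1]]
        e_new = residuals(xs, ts, W_new)
        err_new = sum(v * v for v in e_new)
        if err_new < err:
            # Accept the step and trust the Gauss-Newton direction more.
            W, e, err = W_new, e_new, err_new
            lam /= 10.0
        else:
            # Reject the step and move toward gradient descent.
            lam *= 10.0
    return W, err


# Synthetic data generated from known parameters (illustrative only).
TRUE_W = [2.0, -1.0]
xs = [-2.0 + 0.5 * i for i in range(9)]
ts = [model(x, TRUE_W) for x in xs]

W_fit, final_error = fit(xs, ts, [0.0, 0.0])
```

Because the targets are generated by the same model, the fitted parameters should approach TRUE_W and the final error should be close to zero. For a real network, J would have one column per weight and bias, and a linear solver would replace the hand-written 2-by-2 solution.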
