The multilayer perceptron network learns on the basis of the delta rule, which is itself inspired by the gradient descent optimization method. The gradient method is broadly applied to find the minima or maxima of a given function. An example of the evolution of a gradient-based search is shown in the following figure:
The method works by "walking" in the direction in which the function's output becomes higher or lower, depending on the criterion. This concept is exploited in the delta rule.
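In its simplest form (shown here as an illustrative sketch rather than an equation from the text), a gradient descent step updates each parameter in the direction opposite to the gradient of the function being minimized:

w(n+1) = w(n) − η ∂f/∂w

where η is a small positive step size, the learning rate.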
The function that the delta rule seeks to minimize is the error between the neural network output and the target output, and the parameters to be found are the neural weights. It is an enhancement over the perceptron rule because it takes into account the derivative of the activation function, g'(h), so that the resulting gradient indicates, in mathematical terms, the direction in which the error decreases the most.
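As an illustration (the exact equation used in the text is not reproduced in this excerpt), the delta rule for a single neuron with output y = g(h) and target t typically takes the form:

Δwi = η (t − y) g'(h) xi

where xi is the ith input and η is the learning rate.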
Although the delta rule works well for neural networks having only input and output layers, the pure delta rule cannot be applied to MLP networks because of the hidden-layer neurons. To overcome this issue, in the 1980s Rumelhart et al. proposed a new algorithm, also inspired by the gradient method, called backpropagation.
This algorithm is indeed a generalization of the delta rule for MLPs. The benefits of having additional layers to abstract more data from the environment have motivated the development of a training algorithm that can properly adjust the weights of the hidden layers. On the basis of the gradient method, the error from the output is (back)propagated to the previous layers, making it possible to update the weights using the same equation as the delta rule. The algorithm runs according to the flowchart in the figure:
The second step, after the forward pass that computes the network outputs, is the backpropagation itself. It finds the weight variation according to the gradient of the error, which is the basis of the delta rule:
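A standard form of this gradient-based variation (written here as an illustration, assuming the weight wji feeds neuron i, since the original equation is not reproduced in this excerpt) is:

Δwji = −η ∂E/∂wji, with ∂E/∂wji = (∂E/∂oi)(∂oi/∂hi)(∂hi/∂wji)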
Where E is the error, wji is the weight between the neurons i and j, oi is the output of the ith neuron, and hi is the weighted sum of that neuron's inputs before it is passed to the activation function. Remember that oi = g(hi), where g is the activation function.
Updating the weights in the hidden layers is a bit more complicated, because the error must be considered as a function of all the neurons lying between the weight to be updated and the output. To facilitate this process, we compute the sensitivity, also called the backpropagation error:
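The referenced equation is not shown in this excerpt; a common definition of the sensitivity (or local gradient) of neuron i is:

δi = −∂E/∂hi

that is, how much the error changes when that neuron's weighted sum hi changes.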
Further, the weight update is as follows:
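As a standard form rather than the original figure, with δi the sensitivity of the neuron that the weight wji feeds and oj the output of the neuron at its other end, the update reads:

Δwji = η δi oj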
The calculation of the backpropagation error varies for the output and for the hidden layers as follows:
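In a typical formulation (the original equations are not reproduced here), for an output neuron i with target ti:

δi = (ti − oi) g'(hi)

and for a hidden neuron i whose output feeds the neurons k of the following layer through the weights wik:

δi = g'(hi) Σk δk wik

so the sensitivities of a layer are computed from the sensitivities of the layer after it, which is what gives backpropagation its name.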
For the sake of simplicity, we do not fully demonstrate how the backpropagation equations were derived. Readers interested in the details can consult the references [Haykin, 2008; Rumelhart et al., 1986].
This is how backpropagation works, enabling MLP networks to learn.
The backpropagation algorithm, like all gradient-based methods, usually presents slow convergence, particularly when it falls into a zig-zag situation, in which the weights return to almost the same values every two iterations. This drawback was studied in problems such as curve-fitting interpolation by Kenneth Levenberg in 1944 and later by Donald Marquardt in 1963, who developed a method for finding coefficients that combines the Gauss–Newton algorithm and gradient descent, hence the name of the Levenberg–Marquardt algorithm.
The algorithm involves some optimization concepts that are beyond the scope of this book; the reader will find good resources on them in the references section, so here we present the method in a simpler way. Let's suppose that we have a list of inputs x and target outputs t:
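For instance, the training data can be written as the pairs (x1, t1), (x2, t2), …, (xk, tk), where each x is an input (possibly a vector) and each t is the corresponding target output.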
We have seen that a neural network has the property of mapping inputs to outputs just like a nonlinear function f with coefficients W (weights and biases):
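In symbols, this mapping can be written (as a sketch of the relation described in the text) as:

Y = f(X, W)

where X are the inputs, W the weights and biases, and Y the values the network actually produces.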
The nonlinear function produces values that differ from the target outputs T, which is why we denote them by the variable Y in the equation. The Levenberg–Marquardt algorithm works with a Jacobian matrix, which is a matrix of the partial derivatives of the error with respect to each weight and bias, for each data row. So, the Jacobian matrix has the following format:
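A common layout for this matrix, consistent with the description above, is:

J = [ ∂e1/∂w1  ∂e1/∂w2  …  ∂e1/∂wp ]
    [ ∂e2/∂w1  ∂e2/∂w2  …  ∂e2/∂wp ]
    [    ⋮        ⋮             ⋮   ]
    [ ∂ek/∂w1  ∂ek/∂w2  …  ∂ek/∂wp ]

where er = tr − yr is the error of the rth data row and w1, …, wp enumerate all weights and biases.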
Where k is the total number of data points and p is the total number of weights and biases. In each row of the Jacobian matrix, the derivatives with respect to all weights and biases are stored serially. The elements of the Jacobian matrix are calculated from the gradients:
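A standard expression for these elements (again not reproduced from the original) is:

Jrc = ∂er/∂wc

that is, the derivative of the error of the rth data row with respect to the cth weight or bias.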
The partial derivative of the error E with respect to each weight is calculated by the backpropagation algorithm, so this algorithm runs the backpropagation step as well.
In every optimization problem, one wishes to minimize the total error:
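For k data points, this total error is typically written as the sum of squared differences between the targets and the network outputs (a standard definition, assumed here since the original equation is not reproduced):

E = (t1 − y1)² + (t2 − y2)² + … + (tk − yk)²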
Where W (the weights and biases, in the neural network case) are the variables to be optimized. The optimization algorithm updates W by adding ΔW. By applying some algebra, we can expand the last equation as follows:
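One standard way to expand it, assuming the squared-error definition above and a first-order approximation of the network output around the current W, is:

E(W + ΔW) ≈ (t1 − y1 − J1ΔW)² + … + (tk − yk − JkΔW)²

where Jr denotes the rth row of the Jacobian matrix.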
Converting to vector and matrix notation, we obtain:
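In that notation, with e = T − Y the vector of errors, the approximation above becomes:

E(W + ΔW) ≈ (e − J ΔW)ᵀ (e − J ΔW)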
Finally, by setting the error E to zero, we get the Levenberg–Marquardt equation after some manipulation:
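The well-known form of this equation, stated here from the standard literature since the original figure is not reproduced, is:

ΔW = (JᵀJ + λI)⁻¹ Jᵀ (T − Y)

where I is the identity matrix.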
This is the weight update rule. As can be seen, it involves matrix operations such as transposition and inversion. The Greek letter λ is the damping factor, playing a role equivalent to that of the learning rate.
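As a minimal sketch of these matrix operations (not from the book; it assumes the numpy library and uses hypothetical variable names), the following Python fragment computes one such weight update from a Jacobian J, an error vector e = T − Y, and a damping factor λ:

    import numpy as np

    def levenberg_marquardt_step(J, e, damping):
        """Return the weight update dW = (J^T J + lambda*I)^-1 J^T e.

        J       -- Jacobian matrix, shape (k, p): k data rows, p weights/biases
        e       -- error vector T - Y, shape (k,)
        damping -- the damping factor lambda
        """
        p = J.shape[1]
        A = J.T @ J + damping * np.eye(p)   # J^T J + lambda * I
        g = J.T @ e                         # J^T (T - Y)
        # Solving the linear system is equivalent to, and numerically more
        # stable than, explicitly inverting A.
        return np.linalg.solve(A, g)

    # Hypothetical usage: 5 data rows, 3 weights/biases
    J = np.random.randn(5, 3)
    e = np.random.randn(5)
    dW = levenberg_marquardt_step(J, e, damping=0.1)
    print(dW)  # the increment to be added to the current weights

In practice, solving the linear system rather than forming the explicit inverse is the usual design choice, but the result is the same update the equation above describes.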