The learning process

So far, we have defined the learning process theoretically and described how it is carried out. In practice, however, we must dive a little deeper into the mathematical logic in order to implement the learning algorithm itself. For simplicity, this chapter mainly covers the supervised learning case; however, we will also present a rule for updating weights in unsupervised learning. A learning algorithm is a procedure that drives the learning process of a neural network, and it is strongly determined by the neural network architecture. From a mathematical point of view, one wishes to find the optimal weights W that drive the cost function C(X, Y) to the lowest possible value. Sometimes, however, the learning process cannot find a set of weights that meets the acceptance criteria, so a stop condition must be set to prevent the neural network from learning forever and thereby causing the Java program to freeze.

In general, this process is carried out in the fashion presented in the following flowchart:

(Figure: flowchart of the learning process)

The cost function finding the way down to the optimum

Now let's find out in detail what role the cost function plays. Let's think of the cost function as a two-variable function whose graph is a hypersurface. For simplicity, let's consider for now only two weights (a two-dimensional space, plus a height representing the value of the cost function). Suppose our cost function has the following shape:

(Figure: a cost function surface over two weights, with an optimum point)

Visually, we can see that there is an optimum, at which the cost function roughly approaches zero. But how can we find this point programmatically? The answer lies in mathematical optimization, whereby minimizing the cost function is stated as an optimization problem:

\[
W^{*} = \underset{W}{\arg\min}\; C(X, Y)
\]

By recalling Fermat's theorem on stationary points, the optimal solution lies where the slope of the surface is zero in all dimensions, that is, where every partial derivative is zero, and where the surface is convex (for the minimum case). Considering that one starts from an arbitrary solution W, the search for the optimum should follow the direction in which the surface height decreases. This is the so-called gradient method.
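The descent just described can be sketched numerically. The following is a minimal Java sketch over a hypothetical two-weight cost C(w1, w2) = w1² + w2², an assumed stand-in for the surface in the figure (the class and method names are illustrative, not code from this chapter):

```java
// Sketch: gradient descent on a hypothetical two-weight cost
// C(w1, w2) = w1^2 + w2^2, whose minimum (zero) lies at the origin.
public class GradientDescentSketch {

    static double cost(double w1, double w2) {
        return w1 * w1 + w2 * w2;
    }

    // Partial derivatives of the cost with respect to each weight.
    static double[] gradient(double w1, double w2) {
        return new double[] { 2.0 * w1, 2.0 * w2 };
    }

    // Follow the negative gradient (the downhill direction) for a
    // fixed number of steps with learning rate eta.
    static double[] descend(double w1, double w2, double eta, int steps) {
        for (int k = 0; k < steps; k++) {
            double[] g = gradient(w1, w2);
            w1 -= eta * g[0];
            w2 -= eta * g[1];
        }
        return new double[] { w1, w2 };
    }

    public static void main(String[] args) {
        double[] w = descend(3.0, -4.0, 0.1, 100);
        System.out.printf("w = (%.6f, %.6f), cost = %.6f%n",
                w[0], w[1], cost(w[0], w[1]));
    }
}
```

After 100 steps the weights have slid down the surface to near the origin, where the cost roughly approaches zero, exactly the behavior the figure depicts.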

Learning in progress - weight update

According to the cost function used, an update rule dictates how the weights (the flexible parameters of the neural network) should be changed, so that the cost function takes a lower value at the new weights:

\[
W^{(k+1)} = W^{(k)} + \Delta W^{(k)}
\]

Here, k refers to the kth iteration, W(k) to the neural weights at that iteration, and k+1 to the next iteration.

The weight update operation can be performed in online or batch mode. Online here implies that the weights are updated after every single record from the dataset. Batch update means that first all the records from the dataset are presented to the neural network before it starts updating its weights. This will be explored in detail in the code at the end of this chapter.
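The two modes can be sketched for a single linear neuron trained with the delta rule; the class name, the tiny dataset, and the learning rate below are illustrative assumptions, not the code from the end of this chapter:

```java
import java.util.Arrays;

// Sketch: online vs. batch weight updates for a single linear neuron
// trained with the delta rule on a made-up dataset.
public class UpdateModes {

    // Neuron output for one record: weighted sum of inputs.
    static double output(double[] w, double[] x) {
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) sum += w[i] * x[i];
        return sum;
    }

    // Online mode: the weights change after every single record.
    static void onlineEpoch(double[] w, double[][] X, double[] y, double eta) {
        for (int r = 0; r < X.length; r++) {
            double e = y[r] - output(w, X[r]);
            for (int i = 0; i < w.length; i++) w[i] += eta * e * X[r][i];
        }
    }

    // Batch mode: all records are presented first, the accumulated
    // update is applied once at the end of the epoch.
    static void batchEpoch(double[] w, double[][] X, double[] y, double eta) {
        double[] delta = new double[w.length];
        for (int r = 0; r < X.length; r++) {
            double e = y[r] - output(w, X[r]);
            for (int i = 0; i < w.length; i++) delta[i] += eta * e * X[r][i];
        }
        for (int i = 0; i < w.length; i++) w[i] += delta[i];
    }

    public static void main(String[] args) {
        double[][] X = { {1, 0}, {0, 1}, {1, 1} };
        double[] y = { 1, 2, 3 };   // generated by y = 1*x1 + 2*x2
        double[] w = new double[2];
        for (int epoch = 0; epoch < 200; epoch++) onlineEpoch(w, X, y, 0.1);
        System.out.println(Arrays.toString(w));
    }
}
```

Both modes converge here; they differ in when the update is applied, which affects the path the weights take through the cost surface.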

Calculating the cost function

When a neural network learns, it receives data from an environment and adapts its weights according to the objective. This data is referred to as the training dataset, and it has several samples. The idea behind the word training lies in the process of adapting the neural weights, as if the network were training to give the desired response. While the neural network is still learning, there is an error between the target outputs (Y) and the neural outputs (Ŷ), in the supervised case:

\[
e = Y - \hat{Y}
\]

Tip

Some literature about neural networks identifies the target variable with the letter T and the neural output as Y, while in this book we are going to denote them as Y and Ŷ, respectively, to not confuse the reader, since the target was presented initially as Y.

Well, given that the training dataset has multiple values, there will be N error values, one for each record. So, how do we get an overall error? One intuitive approach is to take the average of all errors, but this is misleading. The error vector can take on both positive and negative values, so an average of all error values is very likely to be close to zero, regardless of how big the individual error measurements may be. Using the absolute value before averaging seems to be a smarter approach, but the absolute-value function has a corner at the origin, which is awkward when calculating its derivative:

\[
\frac{1}{N}\sum_{i=1}^{N} \lvert e_i \rvert
\]

So, the reasonable option we have is to use the average of the squared errors, also known as the mean squared error (MSE):

\[
MSE = \frac{1}{N}\sum_{i=1}^{N} e_i^{\,2}
\]
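The pitfall of the plain average, and how the MSE avoids it, can be checked with a short sketch; the sample error vector below is an assumed illustration:

```java
// Sketch: a plain average of errors misleads, because positive and
// negative errors cancel; the MSE squares them first.
public class ErrorMeasures {

    static double mean(double[] e) {
        double s = 0.0;
        for (double v : e) s += v;
        return s / e.length;
    }

    static double mse(double[] e) {
        double s = 0.0;
        for (double v : e) s += v * v;   // squaring removes the sign
        return s / e.length;
    }

    public static void main(String[] args) {
        // Large errors of opposite sign cancel in a plain average.
        double[] errors = { 5.0, -5.0, 4.0, -4.0 };
        System.out.println(mean(errors));  // 0.0, despite big errors
        System.out.println(mse(errors));   // 20.5
    }
}
```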

General error and overall error

We need to clarify one thing before going further. Since a neural network can be a multiple-output structure, we have to deal with the multiple-output case, in which, instead of an error vector, we have an error matrix:

\[
E = \begin{bmatrix} e_{11} & \cdots & e_{1M} \\ \vdots & \ddots & \vdots \\ e_{N1} & \cdots & e_{NM} \end{bmatrix},
\qquad e_{ij} = y_{ij} - \hat{y}_{ij}
\]

Here, i = 1, …, N indexes the records and j = 1, …, M the outputs.

Well, in such cases, there may be a huge number of errors to work with, whether regarding one specific output, one specific record, or the whole dataset. To facilitate understanding, let's call the error specific to a record the general error, in which all of that record's output errors are summarized in one scalar; and the error referring to the whole dataset the overall error.

The general error for a single-output network is merely the difference between target and output, but in the multiple-output case it needs to be composed from each output error. As we saw, the squared error is a suitable way to summarize error measures, therefore the general error can be calculated using the square of each output error:

\[
e^{(i)} = \sum_{j=1}^{M} e_{ij}^{\,2}
\]

As for the overall error, it considers the general errors of all records in the dataset. Since the dataset can be huge, it is better to calculate the overall error as the average of the (already squared) general errors, in the same spirit as the MSE.
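A minimal sketch of both measures, assuming a two-output network and a two-record dataset of made-up targets and outputs:

```java
// Sketch: general error per record (sum of squared output errors) and
// overall error (average of the general errors over all records).
public class OverallError {

    // General error of one record: square of each output error, summed.
    static double generalError(double[] target, double[] output) {
        double s = 0.0;
        for (int j = 0; j < target.length; j++) {
            double e = target[j] - output[j];
            s += e * e;
        }
        return s;
    }

    // Overall error: average of the general errors over all N records.
    static double overallError(double[][] Y, double[][] Yhat) {
        double s = 0.0;
        for (int r = 0; r < Y.length; r++) s += generalError(Y[r], Yhat[r]);
        return s / Y.length;
    }

    public static void main(String[] args) {
        double[][] Y    = { {1.0, 0.0}, {0.0, 1.0} };  // targets
        double[][] Yhat = { {0.9, 0.2}, {0.1, 0.7} };  // neural outputs
        System.out.println(overallError(Y, Yhat));
    }
}
```

Each row of the two matrices is one record, each column one output, mirroring the error matrix above.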

Can the neural network learn forever? When is it good to stop?

As the learning process runs, the neural network must give results closer and closer to the expectation, until it finally reaches the acceptance criteria or a limit on learning iterations, which we'll call epochs. The learning process is then considered finished when one of these conditions is met:

  • Satisfaction criterion: minimum overall error or minimum weight distance, according to the learning paradigm
  • Maximum number of epochs
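The two conditions above can be combined in a loop like the following sketch; trainOneEpoch() is a stand-in assumption that merely halves the error, in place of a real weight-update pass:

```java
// Sketch: a stop-condition loop combining a minimum overall error
// (satisfaction criterion) with a maximum number of epochs.
public class StopConditions {

    static final double MIN_OVERALL_ERROR = 1e-3;  // satisfaction criterion
    static final int MAX_EPOCHS = 1000;            // hard limit on iterations

    // Stand-in for a real training pass: pretend each epoch halves
    // the overall error.
    static double trainOneEpoch(double previousError) {
        return previousError / 2.0;
    }

    static int train() {
        double overallError = 1.0;
        int epoch = 0;
        // Stop when either condition is met, so learning cannot run forever.
        while (overallError > MIN_OVERALL_ERROR && epoch < MAX_EPOCHS) {
            overallError = trainOneEpoch(overallError);
            epoch++;
        }
        return epoch;
    }

    public static void main(String[] args) {
        System.out.println("stopped after " + train() + " epochs");
        // prints "stopped after 10 epochs"
    }
}
```

The MAX_EPOCHS guard is what prevents the freeze mentioned earlier when the satisfaction criterion is never reached.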