Neural networks' revenge

Because of the vanishing gradient problem, neural networks lost popularity in the field of machine learning. Compared with other typical algorithms such as logistic regression and SVM, the number of real-world data mining cases that actually used neural networks was remarkably small.

But then deep learning showed up and broke all the existing conventions. As you know, deep learning is a neural network with accumulated layers, in other words, a deep neural network, and it achieves astounding predictive performance in certain fields. Nowadays, when we speak of AI research, it is no exaggeration to say that we mean research into deep neural networks. It is surely the counterattack of neural networks. If so, why didn't the vanishing gradient problem matter in deep learning? What is the difference between deep learning and the earlier algorithms?

In this section, we'll look at why deep learning can achieve such predictive performance and how its mechanism works.

Deep learning's evolution – what was the breakthrough?

We can say that two algorithms triggered deep learning's popularity. The first one, as mentioned in Chapter 1, Deep Learning Overview, is DBN, pioneered by Professor Hinton (https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf). The second one is SDA, proposed by Vincent et al. (http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf). SDA was introduced a little after DBN. It also achieved high predictive performance with deep layers by taking a similar approach to DBN, although the details of the algorithm are different.

So, what is the common approach that solved the vanishing gradient problem? Perhaps you are nervously preparing to solve difficult equations in order to understand DBN or SDA, but don't worry: DBN is definitely an understandable algorithm. In fact, the mechanism itself is really simple. Deep learning was established by a very simple and elegant solution: layer-wise training. That's it. You might think it's obvious once you see it, but this is the approach that made deep learning popular.

As mentioned earlier, in theory a neural network with more units or layers should have more expressive power and be able to solve more problems. In practice, it doesn't work well, because the error cannot be fed back to each layer correctly and the parameters of the network as a whole cannot be adjusted properly. This is where the innovation of training each layer individually came in. Because each layer adjusts its weights independently, the whole network (that is, the parameters of the model) can be optimized properly even when many layers are piled up.

Previous models didn't work well because they tried to backpropagate errors from the output layer to the input layer in one go and to optimize the network by adjusting its weights with those backpropagated errors. Once the algorithm shifted to layer-wise training, model optimization went well. That was the breakthrough for deep learning.

However, although we simply say layer-wise training, we need techniques to implement the learning. Also, as a matter of course, the parameters of the whole network can't be adjusted with layer-wise training alone; we need a final adjustment. The phase of layer-wise training is called pre-training and the last adjustment phase is called fine-tuning. We can say that the bigger feature introduced in DBN and SDA is pre-training, but both phases are part of the necessary flow of deep learning. How do we do pre-training? What can be done in fine-tuning? Let's take a look at these questions one by one.

Deep learning with pre-training

Deep learning is, in essence, a neural network with accumulated hidden layers. The layer-wise training in pre-training performs learning at each layer. However, you might still have the following questions: if both layers are hidden (that is, neither of them is the input layer nor the output layer), how is the training done? What are the input and output?

Before thinking about these questions, remind yourself of the following point once again (reiterated persistently): deep learning is a neural network with piled-up layers. This means that the model parameters are still the weights of the network (and the biases). Since these weights (and biases) need to be adjusted between each pair of layers, in the standard three-layered neural network (that is, the input layer, the hidden layer, and the output layer) we only need to optimize the weights between the input layer and the hidden layer and between the hidden layer and the output layer. In deep learning, however, the weights between two hidden layers also need to be adjusted.

First of all, let's think about the input of a layer. You can imagine this easily with a quick thought: the value propagated from the previous layer becomes the input as it is. That value is none other than the value forward propagated from the previous layer to the current layer using the weights of the network, exactly as in a general feed-forward network. It looks simple in writing, but it has an important meaning if you step into it further and try to understand what it implies. The value from the previous layer becomes the input, which means that the features the previous layer(s) learned become the input of the current layer, and from there the current layer newly learns features of the given data. In other words, in deep learning, features are learned from the input data in stages (and semi-automatically). This implies a mechanism where the deeper a layer is, the higher-level the features it learns. This is what normal multi-layer neural networks couldn't do and the reason why it is said that "a machine can learn a concept."
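To make the point concrete, here is a minimal sketch (not taken from DBN or SDA themselves) of how the output of one layer becomes the input of the next during forward propagation; the layer sizes, random weights, and sigmoid activation are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 4 input units and two hidden layers with 3 and 2 units
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # input -> first hidden layer
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # first hidden -> second hidden layer

x = np.array([0.2, 0.8, 0.1, 0.5])   # raw input data

h1 = sigmoid(x @ W1 + b1)   # features learned by the first hidden layer...
h2 = sigmoid(h1 @ W2 + b2)  # ...become the input of the second hidden layer
```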

Now, let's think about the output. Bear in mind that thinking about the output means thinking about how the layer learns. DBN and SDA take completely different approaches to learning, but both satisfy the following condition: they learn so that the output values match the input values. You might think, "What are you talking about?" but this is the technique that makes deep learning possible.

The value propagates from the input layer through the hidden layer and back towards the input layer, and the technique is to adjust the weights of the network to eliminate the error at that point (that is, to make the output value match the input value). The graphical model can be illustrated as follows:

[Figure: Deep learning with pre-training]

It looks different from a standard neural network at a glance, but there's nothing special about it. If we intentionally draw the input layer and the output layer separately, the mechanism has the same shape as a normal neural network:

[Figure: Deep learning with pre-training]

For a human, this action of matching input and output is not intuitive, but for a machine it is a valid one. If so, how can it learn features from the input data by matching the output layer and the input layer?

Need a little explanation? Let's think about it this way: in machine learning algorithms, including neural networks, learning aims to minimize the error between the model's predicted output and the output in the dataset. The mechanism is to remove that error by finding a pattern in the input data and mapping data that shares a common pattern to the same output value (for example, 0 or 1). What would happen, then, if we turned the output value into the input value?

When we look at the problems deep learning is meant to solve as a whole, the input data is fundamentally a dataset that can be divided into some patterns, which means there are common features in the input data. If so, in the process of learning where each output value becomes its respective input data, the weights of the network should be adjusted to focus more on the parts that reflect the common features. And even within data belonging to the same class, learning should proceed so that the weight on the non-common parts, that is, the noise, is reduced.
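The following is a minimal sketch of this reconstruction idea under simplifying assumptions: a single layer with tied weights, sigmoid units, and a squared reconstruction error minimized by plain gradient descent. It is not the actual DBN or SDA update rule (those are covered in the next sections), and the function and variable names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_step(x, W, b_h, b_v, lr=0.1):
    """One gradient step that pushes the layer's reconstruction toward its input.

    Simplified: tied weights, sigmoid units, squared reconstruction error.
    DBN and SDA use different update rules; this only conveys the idea.
    """
    h = sigmoid(x @ W + b_h)         # encode: input -> hidden features
    x_hat = sigmoid(h @ W.T + b_v)   # decode: hidden -> reconstructed input

    # Backpropagate the error 0.5 * ||x_hat - x||^2 through both directions
    delta_v = (x_hat - x) * x_hat * (1 - x_hat)
    delta_h = (delta_v @ W) * h * (1 - h)

    W -= lr * (np.outer(delta_v, h) + np.outer(x, delta_h))
    b_v -= lr * delta_v
    b_h -= lr * delta_h
    return W, b_h, b_v

# Toy usage: repeated steps make the reconstruction resemble the input,
# strengthening the weights on the common pattern and damping the noise.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))
b_h, b_v = np.zeros(3), np.zeros(4)
x = np.array([0.9, 0.1, 0.8, 0.0])
for _ in range(200):
    W, b_h, b_v = reconstruction_step(x, W, b_h, b_v)
```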

Now you should understand what the input and output of a given layer are and how learning progresses. Once the pre-training is done at a certain layer, the network moves on to learning in the next layer. However, as you can see in the following images, please also keep in mind that the hidden layer becomes an input layer when the network moves on to learning in the next layer:

[Figure: Deep learning with pre-training]

The point here is that a layer that has finished pre-training can be treated as a normal feed-forward neural network whose weights are already adjusted. Hence, to obtain the input value of the current layer, we can simply compute the value forward propagated from the input layer to the current layer through the network.
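As a rough sketch (with hypothetical function names and illustrative layer sizes), the greedy layer-wise loop can be written like this: each layer is pre-trained on its own, and its output, obtained by forward propagation, becomes the training data for the next layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(data, n_hidden):
    """Placeholder for any layer-wise learning rule (for example, the
    reconstruction sketch above); returns weights and hidden biases."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(data.shape[1], n_hidden))
    b = np.zeros(n_hidden)
    # ... layer-wise training of W and b on `data` would go here ...
    return W, b

layer_sizes = [784, 256, 64]                 # illustrative: input + two hidden layers
data = np.random.rand(100, layer_sizes[0])   # stand-in for the real dataset

pretrained = []
layer_input = data
for n_hidden in layer_sizes[1:]:
    W, b = pretrain_layer(layer_input, n_hidden)   # train this layer on its own
    pretrained.append((W, b))
    layer_input = sigmoid(layer_input @ W + b)     # this layer's output feeds the next layer
```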

Up to now, we have looked through the flow of pre-training (that is, layer-wise training). In the hidden layers of a deep neural network, features of the input data are extracted in stages through learning in which the output is made to match the input. Now, some of you might be wondering: I understand that features can be learned in stages from the input data by pre-training, but that alone doesn't solve the classification problem. So, how can it solve the classification problem?

Well, during pre-training, the information about which data belongs to which class is not provided. This means that pre-training is unsupervised training: it just analyzes the hidden patterns using only the input data. However well it extracts features, this is meaningless if they can't be used to solve the problem. Therefore, the model needs one more step to solve classification problems properly: fine-tuning. The main roles of fine-tuning are the following:

  1. To add an output layer to the deep neural network that has completed pre-training and to perform supervised training with it.
  2. To perform the final adjustment of the whole deep neural network.

This can be illustrated as follows:

[Figure: Deep learning with pre-training]

The supervised training in the output layer uses a machine learning algorithm such as logistic regression or SVM. Generally, logistic regression is used more often, considering the balance between the amount of calculation and the precision obtained.

In fine-tuning, sometimes only the weights of the output layer are adjusted, but normally the weights of the whole neural network, including the layers whose weights were adjusted in pre-training, are adjusted as well. This means that the standard learning algorithm, in other words the backpropagation algorithm, is applied to the deep neural network as if it were a single multi-layer neural network. Thus, we obtain a neural network model that can solve more complicated classification problems.
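The sketch below illustrates this fine-tuning step under a few assumptions: the hidden layers use sigmoid activations and start from pre-trained weights, the added output layer is a softmax (multi-class logistic regression) layer trained with cross-entropy, and all names are illustrative rather than taken from any particular library:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fine_tune_step(x, target, hidden, output, lr=0.01):
    """One backpropagation step over the whole pre-trained stack.

    `hidden` is a list of (W, b) pairs coming from pre-training; `output` is
    the (W, b) of the added softmax (logistic regression) layer. `target` is
    a one-hot label vector. Updates are made in place. Illustrative sketch.
    """
    # Forward pass: keep every activation for the backward pass
    activations = [x]
    for W, b in hidden:
        activations.append(sigmoid(activations[-1] @ W + b))
    W_out, b_out = output
    y = softmax(activations[-1] @ W_out + b_out)

    # Output layer: softmax + cross-entropy gives the simple delta y - target
    delta = y - target
    grad_W_out, grad_b_out = np.outer(activations[-1], delta), delta

    # Propagate the error back through every pre-trained hidden layer
    delta = (delta @ W_out.T) * activations[-1] * (1 - activations[-1])
    grads = []
    for i in range(len(hidden) - 1, -1, -1):
        W, _ = hidden[i]
        grads.append((np.outer(activations[i], delta), delta))
        if i > 0:
            delta = (delta @ W.T) * activations[i] * (1 - activations[i])

    # Gradient descent on all layers, including the pre-trained ones
    W_out -= lr * grad_W_out
    b_out -= lr * grad_b_out
    for (W, b), (gW, gb) in zip(reversed(hidden), grads):
        W -= lr * gW
        b -= lr * gb

# Toy usage with a 4-3 hidden layer and a 2-class output layer
# (in practice the hidden weights would come from pre-training)
rng = np.random.default_rng(0)
hidden = [(rng.normal(scale=0.1, size=(4, 3)), np.zeros(3))]
output = (rng.normal(scale=0.1, size=(3, 2)), np.zeros(2))
fine_tune_step(np.array([0.2, 0.8, 0.1, 0.5]), np.array([1.0, 0.0]), hidden, output)
```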

Even so, you might have the following questions: why does learning go well with the standard backpropagation algorithm even in a multi-layer neural network with many piled-up layers? Doesn't the vanishing gradient problem occur? These questions are answered by pre-training. Let's think about it as follows: in the first place, the problem was that the weights of each layer were not adjusted correctly because errors were fed back improperly in multi-layer neural networks without pre-training; in other words, these were the networks in which the vanishing gradient problem occurred. Once pre-training is done, on the other hand, learning starts from a point where the weights of the network are almost adjusted already. Therefore, a proper error can be propagated even to the layers close to the input layer. Hence the name fine-tuning. Thus, through pre-training and fine-tuning, a deep neural network finally becomes a network with increased expressive power thanks to its deep layers.

From the next section onwards, we will finally look through the theory and implementation of DBN and SDA, the algorithms of deep learning. But before that, let's look back at the flow of deep learning once again. Below is a summary diagram of the flow:

[Figure: Deep learning with pre-training]

The parameters of the model are optimized layer by layer during pre-training and then adjusted as a single deep neural network during fine-tuning. Deep learning, the breakthrough of AI, is a very simple algorithm.
