RNNs for modeling sequences

In this section, now that we understand sequences, we can look at the foundations of RNNs. We'll start by introducing the typical structure of an RNN, and we'll see how the data flows through it with one or more hidden layers. We'll then examine how the neuron activations are computed in a typical RNN. This will create a context for us to discuss the common challenges in training RNNs, and explore the modern solution to these challenges—LSTM.

Understanding the structure and flow of an RNN

Let's start by introducing the architecture of an RNN. The following figure shows a standard feedforward neural network and an RNN side by side for comparison:

[Figure: a single-hidden-layer feedforward neural network compared with a single-hidden-layer RNN]

Both of these networks have only one hidden layer. In this representation, the units are not displayed, but we assume that the input layer (x), hidden layer (h), and output layer (y) are vectors which contain many units.

Note

This generic RNN architecture could correspond to the two sequence modeling categories where the input is a sequence. Thus, it could be either many-to-many if we consider the output sequence $\boldsymbol{y}^{(t)}$ as the final output, or it could be many-to-one if, for example, we only use the last element of $\boldsymbol{y}^{(t)}$ as the final output.

Later, we will see how the output sequence $\boldsymbol{y}^{(t)}$ can be converted into standard, nonsequential output.

In a standard feedforward network, information flows from the input to the hidden layer, and then from the hidden layer to the output layer. On the other hand, in a recurrent network, the hidden layer gets its input from both the input layer and the hidden layer from the previous time step.

The flow of information in adjacent time steps in the hidden layer allows the network to have a memory of past events. This flow of information is usually displayed as a loop, also known as a recurrent edge in graph notation, which is how this general architecture got its name.

The following figure contrasts two RNN architectures, one with a single hidden layer and one with multiple hidden layers:

[Figure: unfolded representations of a single-layer RNN and a multilayer RNN]

In order to examine the architecture of RNNs and the flow of information, a compact representation with a recurrent edge can be unfolded, which you can see in the preceding figure.

As we know, each hidden unit in a standard neural network receives only one input—the net preactivation associated with the input layer. Now, in contrast, each hidden unit in an RNN receives two distinct sets of input—the preactivation from the input layer and the activation of the same hidden layer from the previous time step t-1.

At the first time step t = 0, the hidden units are initialized to zeros or small random values. Then, at a time step where t > 0, the hidden units get their input from the data point at the current time, $\boldsymbol{x}^{(t)}$, and the previous values of the hidden units at t - 1, indicated as $\boldsymbol{h}^{(t-1)}$.

Similarly, in the case of a multilayer RNN, we can summarize the information flow as follows:

  • Layer 1: Here, the hidden layer is represented as $\boldsymbol{h}_1^{(t)}$ and gets its input from the data point $\boldsymbol{x}^{(t)}$ and the hidden values in the same layer, but at the previous time step, $\boldsymbol{h}_1^{(t-1)}$
  • Layer 2: The second hidden layer, $\boldsymbol{h}_2^{(t)}$, receives its input from the hidden units of the layer below at the current time step ($\boldsymbol{h}_1^{(t)}$) and its own hidden values from the previous time step, $\boldsymbol{h}_2^{(t-1)}$ (see the sketch following this list)
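To illustrate this flow with explicit update equations (the weight names used here are placeholders for illustration only; the notation for the single-layer case is introduced in the next section), the two layers could be written as:

$\boldsymbol{h}_1^{(t)} = \phi\left(\boldsymbol{W}_{xh_1}\boldsymbol{x}^{(t)} + \boldsymbol{W}_{h_1h_1}\boldsymbol{h}_1^{(t-1)} + \boldsymbol{b}_1\right)$

$\boldsymbol{h}_2^{(t)} = \phi\left(\boldsymbol{W}_{h_1h_2}\boldsymbol{h}_1^{(t)} + \boldsymbol{W}_{h_2h_2}\boldsymbol{h}_2^{(t-1)} + \boldsymbol{b}_2\right)$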

Computing activations in an RNN

Now that we understand the structure and general flow of information in an RNN, let's get more specific and compute the actual activations of the hidden layers as well as the output layer. For simplicity, we'll consider just a single hidden layer; however, the same concept applies to multilayer RNNs.

Each directed edge (the connections between boxes) in the representation of an RNN that we just looked at is associated with a weight matrix. Those weights do not depend on the time $t$; therefore, they are shared across the time axis. The different weight matrices in a single-layer RNN are as follows:

  • $\boldsymbol{W}_{xh}$: The weight matrix between the input $\boldsymbol{x}^{(t)}$ and the hidden layer $\boldsymbol{h}$
  • $\boldsymbol{W}_{hh}$: The weight matrix associated with the recurrent edge
  • $\boldsymbol{W}_{hy}$: The weight matrix between the hidden layer and the output layer

You can see these weight matrices in the following figure:

[Figure: a single-layer RNN annotated with the weight matrices $\boldsymbol{W}_{xh}$, $\boldsymbol{W}_{hh}$, and $\boldsymbol{W}_{hy}$]

In certain implementations, you may observe that the weight matrices $\boldsymbol{W}_{xh}$ and $\boldsymbol{W}_{hh}$ are concatenated into a combined matrix $\boldsymbol{W}_h = \left[\boldsymbol{W}_{xh}; \boldsymbol{W}_{hh}\right]$. Later on, we'll make use of this notation as well.

Computing the activations is very similar to standard multilayer perceptrons and other types of feedforward neural networks. For the hidden layer, the net input $\boldsymbol{z}_h^{(t)}$ (preactivation) is computed through a linear combination. That is, we compute the sum of the multiplications of the weight matrices with the corresponding vectors and add the bias unit: $\boldsymbol{z}_h^{(t)} = \boldsymbol{W}_{xh}\boldsymbol{x}^{(t)} + \boldsymbol{W}_{hh}\boldsymbol{h}^{(t-1)} + \boldsymbol{b}_h$. Then, the activations of the hidden units at time step t are calculated as follows:

$\boldsymbol{h}^{(t)} = \phi_h\left(\boldsymbol{z}_h^{(t)}\right) = \phi_h\left(\boldsymbol{W}_{xh}\boldsymbol{x}^{(t)} + \boldsymbol{W}_{hh}\boldsymbol{h}^{(t-1)} + \boldsymbol{b}_h\right)$

Here, $\boldsymbol{b}_h$ is the bias vector for the hidden units and $\phi_h(\cdot)$ is the activation function of the hidden layer.

In case you want to use the concatenated weight matrix $\boldsymbol{W}_h = \left[\boldsymbol{W}_{xh}; \boldsymbol{W}_{hh}\right]$, the formula for computing the hidden units will change as follows:

$\boldsymbol{h}^{(t)} = \phi_h\left(\left[\boldsymbol{W}_{xh}; \boldsymbol{W}_{hh}\right]\begin{bmatrix}\boldsymbol{x}^{(t)} \\ \boldsymbol{h}^{(t-1)}\end{bmatrix} + \boldsymbol{b}_h\right)$

Once the activations of the hidden units at the current time step have been computed, the activations of the output units are computed as follows:

$\boldsymbol{y}^{(t)} = \phi_y\left(\boldsymbol{W}_{hy}\boldsymbol{h}^{(t)} + \boldsymbol{b}_y\right)$

To help clarify this further, the following figure shows the process of computing these activations with both formulations:

[Figure: computing the hidden and output activations with the separate weight matrices $\boldsymbol{W}_{xh}$, $\boldsymbol{W}_{hh}$ and with the concatenated matrix $\boldsymbol{W}_h$]
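To make these computations concrete, here is a minimal NumPy sketch of a single time step (the array sizes and variable names are illustrative, not the book's implementation). It computes $\boldsymbol{h}^{(t)}$ with both the separate and the concatenated weight matrices, verifies that the two formulations agree, and computes the output activation with a softmax as an example of $\phi_y$:

import numpy as np

np.random.seed(1)
n_x, n_h, n_y = 3, 5, 2          # sizes of the input, hidden, and output layers

# weight matrices and bias vectors (shared across all time steps)
W_xh = np.random.randn(n_h, n_x)
W_hh = np.random.randn(n_h, n_h)
W_hy = np.random.randn(n_y, n_h)
b_h = np.zeros(n_h)
b_y = np.zeros(n_y)

x_t = np.random.randn(n_x)       # input at time step t
h_prev = np.zeros(n_h)           # h^(t-1), initialized to zeros at t = 0

# hidden activation with the separate weight matrices
h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# the same computation with the concatenated matrix W_h = [W_xh; W_hh]
W_h = np.hstack([W_xh, W_hh])
h_t_concat = np.tanh(W_h @ np.concatenate([x_t, h_prev]) + b_h)
assert np.allclose(h_t, h_t_concat)

# output activation (here with a softmax output layer as an example)
z_y = W_hy @ h_t + b_y
y_t = np.exp(z_y) / np.sum(np.exp(z_y))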

Note

Training RNNs using BPTT

The learning algorithm for RNNs, backpropagation through time (BPTT), was introduced in 1990 in Backpropagation Through Time: What It Does and How to Do It (Paul Werbos, Proceedings of the IEEE, 78(10): 1550-1560, 1990).

The derivation of the gradients might be a bit complicated, but the basic idea is that the overall loss $L$ is the sum of all the loss functions at times $t = 1$ to $t = T$:

$L = \sum_{t=1}^{T} L^{(t)}$

Since the loss at time $t$ is dependent on the hidden units at all previous time steps $1:t$, the gradient will be computed as follows:

$\frac{\partial L^{(t)}}{\partial \boldsymbol{W}_{hh}} = \frac{\partial L^{(t)}}{\partial \boldsymbol{y}^{(t)}} \cdot \frac{\partial \boldsymbol{y}^{(t)}}{\partial \boldsymbol{h}^{(t)}} \cdot \left(\sum_{k=1}^{t} \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(k)}} \cdot \frac{\partial \boldsymbol{h}^{(k)}}{\partial \boldsymbol{W}_{hh}}\right)$

Here, $\frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(k)}}$ is computed as a multiplication of adjacent time steps:

$\frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial \boldsymbol{h}^{(i)}}{\partial \boldsymbol{h}^{(i-1)}}$
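For example, for $t = 3$ and $k = 1$, this product expands over the two adjacent steps in between:

$\frac{\partial \boldsymbol{h}^{(3)}}{\partial \boldsymbol{h}^{(1)}} = \frac{\partial \boldsymbol{h}^{(3)}}{\partial \boldsymbol{h}^{(2)}} \cdot \frac{\partial \boldsymbol{h}^{(2)}}{\partial \boldsymbol{h}^{(1)}}$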

The challenges of learning long-range interactions

Backpropagation through time, or BPTT, which we briefly mentioned in the previous information box, introduces some new challenges.

Because of the multiplicative factor $\frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(k)}}$ in computing the gradients of the loss function, the so-called vanishing or exploding gradient problem arises. This problem is explained through the examples in the following figure, which shows an RNN with only one hidden unit for simplicity:

[Figure: vanishing and exploding gradients illustrated with an RNN that has a single hidden unit]

Basically, $\frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{h}^{(k)}}$ consists of $t-k$ multiplications; therefore, multiplying the recurrent weight $w$ by itself $t-k$ times results in a factor $w^{t-k}$. As a result, if $|w| < 1$, this factor becomes very small when $t-k$ is large. On the other hand, if $|w| > 1$, then $w^{t-k}$ becomes very large when $t-k$ is large. Note that a large $t-k$ corresponds to long-range dependencies.
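A quick numerical check (a standalone illustration, not taken from the book's code) shows how fast the factor $w^{t-k}$ shrinks or grows over 50 time steps:

import numpy as np

t_minus_k = 50                       # a long-range dependency spanning 50 steps
for w in (0.9, 1.0, 1.1):            # recurrent weights below, at, and above 1
    print(f"w = {w}: w^(t-k) = {w ** t_minus_k:.4g}")
# w = 0.9: w^(t-k) = 0.005154   -> vanishing gradient
# w = 1.0: w^(t-k) = 1
# w = 1.1: w^(t-k) = 117.4      -> exploding gradient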

Intuitively, a naïve solution to avoid vanishing or exploding gradients is to ensure $|w| = 1$. If you are interested and would like to investigate this in more detail, I encourage you to read On the difficulty of training recurrent neural networks by R. Pascanu, T. Mikolov, and Y. Bengio, 2012 (https://arxiv.org/pdf/1211.5063.pdf).

In practice, there are two solutions to this problem:

  • Truncated backpropagation through time (TBPTT)
  • Long short-term memory (LSTM)

TBPTT truncates backpropagation after a fixed number of time steps at each update. While TBPTT can mitigate the exploding gradient problem, the truncation limits the number of steps that the gradient can effectively flow back through and properly update the weights.

On the other hand, LSTM, designed in 1997 by Hochreiter and Schmidhuber, has been more successful in modeling long-range sequences by overcoming the vanishing gradient problem. Let's discuss LSTM in more detail.

LSTM units

LSTMs were first introduced to overcome the vanishing gradient problem (Long Short-Term Memory, S. Hochreiter and J. Schmidhuber, Neural Computation, 9(8): 1735-1780, 1997). The building block of an LSTM is a memory cell, which essentially represents the hidden layer.

In each memory cell, there is a recurrent edge that has the desirable weight $w = 1$, as we discussed previously, to overcome the vanishing and exploding gradient problems. The values associated with this recurrent edge are collectively called the cell state. The unfolded structure of a modern LSTM cell is shown in the following figure:

[Figure: the unfolded structure of a modern LSTM cell]

Notice that the cell state from the previous time step, $\boldsymbol{C}^{(t-1)}$, is modified to obtain the cell state at the current time step, $\boldsymbol{C}^{(t)}$, without being multiplied directly by any weight factor.

The flow of information in this memory cell is controlled by several units of computation that we'll describe here. In the previous figure, $\odot$ refers to the element-wise product (element-wise multiplication) and $\oplus$ refers to element-wise summation (element-wise addition). Furthermore, $\boldsymbol{x}^{(t)}$ refers to the input data at time $t$, and $\boldsymbol{h}^{(t-1)}$ indicates the hidden units at time $t-1$.

Four boxes are indicated with an activation function, either the sigmoid function ($\sigma$) or the hyperbolic tangent (tanh), and a set of weights; these boxes apply a linear combination by performing matrix-vector multiplications on their inputs. The units of computation with sigmoid activation functions, whose outputs are passed through $\odot$, are called gates.

In an LSTM cell, there are three different types of gates, known as the forget gate, the input gate, and the output gate:

  • The forget gate ($\boldsymbol{f}_t$) allows the memory cell to reset the cell state without growing indefinitely. In fact, the forget gate decides which information is allowed to go through and which information to suppress. Now, $\boldsymbol{f}_t$ is computed as follows:
    $\boldsymbol{f}_t = \sigma\left(\boldsymbol{W}_{xf}\boldsymbol{x}^{(t)} + \boldsymbol{W}_{hf}\boldsymbol{h}^{(t-1)} + \boldsymbol{b}_f\right)$
    Note that the forget gate was not part of the original LSTM cell; it was added a few years later to improve the original model (Learning to Forget: Continual Prediction with LSTM, F. Gers, J. Schmidhuber, and F. Cummins, Neural Computation 12, 2451-2471, 2000).
  • The input gate ($\boldsymbol{i}_t$) and input node ($\boldsymbol{g}_t$) are responsible for updating the cell state. They are computed as follows:
    $\boldsymbol{i}_t = \sigma\left(\boldsymbol{W}_{xi}\boldsymbol{x}^{(t)} + \boldsymbol{W}_{hi}\boldsymbol{h}^{(t-1)} + \boldsymbol{b}_i\right)$
    $\boldsymbol{g}_t = \tanh\left(\boldsymbol{W}_{xg}\boldsymbol{x}^{(t)} + \boldsymbol{W}_{hg}\boldsymbol{h}^{(t-1)} + \boldsymbol{b}_g\right)$
    The cell state at time $t$ is computed as follows:
    $\boldsymbol{C}^{(t)} = \left(\boldsymbol{C}^{(t-1)} \odot \boldsymbol{f}_t\right) \oplus \left(\boldsymbol{i}_t \odot \boldsymbol{g}_t\right)$
  • The output gate ($\boldsymbol{o}_t$) decides how to update the values of the hidden units:
    $\boldsymbol{o}_t = \sigma\left(\boldsymbol{W}_{xo}\boldsymbol{x}^{(t)} + \boldsymbol{W}_{ho}\boldsymbol{h}^{(t-1)} + \boldsymbol{b}_o\right)$

Given this, the hidden units at the current time step are computed as follows:

$\boldsymbol{h}^{(t)} = \boldsymbol{o}_t \odot \tanh\left(\boldsymbol{C}^{(t)}\right)$
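To connect these formulas to code, here is a minimal NumPy sketch of a single LSTM time step (the weight and bias names are illustrative placeholders; real implementations usually fuse the four matrix multiplications for efficiency):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step; W and b are dicts of per-gate weights and biases."""
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])   # forget gate
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])   # input gate
    g_t = np.tanh(W['xg'] @ x_t + W['hg'] @ h_prev + b['g'])   # input node
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])   # output gate
    C_t = C_prev * f_t + i_t * g_t        # element-wise product and sum
    h_t = o_t * np.tanh(C_t)              # new hidden units
    return h_t, C_t

# illustrative usage with random parameters
n_x, n_h = 3, 4
rng = np.random.RandomState(1)
W = {k: rng.randn(n_h, n_x if k.startswith('x') else n_h)
     for k in ('xf', 'hf', 'xi', 'hi', 'xg', 'hg', 'xo', 'ho')}
b = {k: np.zeros(n_h) for k in ('f', 'i', 'g', 'o')}
h, C = lstm_step(rng.randn(n_x), np.zeros(n_h), np.zeros(n_h), W, b)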

The structure of an LSTM cell and its underlying computations might seem overly complex. However, the good news is that TensorFlow has already implemented everything in wrapper functions that allow us to define our LSTM cells easily. We'll see the real application of LSTMs in action when we use TensorFlow later in this chapter.
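As a preview (a minimal sketch assuming a TensorFlow version that ships the tf.keras API; the layer sizes here are arbitrary), defining an LSTM layer can be as simple as this:

import tensorflow as tf

# An LSTM layer with 32 hidden units applied to sequences of 8-dimensional inputs;
# return_sequences=True yields one hidden state per time step (many-to-many).
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(None, 8)),
    tf.keras.layers.Dense(1)
])
model.summary()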

Note

We have introduced LSTMs in this section, which provide a basic approach for modeling long-range dependencies in sequences. Yet, it is important to note that there are many variations of LSTMs described in literature (An Empirical Exploration of Recurrent Network Architectures, Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever, Proceedings of ICML, 2342-2350, 2015).

Also worth noting is a more recent approach, called the Gated Recurrent Unit (GRU), which was proposed in 2014. GRUs have a simpler architecture than LSTMs; therefore, they are computationally more efficient, while their performance in some tasks, such as polyphonic music modeling, is comparable to LSTMs. If you are interested in learning more about these modern RNN architectures, refer to the paper Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Junyoung Chung and others, 2014 (https://arxiv.org/pdf/1412.3555v1.pdf).
