Capturing Temporal Relationships in Text

In the previous chapters, we saw how we could leverage Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs) to mine patterns in text and apply them to tasks such as question classification and sarcasm detection in news headlines. With ANNs, the inputs were treated as independent of one another. With CNNs, we went one step further and tried to capture spatial relationships in the inputs by extracting patterns across a set of tokens together. However, our scope was limited to only a few tokens in the vicinity.

Sentences are essentially sequences of words, and the contextual meaning of a particular word in a sentence may not be derived solely from its immediately surrounding words; it can also depend on words much farther away, whether they appear before or after it. In this chapter, we will look at Recurrent Neural Networks (RNNs) and the improvements built on them that help us capture context and temporal relationships in sequences. In addition to discussing basic RNNs, we will also discuss their various use case-based forms and variants.

We will see how the Long Short-Term Memory (LSTM) cell, a memory-based variant of the RNN, helps us solve some issues pertaining to RNNs. An LSTM-based architecture will then be used for a practical text generation use case: generating descriptions of hotels for the city of Mumbai. The same concept can be extended to similar problems such as music generation and lyrics generation, among other things. Finally, we will look at other memory-based variants of the RNN and briefly discuss Gated Recurrent Units (GRUs) and stacked LSTM cells.

The following topics will be covered in this chapter:

  • Baby steps toward understanding RNNs
  • Vanishing and exploding gradients
  • Architectural forms of RNNs
  • Giving memory to our networks—LSTMs
  • Building a text generator using LSTMs
  • Exploring memory-based variants of the RNN architecture

Now that the plot is set up, let's begin!

Technical requirements

The code files for this chapter can be found at the following GitHub link: https://github.com/PacktPublishing/Hands-On-Python-Natural-Language-Processing/tree/master/Chapter10.

Baby steps toward understanding RNNs

Sentences can be thought of as combinations of words, such that words are spoken over time in a sequential manner. It is essential to capture this temporal relationship in natural language data. In many scenarios, the choice of a word is influenced by words that are not necessarily in its immediate neighborhood. Think of the following sentences:

She went on a walk along with her dog.

He went on a walk with his dog.

The sentences are identical except for the words used to identify gender. The usage of the term her or his is directly dependent on the term She or He used toward the beginning of the sentence. With CNNs, we only looked at the immediate proximity of a word. Text data, as we saw in these examples, offers a unique challenge wherein we need to preserve context and have some notion of memory, which can help in making judgments at various points in time. RNNs are the go-to choice in such scenarios as they keep a notion of what happened in the past. Let's dig in and understand their structure in depth.

Every recurrent neuron takes in two inputs: one is the current or external input at that state, and the other is called a hidden state, which is basically an output from the previous state. As you may have noticed, this is in contrast to Feedforward Neural Networks (FNNs), wherein only the current input is taken into account when making a prediction and the inputs are independent of one another.

In an RNN, the output from a time step t depends on the input at time step t and the hidden state from the time step t-1. The following figure shows the structure of an RNN. It shows the input xt going into the network at time step t and producing an output yt for the corresponding time step. The interesting part is the feedback loop, which shows how the hidden state from the previous time step is also provided as input to the recurrent neuron in addition to the input xt, as can be seen in the following figure:

In simple terms, the circle in the middle is basically an FNN, such that at each time step it outputs something based on the input to the network at that time step, along with the hidden state received from the previous time step. Let's look at the unrolled version of this so that things get clearer.

The following figure shows an unrolled version of an RNN wherein each rectangular block containing the circles is the neural network. Two outputs are emitted at every time step, one being the external output and the other being the hidden state, which is fed as input to the subsequent step. The figure shows a many-to-many RNN, which can be used for tasks such as music and lyrics generation, among other things. There can be multiple variations of RNNs, as we will see later. One thing to be careful about is that we should not think of these as n different neural networks. Instead, each of them is a snapshot of the same FNN, with parameters shared across the time steps.

This can be illustrated as shown in the following figure:

While discussing CNNs, we used a window size such that the network tried to find patterns among the word vectors of the tokens in each window by sliding over them. In contrast, with RNNs, we send one token as input to the network at each time step. Let's take the sentence She went on a walk along with her dog to understand this better.

The input to the RNN at time step 0 is the embedding for the word She. At time step 1, the input is the embedding of the word went along with the hidden state output from time step 0. As a result, the contextual information from the word She is captured in the hidden state, and it can be used when working with the word her at a later time step. This is illustrated in the following figure:

The following figure shows how the sentence would be processed by the RNN over time. In figure 2, we showed a many-to-many RNN, whereas, in figure 3, we have portrayed a many-to-one RNN, which takes in multiple inputs in the form of a sequence of words and provides one output at the last time step. An ideal use case for such an RNN would be a text classification problem where multiple tokens are used to predict the class label for a document.

Forward propagation in an RNN

Forward propagation is pretty straightforward in an RNN, whereby an input vector along with a hidden state vector is taken as input at each time step to produce an output that is further used as the hidden state for the next time step. There can be variations in terms of the output layer where the RNN can produce an output at each time step, as we saw in figure 2, or just the last time step, as we witnessed in figure 3.
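To make this concrete, here is a minimal NumPy sketch of forward propagation through time for a many-to-one RNN. The weight names (W_xh, W_hh, W_hy), sizes, and initialization are illustrative assumptions, not the API of any particular library:

import numpy as np

input_dim, hidden_dim, output_dim = 4, 3, 2

# Shared parameters, reused at every time step (toy initialization)
W_xh = np.random.randn(hidden_dim, input_dim) * 0.1   # input -> hidden
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden -> hidden
W_hy = np.random.randn(output_dim, hidden_dim) * 0.1  # hidden -> output
b_h = np.zeros(hidden_dim)

def rnn_forward(inputs):
    # inputs: a list of vectors, one per time step (for example, word embeddings)
    h = np.zeros(hidden_dim)  # initial hidden state
    for x_t in inputs:
        # The new hidden state depends on the current input and the previous hidden state
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return W_hy @ h  # a single output at the last time step (many-to-one)

sequence = [np.random.randn(input_dim) for _ in range(5)]  # a five-token "sentence"
print(rnn_forward(sequence))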

Now that we have understood the basic structure of an RNN, let's next understand how it actually learns by backpropagating results through time in the next section.

Backpropagation through time in an RNN

One of the key concepts to understand in RNNs is the process of backpropagation through time (BPTT). We discussed backpropagation in detail in Chapter 8, From Human Neurons to Artificial Neurons for Text Understanding, where we saw that for each input, there is an output label based on which the algorithm computes the loss or error in a prediction. The error is propagated back through the network, and each parameter is adjusted according to how much it was responsible for the error. There, we had one output for one input. However, as we have discussed, for RNNs each token is an input, and figure 3 shows that we need not have one output per token; we can have a single output for a group of tokens. Also, while forward propagating, we use snapshots of the same network at the various time steps. As a result, parameters are shared across the time steps.

How do we backpropagate in this scenario?

As we discussed, since the parameters are shared across the time steps, the gradient calculated at each time step depends not only on the computations of the present time step but also on those of the previous time steps. Essentially, this can be thought of as the same neurons firing differently at various points in time. At each time step, the network unrolls itself one step further, and, finally, we reach the end state and get our output.

The error calculated at the final step can be sent back along the same path the network used to forward-propagate results across the time steps. We can now see which neuron fired what at each time step, and this can be propagated back through the network in the same way as it is done for normal ANNs. One difference is that, whereas in a normal ANN we move to the previous layer while backpropagating using the chain rule, here we move to the previous time step, since we are thinking of each unrolled version of the network as a different network altogether. While going back in time, we make use of the chain rule to do the math: at each step, the gradient with respect to the more recent time step is calculated, and all these gradients and weight corrections across the time steps are aggregated.

Since the weights are shared across time steps, we cannot apply the changes to the weights at each individual time step, because the same weights produced different outputs for the changing inputs across time. Instead, we backpropagate from the final time step to the initial time step, keeping track of the weight corrections at each time step, and, in the end, apply these aggregated changes all at once to the shared weights in our network, as illustrated in the following figure:

Why did we sum up the weight corrections at each time step and apply them all at once instead of making the corrections at each time step?

This is because, during the forward pass at each time step for an input, the weights were the same. If we computed the gradient at time step t and applied the changes to the weights there and then, the weights at time step t-1 would be different and the error calculation would be wrong, since, during the forward pass, we had the same weights at every time step. If we had updated the weights at each time step, we would simply have penalized the weights for something they did not do at all.
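Continuing the NumPy sketch from the Forward propagation in an RNN section, here is a minimal, illustrative version of BPTT for a many-to-one RNN with a squared-error loss. It reuses the hypothetical W_xh, W_hh, W_hy, and b_h arrays defined there; note how the weight corrections are accumulated across all time steps and only then applied to the shared weights:

def bptt_step(inputs, target, lr=0.01):
    # Forward pass, storing the hidden state at every time step
    hs = [np.zeros(hidden_dim)]
    for x_t in inputs:
        hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1] + b_h))
    y = W_hy @ hs[-1]

    # Backward pass: accumulate the weight corrections across the time steps...
    dy = y - target                      # gradient of the loss 0.5 * ||y - target||^2
    dW_hy = np.outer(dy, hs[-1])
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    db_h = np.zeros_like(b_h)
    dh = W_hy.T @ dy
    for t in reversed(range(len(inputs))):
        dz = dh * (1 - hs[t + 1] ** 2)   # backpropagate through tanh
        dW_xh += np.outer(dz, inputs[t])
        dW_hh += np.outer(dz, hs[t])
        db_h += dz
        dh = W_hh.T @ dz                 # hand the gradient to the previous time step

    # ...and only then apply the aggregated corrections to the shared weights, all at once
    for param, grad in ((W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy), (b_h, db_h)):
        param -= lr * grad
    return 0.5 * np.sum((y - target) ** 2)

Calling bptt_step repeatedly on (sequence, target) pairs would train this toy RNN by plain gradient descent.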

Sequences need not always be at the word level. Characters can be used as input sequences as well.

We saw how RNNs can help us perform a better analysis of sequential data and capture relationships over time, making them ideally suited to data with temporal dependencies, such as the usage of words in a sentence or time-series data. However, everything comes at a price, and, for RNNs, the problem is related to vanishing or exploding gradients. Let's understand these in the next section.

Keras provides a SimpleRNN wrapper application programming interface (API) layer that helps us in building RNNs.
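As a quick illustration, here is a minimal sketch of a many-to-one classifier built with the SimpleRNN layer; the vocabulary size, embedding size, sequence length, and unit counts are arbitrary placeholder values:

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32, input_length=100))  # token embeddings
model.add(SimpleRNN(64))                    # 64 recurrent units; returns the last hidden state
model.add(Dense(1, activation='sigmoid'))   # a single output, for example, for binary classification
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()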

Vanishing and exploding gradients

Gradients help us to update weights in the right direction and at the right amount. What if these values become too high or too low?

The weights would not be updated correctly, the network would become unstable, and, consequently, our training of the network as a whole would fail.

The problem of vanishing and exploding gradients is seen predominantly in neural networks with a large number of hidden layers. When backpropagating in such networks, the gradients can become too large or too small as they are propagated back through the layers, leading to instability in the weight updates.

The exploding gradient problem occurs when large error gradients pile up and cause huge updates to the weights in our network. On the other hand, when the values of these gradients are too small, they effectively prevent the weights from getting updated in a network. This is called the vanishing gradient problem. Vanishing gradients can lead to the stopping of training altogether since the weights would not get updated.

We discussed that vanishing and exploding gradients can be troublesome when training neural networks with a lot of hidden layers. Now, imagine training an RNN, wherein going back one time step is like backpropagating the error to the previous layer in an ANN. ANNs are generally only a few layers deep, whereas RNNs can easily process sequences of length greater than 100. As the error flows back in time, the gradients can take on vanishingly small or extremely large values. The weight corrections for the time steps in the past diminish when we encounter a vanishing gradient problem; this makes it seem as if the inputs at those time steps had no effect on the output at all. When we encounter an exploding gradient problem, the gradients at the past time steps, or for the initial inputs to the RNN, can be very large and may lead to huge weight updates, causing instability in the model.
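A simple back-of-the-envelope calculation shows how quickly this happens when a gradient is repeatedly scaled by a factor slightly below or slightly above 1 over 100 time steps:

steps = 100
print(0.9 ** steps)   # roughly 2.7e-05: the gradient effectively vanishes
print(1.1 ** steps)   # roughly 13780.6: the gradient explodes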

One technique for preventing the exploding gradient problem is called gradient clipping. As part of gradient clipping, the gradient is capped at a maximum value.
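In Keras, for example, clipping can be enabled directly on the optimizer; the threshold below is an illustrative choice, not a recommended setting:

from keras.optimizers import Adam

# Clip the norm of the gradient vector at 1.0; clipvalue can be used to cap each element instead
clipped_adam = Adam(clipnorm=1.0)
# Pass the clipped optimizer when compiling any Keras model, for example:
model.compile(loss='binary_crossentropy', optimizer=clipped_adam)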

Vanishing and exploding gradients are very common problems in RNNs, and there are ways to counter them, as we will see when we discuss LSTMs.

Architectural forms of RNNs

In this section, we will begin by taking a look into what forms an RNN can take, depending on the application it is being built for. After that, we will dive into bidirectional RNNs, and, finally, we'll end this section by looking into how RNNs can be stacked to build deep RNNs.

Different flavors of RNN

RNNs can take multiple forms, depending on the type of use case they are applied to. Let's see the various forms an RNN can take, as follows:

  • One-to-one: This is the simplest form of RNN and is very similar to a traditional neural network, wherein the RNN takes in a single input and provides a single output. An example of a one-to-one RNN is shown in the following figure:

  • One-to-many: In a one-to-many RNN, the network takes in only one input and produces multiple outputs. Such an RNN is used for solving problems such as music generation, wherein music is generated on the input of a single musical note. An example of a one-to-many RNN is shown in the following figure:

  • Many-to-one: As the name suggests, this form of RNN takes in multiple inputs and produces one output. This can be used in applications such as sentiment analysis, wherein multiple words are fed into the network as input to produce an output depicting the sentiment of the input sentence. An example of a many-to-one RNN is shown in the following figure:

  • Many-to-many: These RNNs take in multiple inputs and produce multiple outputs. These RNNs can take two forms, depending on whether the size of the input is equal or not to the size of the output. Let's discuss the two forms depending on the variation in the sizes of the input and output, as follows:
  • Tx = Ty: This is the many-to-many form in which the size of the input is equal to the size of the output. A common use case for this is named entity recognition, where we try to classify each input token into entity groups such as person names, locations, organizations, and so on. An example of such an RNN is shown in the following figure:

  • Tx != Ty: In this form, the size of the input is not equal to the size of the output. Machine translation, where we try to convert text from one language to another, is an example of such an RNN. Think of the string Goodbye in English. We need to convert it into German so as to produce the output, Auf Wiedersehen. The input is of size 1, whereas the output is of size 2. Essentially, these RNNs can produce an output string greater than or less than the size of the input string. An example of such an RNN is shown in the following figure:

We have learned about the various flavors that an RNN can take based on the application. In the next section, let's see whether RNNs can use information from the beginning as well as the end of some input data.

Carrying relationships both ways using bidirectional RNNs

The RNNs we have discussed so far carry relationships from the beginning to the end using a hidden state.

Is that all we need?

Let's look at the following two sentences:

The boy named Harry became the greatest wizard.

The boy named Harry became a Duke: the Duke of Sussex.

The first sentence talks about the fictional character Harry Potter created by the author J.K. Rowling, whereas the second sentence talks about Prince Harry from the United Kingdom. Up to and including the word Harry, both sentences are exactly the same: The boy named Harry. Using a simple RNN, we cannot infer much about Harry from the words before its occurrence. Only once we see the latter half of the sentence do we know who is being talked about: the wizard or the prince. It would be good if an RNN architecture could also carry information from the end of the sentence to help us make inferences at a given point in time. Bidirectional RNNs help us in this situation.

Bidirectional RNNs are essentially two independent RNNs such that one of them processes the inputs in the forward time order, whereas the other processes the inputs in the reverse time order. The outputs of these two networks are concatenated at every time step. This formation allows the network to have information from both directions at every time step. An example of such an RNN is shown in the following figure:

Bidirectional RNNs can be built by wrapping the SimpleRNN API from Keras into the bidirectional wrapper offered by Keras.
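A minimal sketch of such a bidirectional model might look like the following; the layer sizes and sequence length are placeholder values:

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Bidirectional, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32, input_length=100))
# Two SimpleRNNs process the sequence in forward and reverse order; their outputs are combined
model.add(Bidirectional(SimpleRNN(64)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')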

Before we begin discussing LSTMs, let's briefly talk about deep RNNs.

Going deep with RNNs

At times, it becomes essential to capture complex relationships in text that are difficult to capture using a standard RNN. In such scenarios, we resort to stacking RNNs in order to capture these complex relationships. The following figure shows what a deep RNN looks like. The deep RNN shown has three hidden layers. The layers above the first one do not receive the input directly; instead, they compute their activations using the output of the previous hidden layer at that time step and their own output from the previous time step. Standard RNNs can be computationally expensive because of the notion of time steps. Deep RNNs take that one step further by stacking these RNNs on top of each other, and a deep RNN with three hidden layers can itself be highly expensive to compute. Also, instead of taking the outputs (y<1>, y<2>, …, y<n>) directly from the RNN cells, the RNN cell outputs can be fed into FNNs or other neural networks, which then produce the final outputs.
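A deep RNN can be sketched in Keras by stacking recurrent layers, where every layer except the last must return its full sequence of hidden states so that the layer above it receives an input at each time step; the sizes here are placeholder values:

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32, input_length=100))
model.add(SimpleRNN(64, return_sequences=True))   # hidden layer 1: outputs at every time step
model.add(SimpleRNN(64, return_sequences=True))   # hidden layer 2
model.add(SimpleRNN(64))                          # hidden layer 3: last hidden state only
model.add(Dense(1, activation='sigmoid'))         # an FNN on top of the RNN cell outputs
model.compile(loss='binary_crossentropy', optimizer='adam')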

An example of a deep RNN is shown in the following figure:

In this section, we looked at what an RNN is and how it uniquely helps capture sequential information and temporal relationships by combining previous outputs with present inputs. We looked at the various forms of RNN and also explored bidirectional RNNs, which help carry information from both directions. The major problem associated with RNNs is that they struggle to capture and make sense of long-term dependencies. Vanishing and exploding gradients can take a huge toll on the performance of such networks, as we discussed. In the next section, we will look into LSTMs, which help in overcoming the vanishing and exploding gradient problems by providing memory to our networks. Let's begin, then!

Giving memory to our networks – LSTMs

If the word in the eighth position in a sentence has a causal relationship with the word in the first position, it becomes essential to remember this relationship when we reach the eighth position. However, RNNs are poor at capturing such long-term dependencies because of the vanishing gradient problem. Along with remembering, we also need to understand what should be remembered from the past and what should be forgotten. An LSTM cell helps us with exactly this. LSTM cells help in remembering by using structures called gates, which keep the necessary information in memory as long as it is required.

LSTM cells use the concept of state, or memory, to retain long-term dependencies. At every step, the cell decides what to keep in memory and what to discard. All this is done using gates. Let's look at the workings of an LSTM cell in detail (this is shown in the following figure).

Understanding an LSTM cell

The input to an LSTM cell, as with RNNs, is a concatenation of the input for that time step and the output of the previous time step. These values are passed on to the gates in the LSTM cell, each of which is nothing but an FNN with some form of activation function. These gates are referred to as the forget gate, input gate, and output gate. The neural networks in each of these gates get trained and allow the signal to flow through them into the memory in different amounts. They decide what information should be remembered, forgotten, or discarded at each step. An example of an LSTM cell is shown in the following figure:

Let's look at the workings of each of these gates individually.

Forget gate

The first juncture in an LSTM cell is the forget gate. The concatenated vector from the present state's input along with the previous state's output goes to the forget gate first. The forget gate's job is to decide how much of the information should be removed from memory.

Hey, hold on a second!

We wanted to remember things using LSTMs, and we are suddenly discarding things from memory.

Yes! That is absolutely right. It is as important to understand what should be forgotten as it is to understand what should be remembered. Think of the following example:

Leonardo is a good actor. He won at the Oscars. Brad is a good actor too.

Initially, our cell should remember that Leonardo is being talked about. However, as soon as we arrive at the third sentence, it should now remember that Brad is being talked about and should discard the information about Leonardo from its memory. Basically, our network should have the ability to forget long-term dependencies as soon as new dependencies worth remembering arrive in our data. Forget gates help us exactly with this by making space for new dependencies.

The forget gate is an FNN, as we mentioned, and the activation function applied here is sigmoid, which brings the output between 0 and 1, helping us figure out how much of the information must be forgotten. An output of 0 from this gate would indicate that we should forget everything from the past. On the contrary, an output of 1 indicates that the memory state should be retained.

The values from the forget gate are multiplied with the values in the memory cell in order to maintain only relevant information from the past.

We understood why forgetting is important and how the forget gate helps us with it. Now that we have understood how to forget, let's try to understand how to remember next.

Input gate

We should next understand what we need to remember and how much of it should be remembered. This is exactly what the input gate does for us.

Think of the following example:

Ronaldo is a good football player. Messi is another good player.

As soon as we arrive at the second sentence, the forget gate will help us forget about Ronaldo, but it is the job of the input gate to ensure that we now remember about Messi.

The input gate has two parts, which simultaneously help in figuring out what is to be remembered and how much of it needs to be remembered. Let's understand the functioning of the two parts next.

Part 1 of the input gate uses a sigmoid activation function to pinpoint which parts of the input values need to be remembered, by creating a sort of mask with values between 0 and 1. A value of 0 indicates that nothing from the inputs of this state is worth remembering, whereas a value of 1 indicates that everything from this input state must be remembered.

Part 2 uses a tanh activation function to help us figure out what is potentially the relevant information from the present state that the memory cell can get updated with. This part is also often referred to as the candidate vector since this vector holds the values that the memory cell might get updated with. The output ranges between -1 and 1 from this FNN.

An element-wise multiplication is performed between the outputs from part 1 and part 2. Essentially, what we did is we understood how relevant various components of part 2 are based on the values from part 1. The resultant output is added to the memory vector, thus updating the information in the memory cell.

Now that we have understood how to forget and remember, let's understand how to output next.

Output gate

The job of the output gate is to understand which bits of information in the current step should be sent across as output from the cell. With the forget gate and input gate, we always update our memory cell, but with the output gate, we will make use of our updated memory to see what information should be sent across as output from this LSTM cell.

There are two things that happen at this stage in the LSTM cell, as follows:

  1. First, the output gate receives the input that was received by the LSTM cell initially, and these inputs are applied to the FNN in the output gate. Thereafter, the sigmoid activation function is applied to the computed values to bring the output in the range of 0 to 1.
  2. Second, the memory at this juncture is already updated based on what should have been forgotten and what should have been remembered from the computations performed at the forget gate and input gate stages. This memory state is now passed through a tanh activation function at this stage to bring the values between -1 and 1.

Finally, the tanh-applied values from memory along with the sigmoid-applied values from the output gate are multiplied element-wise to get the final output from this LSTM cell in the network. This value can be taken as output and can also be sent across as the hidden state for the next LSTM time step.

Thus, we have sent across an output at this time step and also put forward the hidden state, which can be sent across to the next time step.
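Putting the three gates together, the computation inside a single LSTM cell at one time step can be sketched in NumPy roughly as follows. The weight and bias names and the sigmoid helper are illustrative assumptions; real implementations, such as the LSTM layer in Keras, handle all of this internally:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    # Concatenate the previous hidden state with the current input
    z = np.concatenate([h_prev, x_t])

    f = sigmoid(W_f @ z + b_f)          # forget gate: how much of the old memory to keep
    i = sigmoid(W_i @ z + b_i)          # input gate (part 1): which new values to write
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate vector (part 2): potential new content
    o = sigmoid(W_o @ z + b_o)          # output gate: what to expose from memory

    c = f * c_prev + i * c_tilde        # update the memory (cell state)
    h = o * np.tanh(c)                  # new hidden state, the output of the cell
    return h, c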

Backpropagation through time in LSTMs

Backpropagation in LSTMs works similarly to that in RNNs. However, unlike with RNNs, we don't encounter the problem of vanishing or exploding gradients to the same extent, wherein the gradients become exceedingly small or large. This is primarily because of the memory component we introduced in LSTMs. The weights in the neural networks of each of the gates are used to update the memory cell, and during backpropagation they are updated using the derivatives of the functions applied to the memory cell during the forward pass. Consequently, the updates to these weights depend mainly on the state of the memory at the previous and present time steps, which gives the gradient a much more direct path back through time.

We have had enough of theory. Now, let's try to solve an interesting problem of text generation using LSTMs.

Building a text generator using LSTMs

Text generation is a unique problem wherein, given some data, we should be able to predict the next occurring data. Good examples of where text generation is required include predicting the next word on our mobile phone keyboards, and generating stories, music, and lyrics, among other things. Let's try to build a model that can generate text describing hotels in the city of Mumbai, as follows:

  1. We will begin by importing the various libraries we will be using during the course of solving this problem, as follows:
import nltk
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import re
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Embedding
  2. Now that we have loaded our libraries, let's load our dataset. For this exercise, we will use the Hotels on MakeMyTrip dataset, obtained from https://data.world/promptcloud/hotels-on-makemytrip-com. Run the following code:
data = pd.read_csv('Dataset/hotel_data.csv')
  3. Let's try to see how our data looks, using the head() command offered by the pandas library, as follows:
data.head(5)

Here are a few rows of our data:

  4. Let's see information on how many hotels per city are available in our dataset, using the following command:
data.city.value_counts()

Here's the output:

NewDelhiAndNCR        1163
Goa                   1122
Mumbai                 543
Jaipur                 534
Bangalore              512
                      ... 
Gajraula                 1
Chamba Uttaranchal       1
Krishnanagar             1
Nagarholae               1
Bijapur                  1
Name: city, Length: 770, dtype: int64

A substantial amount of hotel data from the city of Mumbai in India is available in this dataset. Let's concentrate on generating descriptions for Mumbai hotels.

  5. As discussed, let's focus on data for Mumbai, as follows:
array = ['Mumbai']
data = data.loc[data['city'].isin(array)]
You can add in more cities to your array in order to use data from them as well.
  6. Let's see whether we were able to filter out data for Mumbai, as follows:
data.head(5)

Here's the output—we were able to filter out data for Mumbai, as illustrated in the following figure:

  7. Since we are interested in generating hotel descriptions, we will only keep the hotel_overview column, as the other columns will not be required in our analysis. We will also follow that up by removing descriptions that are empty. The following code block helps us with this:
data = data.hotel_overview
data = data.dropna()
  8. We now need to preprocess our data. As part of preprocessing, we will convert the text to lowercase, remove stopwords, and keep only alphabetic data. Also, we will not keep single-character words. The following code block sets up the stopword removal part:
stop = set(stopwords.words('english'))

def stopwords_removal(data_point):
    data = [x for x in data_point.split() if x not in stop]
    return data
  9. Here's our method for overall data cleansing:
def clean_data(data):
    cleaned_data = []
    all_unique_words_in_each_description = []
    for entry in data:
        entry = re.sub(pattern='[^a-zA-Z]', repl=' ', string=entry)  # keep only alphabetic characters
        entry = re.sub(r'\b\w\b', repl=' ', string=entry)            # drop single-character words
        entry = entry.lower()
        entry = stopwords_removal(entry)
        cleaned_data.append(entry)
        unique = list(set(entry))
        all_unique_words_in_each_description.extend(unique)
    return cleaned_data, all_unique_words_in_each_description
  10. Let's figure out the unique words in our data. This will basically be our vocabulary. We can do this using the following code block:
def unique_words(data):
    unique_words = set(data)
    return unique_words, len(unique_words)
  11. Apply the cleansing and unique word-finding methods we described on our data, as follows:
cleaned_data, all_unique_words_in_each_description = clean_data(data)
unique_words, length_of_unique_words = unique_words(all_unique_words_in_each_description)

We now have the following outcome:

  • The cleaned_data parameter contains our preprocessed data.
  • The unique_words parameter contains our list of unique words.
  • The length_of_unique_words parameter is the number of unique words in the data.
  12. Let's look at one cleaned entry from our dataset and also figure out the number of unique words, as follows:
cleaned_data[0]

Here's a cleaned output block:

['nestled',
 'mumbai',
 'city',
 'strong',
 'historical',
 'links',
 'wonderful',
 'british',
 'architecture',
 'museums',
 'beaches',
 'places',...
  13. Now, let's see the total number of unique words we have, as follows:
length_of_unique_words

Here is the number of unique words in our data:

3395
  14. Next, we need to build a mapping of words to an index and a reverse mapping from an index to a word, which will help us look up a word given its index and vice versa, as follows:
def build_indices(unique_words):
    word_to_idx = {}
    idx_to_word = {}
    for i, word in enumerate(unique_words):
        word_to_idx[word] = i
        idx_to_word[i] = word
    return word_to_idx, idx_to_word
  15. Now, let's build our indices using the following code block, which calls the method defined in the previous code block:
word_to_idx, idx_to_word = build_indices(unique_words)
  16. The next step is to prepare our training corpus. As part of this, let's see what we aim to do, given the following excerpt from a sentence:
nestled mumbai city

The sequences of training data we generate from this three-word sentence would be the following:

  • nestled, mumbai
  • nestled, mumbai, city
  17. We have essentially generated continuous sequences of a size greater than 1 from the sentence. This is followed by converting the words into their index values, which we built in the previous step, as follows:
def prepare_corpus(corpus, word_to_idx):
    sequences = []
    for line in corpus:
        tokens = line
        for i in range(1, len(tokens)):
            i_gram_sequence = tokens[:i+1]
            i_gram_sequence_ids = []
            for j, token in enumerate(i_gram_sequence):
                i_gram_sequence_ids.append(word_to_idx[token])
            sequences.append(i_gram_sequence_ids)
    return sequences
  18. Let's call the defined prepare_corpus method next, as follows:
sequences = prepare_corpus(cleaned_data, word_to_idx)
max_sequence_len = max([len(x) for x in sequences])

Here, we have the following outcome:

  • The sequences parameter contains all the sequences from our data.
  • The max_sequence_len parameter conveys the length of the maximum sequence size that was built based on our data.
  19. Let's validate what we built just now, as follows:
print(sequences[0])
print(sequences[1])

We get the following output:

[1647, 867]
[1647, 867, 1452]
  20. Let's see which words are mapped to these indices, using the following code block:
print(idx_to_word[1647])
print(idx_to_word[867])
print(idx_to_word[1452])

Here's the output:

nestled
mumbai
city

So, we have correctly built our sequences.

  21. Next, let's figure out some metadata about the sequences built, as follows:
len(sequences)

The total number of sequences we have is the following:

51836

Now, we will see the size of the longest sequence we have, as follows:

max_sequence_len

Here's the output:

308

Now that we have built our sequences, how do we use those to build a text generator?

Let's answer that in this step. What we will do is try and predict the last entry in our sequence, using the rest of the entries from the sequence.

The last entry in each sequence we generated becomes our class or dependent variable, and the entries prior to that become our independent variables. We will build a model that can predict one single value based on an input sequence of some length.

Let's see our example again.

The first sequence was this:

  • nestled, mumbai

Here, we would have nestled as our independent variable and mumbai as our dependent variable.

Similarly, for the second sequence, we have the following:

  • nestled, mumbai, city

nestled, mumbai forms our independent variable or X, and city is our dependent variable or Y.

Also, since our input size should be consistent for all training samples, we will pad our data to make this the same size. The size of each training sample after padding would be equal to the size of the longest sequence, which we captured in the max_sequence_len parameter in step 18. Here's the code for splitting our data into independent and dependent variables and also for padding the input samples:

  22. Define build_input_data, as follows:
def build_input_data(sequences, max_sequence_len, length_of_unique_words):
    sequences = np.array(pad_sequences(sequences, maxlen=max_sequence_len, padding='pre'))
    X = sequences[:, :-1]
    y = sequences[:, -1]
    y = np_utils.to_categorical(y, length_of_unique_words)
    return X, y
  23. Let's call our build_input_data method defined in the previous code block next, as follows:
X, y = build_input_data(sequences, max_sequence_len, length_of_unique_words)
  24. Now, we are ready with our data, so let's go ahead and define and build our model next, as follows:
def create_model(max_sequence_len, length_of_unique_words):
    model = Sequential()
    model.add(Embedding(length_of_unique_words, 10, input_length=max_sequence_len - 1))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    model.add(Dense(length_of_unique_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
  25. Let's bring our model into existence, using the following code block:
model = create_model(max_sequence_len, length_of_unique_words)
model.summary()

Here's the summary of our model:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 307, 10)           33950     
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               71168     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 3395)              437955    
=================================================================
Total params: 543,073
Trainable params: 543,073
Non-trainable params: 0

Here are the components in our model:

  • The Embedding layer, which provides us with embeddings for each training sample in our data. Its parameters are as follows:
  • length_of_unique_words tells the model the size of our vocabulary.
  • 10 indicates that we want a dense embedding of size 10 as the output of this layer.
  • input_length=max_sequence_len - 1 indicates that each training sample fed to the layer has a size of max_sequence_len - 1.
  • The Embedding layer is followed by the LSTM layer, where we set 128 as the dimensionality of the LSTM's hidden units.
  • Next, we randomly drop out 20% of the neurons in the network using the Dropout layer.
  • Then, using the Dense layer from Keras, we define our output layer, where the number of neurons is equal to the size of our vocabulary, length_of_unique_words. We have translated this problem into a multi-class classification problem, so the softmax activation function is used.
  • Finally, we calculate our loss using categorical_crossentropy and use adam for optimization.

All these values and techniques are hyperparameters that can be tuned to obtain other results. We can, in fact, try adding more LSTM layers or adding more units to each layer, among other methods.

  26. Next, we will train our model, as follows:
model.fit(X, y, batch_size = 512, epochs=100)

We have used the following:

  • A batch size of 512
  • A number of epochs of 100

These are, again, hyperparameters that can be tuned. The training progresses as follows:

Epoch 1/100
51836/51836 [==============================] - 157s 3ms/step - loss: 6.9315
Epoch 2/100
51836/51836 [==============================] - 152s 3ms/step - loss: 6.5816
Epoch 3/100
51836/51836 [==============================] - 156s 3ms/step - loss: 6.5273
Epoch 4/100
51836/51836 [==============================] - 159s 3ms/step - loss: 6.4325
Epoch 5/100
51836/51836 [==============================] - 157s 3ms/step - loss: 6.2997
Epoch 6/100
51836/51836 [==============================] - 157s 3ms/step - loss: 6.2009

This can be viewed in its entirety in the code files of this book.

Now that we have trained our model, let's put it to the test and see how it works.

The following code block helps us to generate the next_words number of words based on the input we provide to the method:

def generate_text(seed_text, next_words, model, max_seq_len):
    for _ in range(next_words):
        # Clean the seed text and convert its tokens into index sequences
        cleaned_data = clean_data([seed_text])
        sequences = prepare_corpus(cleaned_data[0], word_to_idx)
        # Pad the latest sequence and predict the index of the next word
        sequences = pad_sequences([sequences[-1]], maxlen=max_seq_len - 1, padding='pre')
        predicted = model.predict_classes(sequences, verbose=0)
        output_word = idx_to_word[predicted[0]]
        # Append the predicted word and feed the extended text back in
        seed_text = seed_text + " " + output_word
    return seed_text.title()

Let's try the method we defined to generate some text, as follows:

print(generate_text("in Mumbai there we need", 30, model, max_sequence_len))

Here's our generated text:

In Mumbai There We Need Located Mumbai City Mumbai Charismatic Electrifying Open Hearted Mumbai Bombay City Dreamers Stalwarts Common Man Guests Visit Majestic Places Like Gateway India Chhatrapati Shivaji International Airport Km Chhatrapati Shivaji International

Let's try for another input, as follows:

print(generate_text("The beauty of the city", 30, model, max_sequence_len))

Here's our generated text:

The Beauty Of The City World Pilgrimage Employment Opportunities Park Km Chhatrapati Shivaji International Airport Km Chhatrapati Shivaji International Airport Km Vile Parle Railway Station Km Kamgar Hospital Bus Stand Prominent Tourist Spots Like Tikuji

We can see that the generated text captures a lot of meaningful information and is in line with the initial text we provided it as input. It does a decent job.

Hyperparameter tuning, along with building more complex and larger models, can help in generating better results.

Now that we have generated some beautiful text using LSTMs, let's go ahead and look at some other memory-based variants built on the foundation of RNNs.

Exploring memory-based variants of the RNN architecture

Before we close this chapter, we will briefly look at GRUs and stacked LSTMs.

GRUs

As we saw, LSTMs are huge networks with a lot of parameters. Consequently, we need to update a lot of parameters, which is highly computationally expensive. Can we do better?

Yes! GRUs can help us with it.

GRUs use only two gates instead of the three we used in LSTMs. They combine the forget gate and the input gate into a single gate, called the update gate. The other gate is the reset gate, which decides how much of the previous memory to use when computing the new candidate information. Based on the outputs of these two gates, it is decided what to send across as the output from the cell and how the hidden state is to be updated. This is done using something called a candidate (or content) state, which holds the new information. As a result, the number of parameters in the network is drastically reduced.
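In Keras, a GRU layer can be used as a drop-in replacement for the LSTM layer in the text generator we built earlier. Here is an illustrative sketch, reusing the max_sequence_len and length_of_unique_words variables from that exercise:

from keras.models import Sequential
from keras.layers import Embedding, GRU, Dropout, Dense

def create_gru_model(max_sequence_len, length_of_unique_words):
    model = Sequential()
    model.add(Embedding(length_of_unique_words, 10, input_length=max_sequence_len - 1))
    model.add(GRU(128))   # fewer parameters than an LSTM layer with the same number of units
    model.add(Dropout(0.2))
    model.add(Dense(length_of_unique_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model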

You can read more about GRUs here: https://en.wikipedia.org/wiki/Gated_recurrent_unit.

Stacked LSTMs

Stacked LSTMs follow an architecture similar to deep RNNs, which we discussed earlier in this chapter. During the discussion on deep RNNs, we mentioned that stacking RNN layers one above the other helps the network capture highly complex patterns and relationships. The same idea is used when building stacked LSTMs, which can help us capture highly complex patterns from data. Each LSTM layer in a stacked LSTM model has its own gates and memory vector.

We saw that LSTMs can be highly computationally expensive because of the huge number of parameters involved. Stacked LSTMs take that further, as the number of parameters grows with the number of LSTM layers involved. Hence, stacked LSTMs are very expensive in terms of computational requirements.
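As with deep RNNs, a stacked LSTM can be sketched in Keras by setting return_sequences=True on every LSTM layer except the last, so that each layer passes its full sequence of outputs up to the next one; the sizes below are placeholder values:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=32, input_length=50))
model.add(LSTM(128, return_sequences=True))   # lower LSTM layer feeds its output sequence upward
model.add(LSTM(64))                           # upper LSTM layer returns only its last output
model.add(Dense(5000, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')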

Summary

In this chapter, we began by understanding RNNs and how they enable us to capture sequential dependencies in data. We then looked at the main problem with RNNs: their inability to capture long-term dependencies because of vanishing and exploding gradients. We also looked at the various forms an RNN can take, depending on the type of problem it is being used to solve, and followed that up with a brief discussion of bidirectional and deep RNNs. We then went a step further and looked at how the vanishing and exploding gradient problem can be addressed by adding memory to the network, which led to an expansive discussion of the LSTM, a variant of the RNN that uses the concept of a memory state. We applied this to the problem of text generation, where we used LSTMs to generate text describing hotels in the city of Mumbai. Finally, we briefly discussed other memory-based variants of the RNN, namely GRUs and stacked LSTMs.

We will take the knowledge from this chapter forward into the next chapter, where we look into sequence-to-sequence modeling using encoders and decoders. We will also discuss some of the state-of-the-art methodologies in Natural Language Processing (NLP) and talk about attention and transformers, among other topics.
