Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs

You may recall from Chapter 8, Applying Machine Learning to Sentiment Analysis, that sentiment analysis is concerned with analyzing the expressed opinion of a sentence or a text document. In this section and the following subsections, we will implement a multilayer RNN for sentiment analysis using a many-to-one architecture.

In the next section, we will implement a many-to-many RNN for a language modeling application. While the chosen examples are purposefully simple to introduce the main concepts of RNNs, language modeling has a wide range of interesting applications, such as building chatbots that give computers the ability to talk and interact with humans directly.

Preparing the data

In the preprocessing steps in Chapter 8, Applying Machine Learning to Sentiment Analysis, we created a clean dataset named movie_data.csv, which we'll use again now. So, first let's import the necessary modules and read the data into a pandas DataFrame, as follows:

>>> import pyprind
>>> import pandas as pd
>>> from string import punctuation
>>> import re
>>> import numpy as np
>>>
>>> df = pd.read_csv('movie_data.csv', encoding='utf-8')

Recall that this df DataFrame has two columns, namely 'review' and 'sentiment', where 'review' contains the text of the movie reviews and 'sentiment' contains the 0 or 1 labels. The text of these movie reviews is a sequence of words; therefore, we want to build an RNN model that processes the words in each sequence and, at the end, classifies the entire sequence as belonging to class 0 or 1.

To prepare the data for input to a neural network, we need to encode it into numeric values. To do this, we first find the unique words in the entire dataset, which can be done using sets in Python. However, I found that using sets for finding unique words in such a large dataset is not efficient. A more efficient way is to use Counter from the collections package. If you want to learn more about Counter, refer to its documentation at https://docs.python.org/3/library/collections.html#collections.Counter.

In the following code, we will define a counts object from the Counter class that collects the counts of occurrence of each unique word in the text. Note that in this particular application (and in contrast to the bag-of-words model), we are only interested in the set of unique words and won't require the word counts, which are created as a side product.

Then, we create a mapping in the form of a dictionary that maps each unique word in our dataset to a unique integer. We call this dictionary word_to_int, and it can be used to convert the entire text of a review into a list of numbers. The unique words are sorted based on their counts, but any arbitrary order can be used without affecting the final results. This conversion of a text into a list of integers is performed using the following code:

>>> ## Preprocessing the data:
>>> ## Separate words and
>>> ## count each word's occurrence
>>>
>>> from collections import Counter

>>> counts = Counter()
>>> pbar = pyprind.ProgBar(len(df['review']), 
...                        title='Counting words occurrences')
>>> for i,review in enumerate(df['review']):
...     text = ''.join([c if c not in punctuation else ' '+c+' ' 
...                     for c in review]).lower()
...     df.loc[i,'review'] = text
...     pbar.update()
...     counts.update(text.split())
>>>
>>> ## Create a mapping
>>> ## Map each unique word to an integer
>>> word_counts = sorted(counts, key=counts.get, reverse=True)
>>> print(word_counts[:5])
>>> word_to_int = {word: ii for ii, word in 
...                enumerate(word_counts, 1)}
>>>
>>>
>>> mapped_reviews = []
>>> pbar = pyprind.ProgBar(len(df['review']), 
...                        title='Map reviews to ints')
>>> for review in df['review']:
...     mapped_reviews.append([word_to_int[word] 
...                           for word in review.split()])
...     pbar.update()

So far, we've converted sequences of words into sequences of integers. However, there is one issue that we still need to solve—the sequences currently have different lengths. In order to generate input data that is compatible with our RNN architecture, we will need to make sure that all the sequences have the same length.

For this purpose, we define a parameter called sequence_length and set it to 200. Sequences that have fewer than 200 words will be left-padded with zeros. Conversely, sequences that are longer than 200 words are cut so that only their last 200 words are used. We can implement this preprocessing in two steps:

  1. Create a matrix of zeros, where each row corresponds to a sequence of size 200.
  2. Fill the index of words in each sequence from the right-hand side of the matrix. Thus, if a sequence has a length of 150, the first 50 elements of the corresponding row will stay zero.

These two steps are shown in the following figure, for a small example with eight sequences of sizes 4, 12, 8, 11, 7, 3, 10, and 13:

[Figure: eight example sequences left-padded with zeros or truncated to the same length]

Note that sequence_length is, in fact, a hyperparameter and can be tuned for optimal performance. Due to page limitations, we did not optimize this hyperparameter further, but we encourage you to try this with different values for sequence_length, such as 50, 100, 200, 250, and 300.

Check out the following code for the implementation of these steps to create sequences of the same length:

>>> ## Define same-length sequences
>>> ## if sequence length < 200: left-pad with zeros
>>> ## if sequence length > 200: use the last 200 elements
>>>
>>> sequence_length = 200  ## (Known as T in our RNN formulas)
>>> sequences = np.zeros((len(mapped_reviews), sequence_length),
...                       dtype=int)
>>>
>>> for i, row in enumerate(mapped_reviews):
...     review_arr = np.array(row)
...     sequences[i, -len(row):] = review_arr[-sequence_length:]

After we preprocess the dataset, we can proceed with splitting the data into separate training and test sets. Since the dataset was already shuffled, we can simply take the first half of the dataset for training and the second half for testing, as follows:

>>> ## Note: pandas .loc slicing includes the end label, so we use
>>> ## 25000-1 to select exactly 25,000 training labels
>>> X_train = sequences[:25000,:]
>>> y_train = df.loc[:25000-1, 'sentiment'].values
>>> X_test = sequences[25000:,:]
>>> y_test = df.loc[25000:, 'sentiment'].values

Now, if we want to set data aside for model validation, we can further split the second half of the data to generate a smaller test set and a validation set for hyperparameter optimization.
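For example, a minimal sketch of such a split could look like the following; the X_valid and y_valid names and the 50/50 split of the second half are illustrative choices, not requirements:

>>> ## Illustrative split of the second half into
>>> ## a validation set and a smaller test set
>>> X_valid, y_valid = X_test[:12500, :], y_test[:12500]
>>> X_test, y_test = X_test[12500:, :], y_test[12500:]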

Finally, we define a helper function that breaks a given dataset (which could be a training set or test set) into chunks and returns a generator to iterate through these chunks (also known as mini-batches):

>>> np.random.seed(123) # for reproducibility

>>> ## Define a function to generate mini-batches:
>>> def create_batch_generator(x, y=None, batch_size=64):
...     n_batches = len(x)//batch_size
...     x = x[:n_batches*batch_size]
...     if y is not None:
...         y = y[:n_batches*batch_size]
...     for ii in range(0, len(x), batch_size):
...         if y is not None:
...             yield x[ii:ii+batch_size], y[ii:ii+batch_size]
...         else:
...             yield x[ii:ii+batch_size]

Using generators, as we've done in this code, is a very useful technique for handling memory limitations. This is the recommended approach for splitting the dataset into mini-batches for training a neural network, rather than creating all the data splits upfront and keeping them in memory during training.
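As a quick sanity check, we can, for instance, pull the first mini-batch from this generator and inspect its shape; the bgen, batch_x, and batch_y names below are just for illustration:

>>> bgen = create_batch_generator(X_train, y_train, batch_size=64)
>>> batch_x, batch_y = next(bgen)
>>> print(batch_x.shape, batch_y.shape)
(64, 200) (64,)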

Embedding

During the data preparation in the previous step, we generated sequences of the same length. The elements of these sequences were integer numbers that corresponded to the indices of unique words.

These word indices can be converted into input features in several different ways. One naïve way is to apply one-hot encoding to convert the indices into vectors of zeros and ones. Each word would then be mapped to a vector whose size is the number of unique words in the entire dataset. Given that the number of unique words (the size of the vocabulary) can be on the order of 20,000, which would also be the number of our input features, a model trained on such features may suffer from the curse of dimensionality. Furthermore, these features are very sparse, since all elements are zero except one.
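To make this concrete, the following purely illustrative NumPy snippet one-hot encodes a single word index, assuming a vocabulary of 20,000 words:

>>> vocab_size = 20000            ## illustrative vocabulary size
>>> word_index = 123              ## an arbitrary word index
>>> onehot = np.zeros(vocab_size)
>>> onehot[word_index] = 1.0      ## all elements are zero except one
>>> onehot.shape
(20000,)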

A more elegant way is to map each word to a vector of fixed size with real-valued elements (not necessarily integers). In contrast to the one-hot encoded vectors, we can use finite-sized vectors to represent an infinite number of real numbers (in theory, we can extract infinite real numbers from a given interval, for example [-1, 1]).

This is the idea behind the so-called embedding, which is a feature-learning technique that we can utilize here to automatically learn the salient features to represent the words in our dataset. Given the number of unique words unique_words, we can choose the size of the embedding vectors to be much smaller than the number of unique words (embedding_size << unique_words) to represent the entire vocabulary as input features.

The advantages of embedding over one-hot encoding are as follows:

  • A reduction in the dimensionality of the feature space to decrease the effect of the curse of dimensionality
  • The extraction of salient features since the embedding layer in a neural network is trainable

The following schematic representation shows how embedding works by mapping vocabulary indices to a trainable embedding matrix:

[Figure: mapping vocabulary indices to rows of a trainable embedding matrix]

TensorFlow implements an efficient function, tf.nn.embedding_lookup, that maps each integer corresponding to a unique word to a row of this trainable matrix. For example, the integer 0 is mapped to the first row, the integer 1 to the second row, and so on. Then, given a sequence of integers, such as <0, 5, 3, 4, 19, 2…>, we need to look up the corresponding row of the matrix for each element of this sequence.
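Conceptually, this lookup is equivalent to selecting rows of a matrix by integer indexing. The following NumPy snippet is a minimal illustration of this idea; the matrix size and the index sequence are made up:

>>> toy_embedding = np.random.uniform(-1, 1, size=(20, 4))  ## 20 "words", 4 features
>>> ids = np.array([0, 5, 3, 4, 19, 2])    ## a toy sequence of word indices
>>> toy_embedding[ids].shape               ## one embedding row per index
(6, 4)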

Now let's see how we can create an embedding layer in practice. If we have tf_x as the input layer where the corresponding vocabulary indices are fed with type tf.int32, then creating an embedding layer can be done in two steps, as follows:

  1. We start by creating a matrix of size n_words × embedding_size as a tensor variable, which we call embedding, and we initialize its elements randomly with floats in the range [-1, 1]:
    embedding = tf.Variable(
                    tf.random_uniform(
                        shape=(n_words, embedding_size),
                        minval=-1, maxval=1)
                )
  2. Then, we use the tf.nn.embedding_lookup function to look up the row in the embedding matrix associated with each element of tf_x:
    embed_x = tf.nn.embedding_lookup(embedding, tf_x)

Note

As you may have observed in these steps, to create an embedding layer, the tf.nn.embedding_lookup function requires two arguments: the embedding tensor and the lookup IDs.

The tf.nn.embedding_lookup function has a few optional arguments that allow you to tweak the behavior of the embedding layer, such as applying L2 normalization. Feel free to read more about this function from its official documentation at https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup.
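For instance, assuming the TensorFlow 1.x API used in this chapter, the max_norm argument clips each looked-up embedding vector to a maximum L2 norm before returning it; the value 1.0 below is only an illustration:

    embed_x = tf.nn.embedding_lookup(embedding, tf_x, max_norm=1.0)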

Building an RNN model

Now we're ready to build an RNN model. We'll implement a SentimentRNN class that has the following methods:

  • A constructor to set all the model parameters and then create a computation graph and call the self.build method to build the multilayer RNN model.
  • A build method that declares three placeholders for input data, input labels, and the keep-probability for the dropout configuration of the hidden layer. After declaring these, it creates an embedding layer, and builds the multilayer RNN using the embedded representation as input.
  • A train method that creates a TensorFlow session for launching the computation graph, iterates through the mini-batches of data, and runs for a fixed number of epochs to minimize the cost function defined in the graph. This method also saves the model every 10 epochs for checkpointing.
  • A predict method that creates a new session, restores the last checkpoint saved during the training process, and carries out the predictions for the test data.

In the following code, we'll see the implementation of this class and its methods broken into separate code sections.

The SentimentRNN class constructor

Let's start with the constructor of our SentimentRNN class, which we'll code as follows:

import tensorflow as tf

class SentimentRNN(object):
    def __init__(self, n_words, seq_len=200,
                 lstm_size=256, num_layers=1, batch_size=64,
                 learning_rate=0.0001, embed_size=200):
        self.n_words = n_words
        self.seq_len = seq_len
        self.lstm_size = lstm_size  ## number of hidden units
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.embed_size = embed_size

        self.g = tf.Graph()
        with self.g.as_default():
            tf.set_random_seed(123)
            self.build()
            self.saver = tf.train.Saver()
            self.init_op = tf.global_variables_initializer()

Here, the n_words parameter must be set equal to the number of unique words (plus 1 since we use zero to fill sequences whose size is less than 200) and it's used while creating the embedding layer along with the embed_size hyperparameter. Meanwhile, the seq_len variable must be set according to the length of the sequences that were created in the preprocessing steps we went through previously. Note that lstm_size is another hyperparameter that we've used here, and it determines the number of hidden units in each RNN layer.

The build method

Next, let's discuss the build method of our SentimentRNN class. This is the longest and most critical method in the class, so we'll go through it in detail. First, we'll look at the code in full so we can see everything together, and then we'll analyze each of its main parts:

    def build(self):
        ## Define the placeholders
        tf_x = tf.placeholder(tf.int32,
                    shape=(self.batch_size, self.seq_len),
                    name='tf_x')
        tf_y = tf.placeholder(tf.float32,
                    shape=(self.batch_size),
                    name='tf_y')
        tf_keepprob = tf.placeholder(tf.float32,
                    name='tf_keepprob')

        ## Create the embedding layer
        embedding = tf.Variable(
                    tf.random_uniform(
                        (self.n_words, self.embed_size),
                        minval=-1, maxval=1),
                    name='embedding')
        embed_x = tf.nn.embedding_lookup(
                    embedding, tf_x,
                    name='embeded_x')

        ## Define LSTM cell and stack them together
        cells = tf.contrib.rnn.MultiRNNCell(
                [tf.contrib.rnn.DropoutWrapper(
                   tf.contrib.rnn.BasicLSTMCell(self.lstm_size),
                   output_keep_prob=tf_keepprob)
                 for i in range(self.num_layers)])

        ## Define the initial state:
        self.initial_state = cells.zero_state(
                 self.batch_size, tf.float32)
        print('  << initial state >> ', self.initial_state)

        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(
                 cells, embed_x,
                 initial_state=self.initial_state)

        ## Note: lstm_outputs shape:
        ##  [batch_size, max_time, cells.output_size]
        print('\n  << lstm_output   >> ', lstm_outputs)
        print('\n  << final state   >> ', self.final_state)

        logits = tf.layers.dense(
                 inputs=lstm_outputs[:, -1],
                 units=1, activation=None,
                 name='logits')
        
        logits = tf.squeeze(logits, name='logits_squeezed')
        print('\n  << logits        >> ', logits)
        
        y_proba = tf.nn.sigmoid(logits, name='probabilities')
        predictions = {
            'probabilities': y_proba,
            'labels' : tf.cast(tf.round(y_proba), tf.int32,
                 name='labels')
        }
        print('\n  << predictions   >> ', predictions)

        ## Define the cost function
        cost = tf.reduce_mean(
                 tf.nn.sigmoid_cross_entropy_with_logits(
                 labels=tf_y, logits=logits),
                 name='cost')
        
        ## Define the optimizer
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.minimize(cost, name='train_op')

So first of all in our build method here, we created three placeholders, namely tf_x, tf_y, and tf_keepprob, which we need for feeding the input data. Then we added the embedding layer, which builds the embedded representation embed_x, as we discussed earlier.

Next, in our build method, we built the RNN network with LSTM cells. We did this in three steps:

  1. First, we defined the multilayer RNN cells.
  2. Next, we defined the initial state for these cells.
  3. Finally, we created an RNN specified by the RNN cells and their initial states.

Let's break these three steps out in detail in the following three sections, so we can examine in depth how we built the RNN network in our build method.

Step 1 – defining multilayer RNN cells

To examine how we coded our build method to build the RNN network, the first step was to define our multilayer RNN cells.

Fortunately, TensorFlow has a very nice wrapper class to define LSTM cells—the BasicLSTMCell class—which can be stacked together to form a multilayer RNN using the MultiRNNCell wrapper class. The process of stacking RNN cells with a dropout has three nested steps; these three nested steps can be described from inside out as follows:

  1. First, create the RNN cells using tf.contrib.rnn.BasicLSTMCell.
  2. Apply the dropout to the RNN cells using tf.contrib.rnn.DropoutWrapper.
  3. Make a list of such cells according to the desired number of RNN layers and pass this list to tf.contrib.rnn.MultiRNNCell.

In our build method code, this list is created using a Python list comprehension; a step-by-step equivalent is sketched below. Note that for a single layer, this list contains only one cell.
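For reference, the same stack of cells can be built with an explicit loop rather than a list comprehension. The following sketch assumes the same TensorFlow 1.x tf.contrib.rnn API and the variables defined inside the build method:

    ## Equivalent, step-by-step construction of the stacked cells
    cell_list = []
    for i in range(self.num_layers):
        ## Step 1: create a basic LSTM cell
        lstm_cell = tf.contrib.rnn.BasicLSTMCell(self.lstm_size)
        ## Step 2: apply dropout to the cell outputs
        drop_cell = tf.contrib.rnn.DropoutWrapper(
            lstm_cell, output_keep_prob=tf_keepprob)
        cell_list.append(drop_cell)
    ## Step 3: stack the dropout-wrapped cells into a multilayer RNN cell
    cells = tf.contrib.rnn.MultiRNNCell(cell_list)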

Note

You can read more about the BasicLSTMCell, DropoutWrapper, and MultiRNNCell classes in the official TensorFlow API documentation.

Step 2 – defining the initial states for the RNN cells

The second step that our build method takes to build the RNN network is to define the initial states for the RNN cells.

You'll recall from the architecture of LSTM cells that there are three types of input to an LSTM cell: the input data at the current time step, x^(t); the activations of the hidden units from the previous time step, h^(t-1); and the cell state from the previous time step, C^(t-1).

So, in our build method implementation, the input data x^(t) is the embedded data tensor embed_x. However, when we evaluate the cells, we also need to specify the previous state of the cells. So, when we start processing a new input sequence, we initialize the cell states to the zero state; then, after each time step, we need to store the updated state of the cells to use for the next time step.

Once our multilayer RNN object is defined (cells in our implementation), we define its initial state in our build method using the cells.zero_state method.

Step 3 – creating the RNN using the RNN cells and their states

The third step in creating the RNN in our build method used the tf.nn.dynamic_rnn function to pull together all of our components.

The tf.nn.dynamic_rnn function takes the embedded data, the RNN cells, and their initial states, and creates a pipeline for them according to the unrolled architecture of LSTM cells.

The tf.nn.dynamic_rnn function returns a tuple containing the activations of the RNN cells, outputs, and their final state, state. The outputs tensor is three-dimensional with the shape (batch_size, num_steps, lstm_size). We pass the output of the last time step to a fully connected layer to obtain the logits, and we store the final state to use as the initial state of the next mini-batch of data.

Note

Feel free to read more about the tf.nn.dynamic_rnn function at its official documentation page at https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn.

Finally, in our build method, after setting up the RNN components of the network, the cost function and optimization schemes can be defined like any other neural network.

The train method

The next method in our SentimentRNN class is train. This method is quite similar to the train methods we created in Chapter 14, Going Deeper – The Mechanics of TensorFlow, and Chapter 15, Classifying Images with Deep Convolutional Neural Networks, except that here we also have an additional tensor, the RNN state, that we feed into our network.

The following code shows the implementation of the train method:

    def train(self, X_train, y_train, num_epochs):
        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)
            iteration = 1
            for epoch in range(num_epochs):
                state = sess.run(self.initial_state)
                
                for batch_x, batch_y in create_batch_generator(
                            X_train, y_train, self.batch_size):
                    feed = {'tf_x:0': batch_x,
                            'tf_y:0': batch_y,
                            'tf_keepprob:0': 0.5,
                            self.initial_state : state}
                    loss, _, state = sess.run(
                            ['cost:0', 'train_op',
                             self.final_state],
                             feed_dict=feed)

                    if iteration % 20 == 0:
                        print("Epoch: %d/%d Iteration: %d "
                              "| Train loss: %.5f" % (
                               epoch + 1, num_epochs,
                               iteration, loss))

                    iteration +=1
                if (epoch+1)%10 == 0:
                    self.saver.save(sess,
                        "model/sentiment-%d.ckpt" % epoch)

In this implementation of our train method, at the beginning of each epoch we start from the zero state of the RNN cells as our current state. Each mini-batch of data is run by feeding the current state along with the data batch_x and the labels batch_y. Upon finishing the execution of a mini-batch, we update the state to be the final state returned by tf.nn.dynamic_rnn, and this updated state is used for the execution of the next mini-batch. This process is repeated and the current state is updated throughout the epoch.

The predict method

Finally, the last method in our SentimentRNN class is the predict method, which keeps updating the current state in the same way as the train method, as shown in the following code:

    def predict(self, X_data, return_proba=False):
        preds = []
        with tf.Session(graph = self.g) as sess:
            self.saver.restore(
                sess, tf.train.latest_checkpoint('./model/'))
            test_state = sess.run(self.initial_state)
            for ii, batch_x in enumerate(
                create_batch_generator(
                    X_data, None, batch_size=self.batch_size), 1):
                feed = {'tf_x:0' : batch_x,
                        'tf_keepprob:0' : 1.0,
                        self.initial_state : test_state}
                if return_proba:
                    pred, test_state = sess.run(
                        ['probabilities:0', self.final_state],
                        feed_dict=feed)
                else:
                    pred, test_state = sess.run(
                        ['labels:0', self.final_state],
                        feed_dict=feed)
                    
                preds.append(pred)
                
        return np.concatenate(preds)

Instantiating the SentimentRNN class

We've now coded and examined all four parts of our SentimentRNN class, which were the class constructor, the build method, the train method, and the predict method.

We are now ready to create an object of the class SentimentRNN, with parameters as follows:

>>> n_words = max(list(word_to_int.values())) + 1
>>>
>>> rnn = SentimentRNN(n_words=n_words,
...                    seq_len=sequence_length,
...                    embed_size=256,
...                    lstm_size=128,
...                    num_layers=1,
...                    batch_size=100,
...                    learning_rate=0.001)

Notice here that we use num_layers=1 to build a single RNN layer, although our implementation allows us to create multilayer RNNs by setting num_layers to a value greater than 1. Given the small size of our dataset, a single RNN layer may generalize better to unseen data, since it is less likely to overfit the training data.

Training and optimizing the sentiment analysis RNN model

Next, we can train the RNN model by calling the rnn.train method. In the following code, we train the model for 40 epochs using the input from X_train and the corresponding class labels stored in y_train:

>>> rnn.train(X_train, y_train, num_epochs=40)
Epoch: 1/40 Iteration: 20 | Train loss: 0.70637
Epoch: 1/40 Iteration: 40 | Train loss: 0.60539
Epoch: 1/40 Iteration: 60 | Train loss: 0.66977
Epoch: 1/40 Iteration: 80 | Train loss: 0.51997
...

The trained model is saved using TensorFlow's checkpointing system, which we discussed in Chapter 14, Going Deeper – The Mechanics of TensorFlow. Now, we can use the trained model for predicting the class labels on the test set, as follows:

>>> preds = rnn.predict(X_test)
>>> y_true = y_test[:len(preds)]
>>> print('Test Acc.: %.3f' % (
...     np.sum(preds == y_true) / len(y_true)))

The result will show an accuracy of 86 percent. Given the small size of this dataset, this is comparable to the test prediction accuracy obtained in Chapter 8, Applying Machine Learning to Sentiment Analysis.

We can optimize this further by changing the hyperparameters of the model, such as lstm_size, seq_len, and embed_size, to achieve better generalization performance. However, for hyperparameter tuning, it is recommended that we create a separate validation set and that we don't repeatedly use the test set for evaluation to avoid introducing bias through test data leakage, which we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.

Also, if you're interested in the prediction probabilities on the test set rather than the class labels, then you can set return_proba=True as follows:

>>> proba = rnn.predict(X_test, return_proba=True)

So this was our first RNN model for sentiment analysis. We'll now go further and create an RNN for character-by-character language modeling in TensorFlow, as another popular application of sequence modeling.
