You may recall from Chapter 8, Applying Machine Learning to Sentiment Analysis, that sentiment analysis is concerned with analyzing the expressed opinion of a sentence or a text document. In this section and the following subsections, we will implement a multilayer RNN for sentiment analysis using a many-to-one architecture.
In the next section, we will implement a many-to-many RNN for language modeling. While the chosen examples are purposefully simple to introduce the main concepts of RNNs, language modeling has a wide range of interesting applications, such as building chatbots that give computers the ability to talk and interact directly with a human.
In the preprocessing steps in Chapter 8, Applying Machine Learning to Sentiment Analysis, we created a clean dataset named movie_data.csv, which we'll use again now. So, first let's import the necessary modules and read the data into a pandas DataFrame, as follows:
>>> import pyprind
>>> import pandas as pd
>>> from string import punctuation
>>> import re
>>> import numpy as np
>>>
>>> df = pd.read_csv('movie_data.csv', encoding='utf-8')
Recall that this df data frame has two columns, namely 'review' and 'sentiment', where 'review' contains the text of movie reviews and 'sentiment' contains the 0 or 1 labels. The text component of each movie review is a sequence of words; therefore, we want to build an RNN model to process the words in each sequence and, at the end, classify the entire sequence as class 0 or 1.
To prepare the data for input to a neural network, we need to encode it into numeric values. To do this, we first find the unique words in the entire dataset, which can be done using sets in Python. However, I found that using sets for finding unique words in such a large dataset is not efficient. A more efficient way is to use Counter from the collections module. If you want to learn more about Counter, refer to its documentation at https://docs.python.org/3/library/collections.html#collections.Counter.
In the following code, we will define a counts object from the Counter class that collects the number of occurrences of each unique word in the text. Note that in this particular application (and in contrast to the bag-of-words model), we are only interested in the set of unique words and won't require the word counts, which are created as a by-product.
Then, we create a mapping in the form of a dictionary that maps each unique word in our dataset to a unique integer number. We call this dictionary word_to_int, which can be used to convert the entire text of a review into a list of numbers. The unique words are sorted based on their counts, but any arbitrary order can be used without affecting the final results. This process of converting a text into a list of integers is performed using the following code:
>>> ## Preprocessing the data:
>>> ## Separate words and
>>> ## count each word's occurrence
>>>
>>> from collections import Counter
>>> counts = Counter()
>>> pbar = pyprind.ProgBar(len(df['review']),
...                        title='Counting words occurrences')
>>> for i,review in enumerate(df['review']):
...     text = ''.join([c if c not in punctuation else ' '+c+' '
...                     for c in review]).lower()
...     df.loc[i,'review'] = text
...     pbar.update()
...     counts.update(text.split())
>>>
>>> ## Create a mapping
>>> ## Map each unique word to an integer
>>> word_counts = sorted(counts, key=counts.get, reverse=True)
>>> print(word_counts[:5])
>>> word_to_int = {word: ii for ii, word in
...                enumerate(word_counts, 1)}
>>>
>>>
>>> mapped_reviews = []
>>> pbar = pyprind.ProgBar(len(df['review']),
...                        title='Map reviews to ints')
>>> for review in df['review']:
...     mapped_reviews.append([word_to_int[word]
...                            for word in review.split()])
...     pbar.update()
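To make the mapping concrete, here is a minimal sketch of the same steps on two made-up reviews. The toy_reviews list and the other names below are illustrative assumptions and not part of the dataset; only the standard library is needed, and the progress bar is left out.

```python
from collections import Counter
from string import punctuation

# Two made-up reviews (hypothetical data, for illustration only)
toy_reviews = ["Great movie, great cast!", "Terrible movie."]

counts = Counter()
tokenized = []
for review in toy_reviews:
    # Surround punctuation with spaces so each mark becomes its own token
    text = ''.join(c if c not in punctuation else ' ' + c + ' '
                   for c in review).lower()
    tokens = text.split()
    tokenized.append(tokens)
    counts.update(tokens)

# Most frequent words first; indices start at 1 so that 0 stays free for padding
vocab = sorted(counts, key=counts.get, reverse=True)
word_to_int = {word: idx for idx, word in enumerate(vocab, 1)}

# Replace every token by its integer index
mapped = [[word_to_int[w] for w in tokens] for tokens in tokenized]
print(word_to_int)
print(mapped)
```

Note that the same word always receives the same integer, and that no word is mapped to 0, which we reserve for padding in the next step.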
So far, we've converted sequences of words into sequences of integers. However, there is one issue that we still need to solve—the sequences currently have different lengths. In order to generate input data that is compatible with our RNN architecture, we will need to make sure that all the sequences have the same length.
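For this purpose, we define a parameter called sequence_length that we set to 200. Sequences that have fewer than 200 words will be left-padded with zeros. Conversely, sequences that are longer than 200 words are cut such that only the last 200 words are used. We can implement this preprocessing in two steps:
1. Create a matrix of zeros, where each row corresponds to a sequence of length 200.
2. Fill in the word indices of each sequence from the right-hand side of the matrix, so that a shorter sequence leaves the leading elements of its row at zero.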
These two steps are shown in the following figure, for a small example with eight sequences of sizes 4, 12, 8, 11, 7, 3, 10, and 13:
Note that sequence_length is, in fact, a hyperparameter and can be tuned for optimal performance. Due to page limitations, we did not optimize this hyperparameter further, but we encourage you to try this with different values for sequence_length, such as 50, 100, 200, 250, and 300.
Check out the following code for the implementation of these steps to create sequences of the same length:
>>> ## Define same-length sequences
>>> ## if sequence length < 200: left-pad with zeros
>>> ## if sequence length > 200: use the last 200 elements
>>>
>>> sequence_length = 200  ## (Known as T in our RNN formulas)
>>> sequences = np.zeros((len(mapped_reviews), sequence_length),
...                      dtype=int)
>>>
>>> for i, row in enumerate(mapped_reviews):
...     review_arr = np.array(row)
...     sequences[i, -len(row):] = review_arr[-sequence_length:]
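The effect of left-padding and truncation is easiest to see on a few short lists. The following is a small sketch in plain Python; the helper name pad_or_truncate and the target length of 5 are assumptions chosen for illustration, not part of the code above.

```python
def pad_or_truncate(seq, length):
    """Return a list of exactly `length` integers.

    Short sequences are left-padded with zeros; long sequences
    keep only their last `length` elements.
    """
    tail = seq[-length:]                      # at most `length` trailing items
    return [0] * (length - len(tail)) + tail  # zeros go on the left

short_seq = [7, 8, 9]
long_seq = [1, 2, 3, 4, 5, 6, 7]

padded = pad_or_truncate(short_seq, 5)
truncated = pad_or_truncate(long_seq, 5)
print(padded)
print(truncated)
```

Keeping the zeros on the left means that the actual words sit at the end of each sequence, closest to the last time step whose output we will use for classification.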
After we preprocess the dataset, we can proceed with splitting the data into separate training and test sets. Since the dataset was already shuffled, we can simply take the first half of the dataset for training and the second half for testing, as follows:
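>>> X_train = sequences[:25000,:]
>>> ## .loc slicing is label-based and includes the end label,
>>> ## so rows 0..24999 give exactly 25,000 training labels
>>> y_train = df.loc[:24999, 'sentiment'].values
>>> X_test = sequences[25000:,:]
>>> y_test = df.loc[25000:, 'sentiment'].values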
Now, if we want to separate the dataset for cross-validation, we can split the second half of the data further to generate a smaller test set and a validation set for hyperparameter optimization.
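One possible way to carve out such a validation set is sketched below. The split_holdout helper and the toy sizes are assumptions made for illustration; since the helper only relies on slicing, it would work on Python lists as well as on NumPy arrays.

```python
def split_holdout(features, labels, n_train, n_valid):
    """Split already-shuffled data by position into train/validation/test."""
    stop_valid = n_train + n_valid
    train = (features[:n_train], labels[:n_train])
    valid = (features[n_train:stop_valid], labels[n_train:stop_valid])
    test = (features[stop_valid:], labels[stop_valid:])
    return train, valid, test

# Toy stand-ins for the padded sequences and their labels
features = [[i, i + 1] for i in range(10)]
labels = [i % 2 for i in range(10)]

(train_x, train_y), (valid_x, valid_y), (test_x, test_y) = split_holdout(
    features, labels, n_train=5, n_valid=2)
print(len(train_x), len(valid_x), len(test_x))
```

For the movie review data, one hypothetical choice would be to keep 25,000 reviews for training and divide the remaining half equally between validation and testing.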
Finally, we define a helper function that breaks a given dataset (which could be a training set or test set) into chunks and returns a generator to iterate through these chunks (also known as mini-batches):
>>> np.random.seed(123) # for reproducibility
>>> ## Define a function to generate mini-batches:
>>> def create_batch_generator(x, y=None, batch_size=64):
...     n_batches = len(x)//batch_size
...     x = x[:n_batches*batch_size]
...     if y is not None:
...         y = y[:n_batches*batch_size]
...     for ii in range(0, len(x), batch_size):
...         if y is not None:
...             yield x[ii:ii+batch_size], y[ii:ii+batch_size]
...         else:
...             yield x[ii:ii+batch_size]
Using generators, as we've done in this code, is a very useful technique for handling memory limitations. This is the recommended approach for splitting the dataset into mini-batches for training a neural network, rather than creating all the data splits upfront and keeping them in memory during training.
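A short usage sketch may help to see what the generator yields. The block repeats the function so that it runs on its own, and the ten-element toy lists are made up for illustration.

```python
def create_batch_generator(x, y=None, batch_size=64):
    # Same logic as the helper above, repeated so this sketch is self-contained
    n_batches = len(x) // batch_size
    x = x[:n_batches * batch_size]      # drop samples that do not fill a batch
    if y is not None:
        y = y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        if y is not None:
            yield x[ii:ii + batch_size], y[ii:ii + batch_size]
        else:
            yield x[ii:ii + batch_size]

toy_x = list(range(10))
toy_y = [v % 2 for v in toy_x]

# Nothing is sliced until we start iterating over the generator
gen = create_batch_generator(toy_x, toy_y, batch_size=4)
first_x, first_y = next(gen)

# Collect all batches once to inspect them (only sensible for toy data)
batches = list(create_batch_generator(toy_x, toy_y, batch_size=4))
print(len(batches), batches[-1])
```

With ten samples and a batch size of four, only two full batches are produced, and the last two samples are silently dropped. This detail matters later, when we compare predictions with the true labels.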
During the data preparation in the previous step, we generated sequences of the same length. The elements of these sequences were integer numbers that corresponded to the indices of unique words.
These word indices can be converted into input features in several different ways. One naïve way is to apply one-hot encoding to convert indices into vectors of zeros and ones. Then, each word will be mapped to a vector whose size is the number of unique words in the entire dataset. Given that the number of unique words (the size of the vocabulary) can be on the order of 20,000, which will also be the number of our input features, a model trained on such features may suffer from the curse of dimensionality. Furthermore, these features are very sparse, since all elements are zero except one.
A more elegant way is to map each word to a vector of fixed size with real-valued elements (not necessarily integers). In contrast to the one-hot encoded vectors, we can use finite-sized vectors to represent an infinite number of real numbers (in theory, we can extract infinite real numbers from a given interval, for example [-1, 1]).
This is the idea behind the so-called embedding, which is a feature-learning technique that we can utilize here to automatically learn the salient features to represent the words in our dataset. Given the number of unique words unique_words, we can choose the size of the embedding vectors to be much smaller than the number of unique words (embedding_size << unique_words) to represent the entire vocabulary as input features.
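The advantages of embedding over one-hot encoding are as follows:
- A reduction in the dimensionality of the feature space, which lessens the effect of the curse of dimensionality
- The extraction of salient features, since the embedding layer in a neural network is trainable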
The following schematic representation shows how embedding works by mapping vocabulary indices to a trainable embedding matrix:
TensorFlow implements an efficient function, tf.nn.embedding_lookup, that maps each integer that corresponds to a unique word to a row of this trainable matrix. Because the rows are indexed from zero, integer 0 (which we reserved for padding) is mapped to the first row, integer 1 to the second row, and so on. Then, given a sequence of integers, such as <0, 5, 3, 4, 19, 2…>, we need to look up the corresponding rows for each element of this sequence.
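Conceptually, an embedding lookup is just row selection: looking up index i returns the row at position i, counting from zero. The plain-Python sketch below, which uses a made-up 6 x 3 matrix and hypothetical helper names rather than TensorFlow, also shows that the lookup gives the same result as multiplying a one-hot vector by the embedding matrix, only without building the sparse vector.

```python
import random

random.seed(123)
n_words, embedding_size = 6, 3

# Hypothetical embedding matrix: one row of random floats per vocabulary index
embedding = [[random.uniform(-1, 1) for _ in range(embedding_size)]
             for _ in range(n_words)]

def embedding_lookup(matrix, ids):
    """Select one row of `matrix` for every index in `ids`."""
    return [matrix[i] for i in ids]

def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]

def vector_matrix_product(vec, matrix):
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

sequence = [0, 5, 3, 4, 2]

# Direct row lookup
embedded = embedding_lookup(embedding, sequence)

# The same result via one-hot vectors, at a much higher cost
via_one_hot = [vector_matrix_product(one_hot(i, n_words), embedding)
               for i in sequence]

print(len(embedded), len(embedded[0]))
```

Each of the five indices is replaced by a vector of three floats, so a sequence of length T becomes a T x embedding_size array of input features.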
Now let's see how we can create an embedding layer in practice. If we have tf_x as the input layer where the corresponding vocabulary indices are fed with type tf.int32, then creating an embedding layer can be done in two steps, as follows:
1. We start by creating a matrix of size [n_words x embedding_size] as a tensor variable, which we call embedding, and we initialize its elements randomly with floats drawn uniformly from the range [-1, 1):
    embedding = tf.Variable(
        tf.random_uniform(
            shape=(n_words, embedding_size),
            minval=-1, maxval=1)
    )
2. Then, we use the tf.nn.embedding_lookup function to look up the row in the embedding matrix associated with each element of tf_x:
    embed_x = tf.nn.embedding_lookup(embedding, tf_x)
As you may have observed in these steps, to create an embedding layer, the tf.nn.embedding_lookup function requires two arguments: the embedding tensor and the lookup IDs.
The tf.nn.embedding_lookup function has a few optional arguments that allow you to tweak the behavior of the embedding layer, such as clipping each looked-up embedding by its L2 norm through the max_norm argument. Feel free to read more about this function in its official documentation at https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup.
Now we're ready to build an RNN model. We'll implement a SentimentRNN class that has the following methods:
- A constructor that sets all the model parameters, creates a computation graph, and calls the self.build method to build the multilayer RNN model.
- A build method that declares three placeholders for input data, input labels, and the keep-probability for the dropout configuration of the hidden layer. After declaring these, it creates an embedding layer, and builds the multilayer RNN using the embedded representation as input.
- A train method that creates a TensorFlow session for launching the computation graph, iterates through the mini-batches of data, and runs for a fixed number of epochs, to minimize the cost function defined in the graph. This method also saves the model every 10 epochs for checkpointing.
- A predict method that creates a new session, restores the last checkpoint saved during the training process, and carries out the predictions for the test data.
In the following code, we'll see the implementation of this class and its methods broken into separate code sections.
Let's start with the constructor of our SentimentRNN class, which we'll code as follows:
import tensorflow as tf

class SentimentRNN(object):
    def __init__(self, n_words, seq_len=200,
                 lstm_size=256, num_layers=1, batch_size=64,
                 learning_rate=0.0001, embed_size=200):
        self.n_words = n_words
        self.seq_len = seq_len
        self.lstm_size = lstm_size  ## number of hidden units
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.embed_size = embed_size

        self.g = tf.Graph()
        with self.g.as_default():
            tf.set_random_seed(123)
            self.build()
            self.saver = tf.train.Saver()
            self.init_op = tf.global_variables_initializer()
Here, the n_words parameter must be set equal to the number of unique words (plus 1, since we use zero to fill sequences whose size is less than 200), and it's used while creating the embedding layer along with the embed_size hyperparameter. Meanwhile, the seq_len variable must be set according to the length of the sequences that were created in the preprocessing steps we went through previously. Note that lstm_size is another hyperparameter that we've used here, and it determines the number of hidden units in each RNN layer.
Next, let's discuss the build method for our SentimentRNN class. This is the longest and most critical method in our class, so we'll be going through it in plenty of detail. First, we'll look at the code in full, so we can see everything together, and then we'll analyze each of its main parts:
    def build(self):
        ## Define the placeholders
        tf_x = tf.placeholder(tf.int32,
                    shape=(self.batch_size, self.seq_len),
                    name='tf_x')
        tf_y = tf.placeholder(tf.float32,
                    shape=(self.batch_size),
                    name='tf_y')
        tf_keepprob = tf.placeholder(tf.float32,
                    name='tf_keepprob')

        ## Create the embedding layer
        embedding = tf.Variable(
                    tf.random_uniform(
                        (self.n_words, self.embed_size),
                        minval=-1, maxval=1),
                    name='embedding')
        embed_x = tf.nn.embedding_lookup(
                    embedding, tf_x,
                    name='embeded_x')

        ## Define LSTM cell and stack them together
        cells = tf.contrib.rnn.MultiRNNCell(
                [tf.contrib.rnn.DropoutWrapper(
                    tf.contrib.rnn.BasicLSTMCell(self.lstm_size),
                    output_keep_prob=tf_keepprob)
                 for i in range(self.num_layers)])

        ## Define the initial state:
        self.initial_state = cells.zero_state(
                 self.batch_size, tf.float32)
        print(' << initial state >> ', self.initial_state)

        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(
                 cells, embed_x,
                 initial_state=self.initial_state)

        ## Note: lstm_outputs shape:
        ## [batch_size, max_time, cells.output_size]
        print(' << lstm_output >> ', lstm_outputs)
        print(' << final state >> ', self.final_state)

        logits = tf.layers.dense(
                 inputs=lstm_outputs[:, -1],
                 units=1, activation=None,
                 name='logits')

        logits = tf.squeeze(logits, name='logits_squeezed')
        print (' << logits >> ', logits)

        y_proba = tf.nn.sigmoid(logits, name='probabilities')
        predictions = {
            'probabilities': y_proba,
            'labels' : tf.cast(tf.round(y_proba), tf.int32,
                               name='labels')
        }
        print(' << predictions >> ', predictions)

        ## Define the cost function
        cost = tf.reduce_mean(
                 tf.nn.sigmoid_cross_entropy_with_logits(
                     labels=tf_y, logits=logits),
                 name='cost')

        ## Define the optimizer
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.minimize(cost, name='train_op')
So, first of all, in our build method here, we created three placeholders, namely tf_x, tf_y, and tf_keepprob, which we need for feeding the input data. Then we added the embedding layer, which builds the embedded representation embed_x, as we discussed earlier.
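Next, in our build method, we built the RNN network with LSTM cells. We did this in three steps:
1. First, we defined the multilayer RNN cells.
2. Next, we defined the initial states for these cells.
3. Finally, we created an RNN specified by the RNN cells and their initial states.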
Let's break these three steps out in detail in the following three sections, so we can examine in depth how we built the RNN network in our build method.
The first step in our build method for building the RNN network was to define our multilayer RNN cells.
Fortunately, TensorFlow has a very nice wrapper class to define LSTM cells, the BasicLSTMCell class, which can be stacked together to form a multilayer RNN using the MultiRNNCell wrapper class. The process of stacking RNN cells with dropout has three nested steps, which can be described from inside out as follows:
1. First, create the RNN cells using tf.contrib.rnn.BasicLSTMCell.
2. Apply dropout to the RNN cells using tf.contrib.rnn.DropoutWrapper.
3. Make a list of such cells according to the desired number of RNN layers and pass this list to tf.contrib.rnn.MultiRNNCell.
In our build method code, this list is created using Python list comprehension. Note that for a single layer, this list has only one cell.
You can read more about these functions at the following links:
- tf.contrib.rnn.BasicLSTMCell: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell
- tf.contrib.rnn.DropoutWrapper: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper
- tf.contrib.rnn.MultiRNNCell: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/MultiRNNCell
The second step that our build method takes to build the RNN network is to define the initial states for the RNN cells.
You'll recall from the architecture of LSTM cells that there are three types of inputs in an LSTM cell: the input data, the activations of the hidden units from the previous time step, and the cell state from the previous time step.
So, in our build method implementation, the input data is the embedded embed_x data tensor. However, when we evaluate the cells, we also need to specify the previous state of the cells. So, when we start processing a new input sequence, we initialize the cell states to the zero state; then, after each time step, we need to store the updated state of the cells to use for the next time step.
Once our multilayer RNN object is defined (cells in our implementation), we define its initial state in our build method using the cells.zero_state method.
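To illustrate how the state is carried from one time step to the next, the following sketch runs a scalar LSTM cell (one input feature, one hidden unit) over a short sequence, starting from the zero state. It follows the standard LSTM update equations in plain Python; the weight values in params are arbitrary assumptions, and the code is not meant to reproduce the exact behavior of BasicLSTMCell.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell (standard formulation)."""
    f = sigmoid(p['wf'] * x + p['uf'] * h_prev + p['bf'])    # forget gate
    i = sigmoid(p['wi'] * x + p['ui'] * h_prev + p['bi'])    # input gate
    g = math.tanh(p['wg'] * x + p['ug'] * h_prev + p['bg'])  # candidate value
    o = sigmoid(p['wo'] * x + p['uo'] * h_prev + p['bo'])    # output gate
    c = f * c_prev + i * g    # new cell state
    h = o * math.tanh(c)      # new hidden activation
    return h, c

# Arbitrary (hypothetical) parameter values, for illustration only
params = {'wf': 0.5, 'uf': 0.1, 'bf': 0.0,
          'wi': 0.8, 'ui': -0.2, 'bi': 0.0,
          'wg': 1.0, 'ug': 0.3, 'bg': 0.0,
          'wo': 0.6, 'uo': 0.2, 'bo': 0.0}

# Zero state at the start of a new sequence
h, c = 0.0, 0.0
states = []
for x in [0.5, -1.0, 0.25]:
    # The state returned by one step is the input state of the next step
    h, c = lstm_step(x, h, c, params)
    states.append((h, c))

print(states)
```

The pair (h, c) plays the role of the state that we initialize to zero and then update after every time step.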
The third step in creating the RNN in our build method is to use the tf.nn.dynamic_rnn function to pull together all our components.
The tf.nn.dynamic_rnn function takes the embedded data, the RNN cells, and their initial states, and creates a pipeline for them according to the unrolled architecture of LSTM cells.
The tf.nn.dynamic_rnn function returns a tuple containing the activations of the RNN cells, outputs, and their final states, state. The output is a three-dimensional tensor with the shape (batch_size, num_steps, lstm_size). We pass the outputs of the last time step to a fully connected layer to get logits, and we store the final state to use as the initial state of the next mini-batch of data.
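Because this is a many-to-one model, only the activations at the last time step are used for classification, which is what the slice lstm_outputs[:, -1] in the build method selects. The sketch below mimics this slicing with nested Python lists; the tiny dimensions and the fill values are made up so that each number encodes its own position.

```python
batch_size, num_steps, lstm_size = 2, 3, 4

# Fake "outputs" with shape (batch_size, num_steps, lstm_size);
# the value 100*b + 10*t + u marks sample b, time step t, hidden unit u
outputs = [[[float(100 * b + 10 * t + u) for u in range(lstm_size)]
            for t in range(num_steps)]
           for b in range(batch_size)]

# Keep only the last time step of every sample:
# (batch_size, num_steps, lstm_size) -> (batch_size, lstm_size)
last_step = [sequence[-1] for sequence in outputs]

print(last_step)
```

The result has one vector of lstm_size activations per sample, which is the shape that the fully connected layer expects as input.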
Feel free to read more about the tf.nn.dynamic_rnn function at its official documentation page at https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn.
Finally, in our build method, after setting up the RNN components of the network, the cost function and optimization schemes can be defined like any other neural network.
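The cost in our build method is the mean sigmoid cross-entropy computed directly from the logits. As a rough illustration of why the loss is computed from logits rather than from probabilities, the sketch below compares a naive implementation with the numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|)), which is the formulation given in the TensorFlow documentation for tf.nn.sigmoid_cross_entropy_with_logits. The function names and the toy logits are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def naive_loss(logit, label):
    # Textbook cross-entropy; log(0) issues arise for extreme logits
    p = sigmoid(logit)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def stable_loss(logit, label):
    # Algebraically equivalent form that avoids overflow and log(0)
    return (max(logit, 0.0) - logit * label
            + math.log1p(math.exp(-abs(logit))))

# Toy mini-batch of logits and 0/1 labels (made-up values)
logits = [2.0, -1.5, 0.3, -0.7]
labels = [1.0, 0.0, 0.0, 1.0]

naive = [naive_loss(x, z) for x, z in zip(logits, labels)]
stable = [stable_loss(x, z) for x, z in zip(logits, labels)]

# The cost is the mean loss over the mini-batch
cost = sum(stable) / len(stable)

# The stable form still works for a very confident, wrong prediction
extreme = stable_loss(1000.0, 0.0)
print(cost, extreme)
```

For moderate logits both versions agree, while for the extreme logit the naive version would fail because 1 - sigmoid(1000) evaluates to zero in floating-point arithmetic.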
The next method in our SentimentRNN class is train. This method is quite similar to the train methods we created in Chapter 14, Going Deeper – The Mechanics of TensorFlow and Chapter 15, Classifying Images with Deep Convolutional Neural Networks, except that we have an additional tensor, state, that we feed into our network.
The following code shows the implementation of the train method:
    def train(self, X_train, y_train, num_epochs):
        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)
            iteration = 1
            for epoch in range(num_epochs):
                state = sess.run(self.initial_state)

                for batch_x, batch_y in create_batch_generator(
                            X_train, y_train, self.batch_size):
                    feed = {'tf_x:0': batch_x,
                            'tf_y:0': batch_y,
                            'tf_keepprob:0': 0.5,
                            self.initial_state : state}
                    loss, _, state = sess.run(
                            ['cost:0', 'train_op',
                             self.final_state],
                            feed_dict=feed)

                    if iteration % 20 == 0:
                        print("Epoch: %d/%d Iteration: %d "
                              "| Train loss: %.5f" % (
                               epoch + 1, num_epochs,
                               iteration, loss))

                    iteration +=1
                if (epoch+1)%10 == 0:
                    self.saver.save(sess,
                        "model/sentiment-%d.ckpt" % epoch)
In this implementation of our train method, at the beginning of each epoch, we start from the zero states of RNN cells as our current state. Running each mini-batch of data is performed by feeding the current state along with the data batch_x and their labels batch_y. Upon finishing the execution of a mini-batch, we update the state to be the final state, which is returned by the tf.nn.dynamic_rnn function. This updated state will be used toward execution of the next mini-batch. This process is repeated and the current state is updated throughout the epoch.
Finally, the last method in our SentimentRNN class is the predict method, which keeps updating the current state similar to the train method, as shown in the following code:
    def predict(self, X_data, return_proba=False):
        preds = []
        with tf.Session(graph = self.g) as sess:
            self.saver.restore(
                sess, tf.train.latest_checkpoint('./model/'))
            test_state = sess.run(self.initial_state)
            for ii, batch_x in enumerate(
                create_batch_generator(
                    X_data, None, batch_size=self.batch_size), 1):
                feed = {'tf_x:0' : batch_x,
                        'tf_keepprob:0' : 1.0,
                        self.initial_state : test_state}
                if return_proba:
                    pred, test_state = sess.run(
                        ['probabilities:0', self.final_state],
                        feed_dict=feed)
                else:
                    pred, test_state = sess.run(
                        ['labels:0', self.final_state],
                        feed_dict=feed)

                preds.append(pred)

        return np.concatenate(preds)
We've now coded and examined all four parts of our SentimentRNN class, which were the class constructor, the build method, the train method, and the predict method.
We are now ready to create an object of the class SentimentRNN, with parameters as follows:
>>> n_words = max(list(word_to_int.values())) + 1
>>>
>>> rnn = SentimentRNN(n_words=n_words,
...                    seq_len=sequence_length,
...                    embed_size=256,
...                    lstm_size=128,
...                    num_layers=1,
...                    batch_size=100,
...                    learning_rate=0.001)
Notice here that we use num_layers=1 to use a single RNN layer, although our implementation allows us to create multilayer RNNs by setting num_layers greater than 1. Here, we should consider the small size of our dataset: a single RNN layer may generalize better to unseen data, since it is less likely to overfit the training data.
Next, we can train the RNN model by calling the rnn.train method. In the following code, we train the model for 40 epochs using the input from X_train and the corresponding class labels stored in y_train:
>>> rnn.train(X_train, y_train, num_epochs=40)
Epoch: 1/40 Iteration: 20 | Train loss: 0.70637
Epoch: 1/40 Iteration: 40 | Train loss: 0.60539
Epoch: 1/40 Iteration: 60 | Train loss: 0.66977
Epoch: 1/40 Iteration: 80 | Train loss: 0.51997
...
The trained model is saved using TensorFlow's checkpointing system, which we discussed in Chapter 14, Going Deeper – The Mechanics of TensorFlow. Now, we can use the trained model for predicting the class labels on the test set, as follows:
>>> preds = rnn.predict(X_test)
>>> y_true = y_test[:len(preds)]
>>> print('Test Acc.: %.3f' % (
...       np.sum(preds == y_true) / len(y_true)))
The result will show an accuracy of 86 percent. Given the small size of this dataset, this is comparable to the test prediction accuracy obtained in Chapter 8, Applying Machine Learning to Sentiment Analysis.
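The reason for slicing the labels with y_test[:len(preds)] is that the batch generator only yields full mini-batches, so the number of predictions can be smaller than the number of test samples. The following plain-Python sketch illustrates this with made-up labels and predictions; the helper names n_predictions and accuracy are assumptions for illustration.

```python
def n_predictions(n_samples, batch_size):
    """Number of samples covered when only full mini-batches are used."""
    return (n_samples // batch_size) * batch_size

def accuracy(preds, y_true_full):
    # Compare against only as many labels as there are predictions
    y_true = y_true_full[:len(preds)]
    n_correct = sum(1 for p, t in zip(preds, y_true) if p == t)
    return n_correct / len(y_true)

# Ten toy labels, but a batch size of 4 yields only 8 predictions
y_test_toy = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
preds_toy = [1, 0, 0, 1, 0, 1, 1, 0]

covered = n_predictions(len(y_test_toy), batch_size=4)
acc = accuracy(preds_toy, y_test_toy)
print(covered, acc)
```

With 25,000 test samples and a batch size of 100, every sample is covered; for other batch sizes, a few samples at the end of the test set may be left out of the evaluation.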
We can optimize this further by changing the hyperparameters of the model, such as lstm_size, seq_len, and embed_size, to achieve better generalization performance. However, for hyperparameter tuning, it is recommended that we create a separate validation set and that we don't repeatedly use the test set for evaluation, to avoid introducing bias through test data leakage, which we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.
Also, if you're interested in the prediction probabilities on the test set rather than the class labels, then you can set return_proba=True as follows:
>>> proba = rnn.predict(X_test, return_proba=True)
So this was our first RNN model for sentiment analysis. We'll now go further and create an RNN for character-by-character language modeling in TensorFlow, as another popular application of sequence modeling.