Working with temporal sequences

The last example in this chapter deals with temporal sequences; more specifically, we will see how to handle text, which is a variable-length sequence of words.

Some data science algorithms deal with text using the bag-of-words approach; that is, they don't care where the words are or how they're arranged in the text; they only care about their presence/absence (and maybe their frequency). A special class of deep networks, instead, is specifically designed to operate on sequences, where the order is important.
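To make the contrast concrete, here is a minimal bag-of-words sketch (it uses scikit-learn's CountVectorizer, which is not part of this chapter's example): two sentences containing the same words in a different order receive exactly the same representation, which is precisely the information an order-aware model is meant to preserve.

In: from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat chased the dog", "the dog chased the cat"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs).toarray()

# The two rows are identical: the bag-of-words view discards word order
print(sorted(vectorizer.vocabulary_))
print(bow)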

Some examples of sequence-oriented tasks are as follows:

  • Predict a future stock price, given its historical data: In this case, the input is a sequence of numbers, and the output is a number

  • Predict whether the market will go up or down: In this case, given a sequence of numbers, we want to predict a class (up or down)

  • Translate an English text to French: In this case, the input sequence is converted into another sequence

  • Chatbot: In this case, the input and the output are both sequences (in the same language)

For this example, let's do something easy: we will try to detect the sentiment of a movie review. In this specific case, the input data is a sequence of words (and the order counts!), and the output is a binary label (the sentiment: positive or negative).

Let's start by importing the dataset. Fortunately, Keras already includes this dataset, and it comes pre-indexed; that is, each review is composed not of words but of indexes into a dictionary. It's also possible to keep just the most frequent words and, with the following code, we select a dictionary containing the top 25,000 words:

In: from keras.datasets import imdb
((data_train, y_train),
 (data_test, y_test)) = imdb.load_data(num_words=25000)

Let's take a look at the data and its shape:

In: print(data_train.shape)
print(data_train[0])
print(len(data_train[0]))

Out: (25000,)
[1, 14, 22, 16, 43, 530, .......... 19, 178, 32]
218

Firstly, there are 25,000 reviews; that is, observations. Secondly, each review is composed of a sequence of integers between 1 and 24,999; by default, 1 marks the start of the sequence, 2 replaces any word that is not in the selected dictionary, and the actual word indexes start at 3. Note that each review has a different length; for example, the first one is 218 words long.
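If you want to read an indexed review as text, you can invert the dictionary. The following is a small sketch, not part of the original walkthrough; it relies on imdb.get_word_index() and on the default offsets of load_data described above (start token 1, out-of-vocabulary token 2, word indexes shifted by 3):

In: word_index = imdb.get_word_index()  # maps word -> raw index (1-based)
inverted_index = {idx + 3: word for word, idx in word_index.items()}
inverted_index[1] = '<start>'
inverted_index[2] = '<oov>'

# Rebuild the first training review as readable text
print(' '.join(inverted_index.get(idx, '<?>') for idx in data_train[0]))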

It's now time to trim or pad all the sequences to a fixed length. With Keras, this is easily done and, for padding, the integer 0 is used:

In: from keras.preprocessing.sequence import pad_sequences
X_train = pad_sequences(data_train, maxlen=100)
X_test = pad_sequences(data_test, maxlen=100)

Our training matrix now has a rectangular shape. The first element after the trimming/padding operation becomes the following:

In: print(X_train[0])
print(X_train[0].shape)

Out: [1415, .......... 19, 178, 32]
(100,)

For this observation, only the last 100 words are kept; by default, pad_sequences both pads and truncates at the beginning of the sequence. Overall, all the observations now have 100 dimensions.
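You can verify this trimming/padding behavior in isolation on a couple of toy sequences; this small sketch is just an illustration and is not part of the original example:

In: toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
# Shorter sequences get zeros prepended; longer ones lose their first elements
print(pad_sequences(toy, maxlen=4))
# -> [[0 1 2 3]
#     [6 7 8 9]]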

Let's now create a temporal deep model to predict the review sentiment. The model proposed here has three layers:

  1. An embedding layer. The original dictionary is set to 25,000 words, and the number of units composing the embedding (that is, the layer's output) is 256.
  2. An LSTM layer. LSTM stands for long short-term memory, and it's one of the most powerful deep models for sequences. Thanks to its gated memory cells, it's able to extract information from both nearby and distant words in the sequence (hence the name). In this example, the number of cells is set to 256 (matching the previous layer's output dimension), with a dropout of 0.4 for regularization.
  3. A dense layer with a sigmoid activation. That's what we need for a binary classifier.

Here's the code for doing so:

In: from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam

model = Sequential()
model.add(Embedding(25000, 256, input_length=100))
model.add(LSTM(256, dropout=0.4, recurrent_dropout=0.4))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=64,
          epochs=10,
          validation_data=(X_test, y_test))

Out: Train on 25000 samples, validate on 25000 samples

Epoch 1/10
25000/25000 [==============================] - 139s 6ms/step -
loss: 0.4923 - acc: 0.7632 - val_loss: 0.4246 - val_acc: 0.8144
Epoch 2/10
25000/25000 [==============================] - 139s 6ms/step -
loss: 0.3531 - acc: 0.8525 - val_loss: 0.4104 - val_acc: 0.8235
Epoch 3/10
25000/25000 [==============================] - 138s 6ms/step -
loss: 0.2564 - acc: 0.9000 - val_loss: 0.3964 - val_acc: 0.8404
...
Epoch 10/10
25000/25000 [==============================] - 138s 6ms/step -
loss: 0.0377 - acc: 0.9878 - val_loss: 0.8090 - val_acc: 0.8230

The final validation accuracy (val_acc in the last epoch) is the accuracy on the 25K-review test dataset. That's an acceptable result, since we achieved more than 80% correct classifications with such a simple model. If you feel like improving it, you could try to make the architecture more sophisticated, but always keep in mind that increasing the complexity of the network increases the time needed to train it and to predict the outcome, as well as its memory footprint.
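As a final check, you can score the test set explicitly and try the network on a review of your own. The snippet below is only a sketch, not part of the original example: it assumes the default indexing conventions of imdb.load_data (start token 1, out-of-vocabulary token 2, word indexes shifted by 3), uses a very rough whitespace tokenization, and reuses the 25,000-word dictionary and 100-word padding chosen earlier:

In: loss, acc = model.evaluate(X_test, y_test, verbose=0)
print("Test accuracy: %.4f" % acc)

word_index = imdb.get_word_index()

def encode_review(text, num_words=25000, maxlen=100):
    # Map each word to its index (+3 offset); rare or unknown words become 2 (<oov>)
    indexes = [1]  # 1 marks the start of the sequence
    for word in text.lower().split():
        idx = word_index.get(word, -1) + 3
        indexes.append(idx if 2 < idx < num_words else 2)
    return pad_sequences([indexes], maxlen=maxlen)

review = "a wonderful movie with a brilliant cast and a touching story"
print(model.predict(encode_review(review)))  # values close to 1.0 mean positive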
