How to do it...

As is typical of an RNN task, we will look at a given sequence of 10 words and predict the next possible word. For this exercise, we will take the Alice in Wonderland text to generate words, as follows (the code file is available as RNN_text_generation.ipynb on GitHub):

  1.  Import the relevant packages and dataset:
from keras.models import Sequential
from keras.layers import Dense,Activation
from keras.layers.recurrent import SimpleRNN
from keras.layers import LSTM
import numpy as np
fin=open('alice.txt',encoding='utf-8-sig')
lines=[]
# read the file line by line, lowercasing and skipping blank lines
for line in fin:
    line = line.strip().lower()
    if len(line) == 0:
        continue
    lines.append(line)
fin.close()
text = " ".join(lines)

A sample of the input text looks as follows:
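Since the sample depends on the copy of alice.txt you downloaded, you can also print it yourself (assuming the preceding cell has been run); for example:

print("Number of non-empty lines:", len(lines))
print(text[:200])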

  2. Normalize the text to remove punctuation and convert it to lowercase:
import re
text = text.lower()
text = re.sub('[^0-9a-zA-Z]+',' ',text)
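As a quick illustration of what the regular expression does (using a made-up example string, not a line from the dataset), every run of characters that is not a letter or a digit is replaced with a single space:

sample = "alice's adventures in wonderland -- chapter i."
print(re.sub('[^0-9a-zA-Z]+', ' ', sample))
# prints: alice s adventures in wonderland chapter i (with a trailing space)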
  3. Assign the unique words to an index so that they can be referenced when constructing the training and test datasets:
from collections import Counter
counts = Counter()
counts.update(text.split())
words = sorted(counts, key=counts.get, reverse=True)
nb_words = len(text.split()) # total number of words in the corpus (not unique)
word2index = {word: i for i, word in enumerate(words)}
index2word = {i: word for i, word in enumerate(words)}
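Because words is sorted by descending frequency, index 0 corresponds to the most frequent word in the corpus (in this text, most likely the). A quick round-trip check of the two dictionaries, assuming the preceding cells have been run, looks like this:

print("Most frequent word:", index2word[0])
w = words[100] # an arbitrary word from the vocabulary
assert index2word[word2index[w]] == w # the two mappings are inverses of each other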
  4. Construct the input set of words that leads to an output word. Note that we are considering a sequence of 10 words and trying to predict the 11th word:
SEQLEN = 10
STEP = 1
input_words = []
label_words = []
text2=text.split()
for i in range(0, nb_words-SEQLEN, STEP):
    x = text2[i:(i + SEQLEN)]
    y = text2[i + SEQLEN]
    input_words.append(x)
    label_words.append(y)

A sample of the input_words and label_words lists is as follows:

Note that input_words is a list of lists, whereas the label_words list is not.
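To inspect a sample yourself (the exact words will match the beginning of your normalized text), print the first entry of each list:

print(input_words[0]) # a list of the first 10 words
print(label_words[0]) # the single 11th word
print(len(input_words), len(label_words)) # both equal nb_words - SEQLEN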

  5. Construct the vectors of the input and the output datasets:
total_words = len(set(words)) # vocabulary size (number of unique words)
X = np.zeros((len(input_words), SEQLEN, total_words), dtype=bool)
y = np.zeros((len(input_words), total_words), dtype=bool)

We are creating empty arrays in the preceding step, which will be populated in the following code:

# Create encoded vectors for the input and output values
for i, input_word in enumerate(input_words):
    for j, word in enumerate(input_word):
        X[i, j, word2index[word]] = 1
    y[i, word2index[label_words[i]]] = 1

In the preceding code, the first for loop iterates over all the input sequences (each containing 10 words), and the second for loop iterates over the individual words within the chosen input sequence. Additionally, given that the output is a single word per sample (not a sequence), we do not need a second for loop to update it. The output shapes of X and y are as follows:
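Both arrays have one row per input sequence; X additionally has SEQLEN time steps, each a one-hot vector of length total_words, while y holds a single one-hot vector per sample. You can confirm this directly:

print(X.shape) # (len(input_words), SEQLEN, total_words)
print(y.shape) # (len(input_words), total_words)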

  6. Define the architecture of the model:
HIDDEN_SIZE = 128
BATCH_SIZE = 32
NUM_ITERATIONS = 100
NUM_EPOCHS_PER_ITERATION = 1
NUM_PREDS_PER_EPOCH = 100

model = Sequential()
model.add(LSTM(HIDDEN_SIZE,return_sequences=False,input_shape=(SEQLEN,total_words)))
model.add(Dense(total_words, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

A summary of the model is as follows:
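As a sanity check on the summary, the parameter counts can be reproduced by hand: an LSTM layer has four sets of weights (three gates plus the cell candidate), each with input, recurrent, and bias terms, and the Dense layer has one weight per hidden unit per output word plus a bias. The exact numbers depend on total_words for your corpus:

lstm_params = 4 * ((total_words + HIDDEN_SIZE) * HIDDEN_SIZE + HIDDEN_SIZE)
dense_params = (HIDDEN_SIZE + 1) * total_words
print(lstm_params, dense_params) # should match the counts reported by model.summary()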

  7. Fit the model and look at how the output varies over an increasing number of epochs. We generate a random seed sequence of 10 words and try to predict the next possible word, so that we can observe how our predictions get better over an increasing number of epochs:
for iteration in range(50):
    print("=" * 50)
    print("Iteration #: %d" % (iteration))
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION, validation_split=0.1)
    # pick a random seed sequence from the last 10% of the data (the validation split)
    test_idx = np.random.randint(int(len(input_words)*0.1)) * (-1)
    test_words = input_words[test_idx]
    print("Generating from seed: %s" % (test_words))
    for i in range(NUM_PREDS_PER_EPOCH):
        Xtest = np.zeros((1, SEQLEN, total_words))
        for j, word in enumerate(test_words):
            Xtest[0, j, word2index[word]] = 1
        pred = model.predict(Xtest, verbose=0)[0]
        ypred = index2word[np.argmax(pred)]
        print(ypred, end=' ')
        # slide the window: drop the first word and append the prediction
        test_words = test_words[1:] + [ypred]

In the preceding code, we fit the model on the input and output arrays for one epoch per iteration. Furthermore, we choose a random seed sequence: test_idx is a random negative index, so it points into the last 10% of the input array (the portion held out for validation, as validation_split is 0.1), and we collect the input words at that location. We then convert the seed sequence of words into its one-hot-encoded version (thus obtaining an array that is 1 x 10 x total_words in shape).

Finally, we make a prediction on the array we just created and obtain the word that has the highest probability. Let's look at the output in the first epoch and contrast it with the output in the 25th epoch:

Note that in the first epoch the output is always the word the (the most frequent word in the text). However, the output becomes more reasonable at the end of 50 epochs, as follows:

The words printed after the Generating from seed line are the collection of predictions.

Note that while the training loss decreased over increasing epochs, the validation loss became worse by the end of 50 epochs. This will improve as we train on more text and/or fine-tune our model further.
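One simple way to keep an eye on this while fine-tuning (a sketch, not the recipe's original code, and it assumes you call fit once for 50 epochs rather than looping one epoch at a time) is to use Keras's EarlyStopping callback:

from keras.callbacks import EarlyStopping

# stop training once the validation loss has not improved for 3 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=3)
# model.fit(X, y, batch_size=BATCH_SIZE, epochs=50, validation_split=0.1, callbacks=[early_stop])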

Additionally, this model could be improved further by using a bidirectional LSTM, which we will discuss in the Sequence to Sequence Learning chapter. The output when using a bidirectional LSTM is as follows:
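If you want to try this yourself, the only change to the architecture is to wrap the LSTM layer in the Bidirectional wrapper (a minimal sketch, assuming the same Keras version and variables as earlier; the training loop stays the same):

from keras.layers import Bidirectional

model = Sequential()
model.add(Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=False), input_shape=(SEQLEN, total_words)))
model.add(Dense(total_words, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')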
