Language modeling is a fascinating application that enables machines to perform human-language-related tasks, such as generating English sentences. One of the interesting efforts in this area is the work done by Sutskever, Martens, and Hinton (Generating Text with Recurrent Neural Networks, Ilya Sutskever, James Martens, and Geoffrey E. Hinton, Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011 https://pdfs.semanticscholar.org/93c2/0e38c85b69fc2d2eb314b3c1217913f7db11.pdf).
In the model that we'll build now, the input is a text document, and our goal is to develop a model that can generate new text similar to the input document. Examples of such an input can be a book or a computer program in a specific programming language.
In character-level language modeling, the input is broken down into a sequence of characters that are fed into our network one character at a time. The network will process each new character in conjunction with the memory of the previously seen characters to predict the next character. The following figure shows an example of character-level language modeling:
We can break this implementation down into three separate steps—preparing the data, building the RNN model, and performing next-character prediction and sampling to generate new text.
If you recall from the previous sections of this chapter, we mentioned the exploding gradient problem. In this application, we'll also get a chance to play with a gradient clipping technique to avoid this exploding gradient problem.
In this section, we prepare the data for character-level language modeling.
To get the input data, visit the Project Gutenberg website at https://www.gutenberg.org/, which provides thousands of free e-books. For our example, we can get the book The Tragedie of Hamlet by William Shakespeare in plain text format from http://www.gutenberg.org/cache/epub/2265/pg2265.txt.
Note that this link will directly take you to the download page. If you are using macOS or a Linux operating system, you can download the file with the following command in the Terminal:
curl http://www.gutenberg.org/cache/epub/2265/pg2265.txt > pg2265.txt
If this resource becomes unavailable in the future, a copy of this text is also included in this chapter's code directory in the book's code repository at https://github.com/rasbt/python-machine-learning-book-2nd-edition.
Once we have some data, we can read it into a Python session as plain text. In the following code, the Python variable chars represents the set of unique characters observed in this text. We then create a dictionary that maps each character to an integer, char2int, and a dictionary that performs the reverse mapping, that is, mapping integers to those unique characters, int2char. Using the char2int dictionary, we convert the text into a NumPy array of integers. The following figure shows an example of converting characters into integers and the reverse for the words "Hello" and "world":
This code reads the text from the downloaded file, removes the beginning portion of the text that contains some legal description of the Gutenberg project, and then constructs the dictionaries based on the text:
>>> import numpy as np
>>> ## Reading and processing text
>>> with open('pg2265.txt', 'r', encoding='utf-8') as f:
...     text = f.read()
>>> text = text[15858:]
>>> chars = set(text)
>>> char2int = {ch:i for i,ch in enumerate(chars)}
>>> int2char = dict(enumerate(chars))
>>> text_ints = np.array([char2int[ch] for ch in text],
...                      dtype=np.int32)
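To make the two dictionaries concrete, the following minimal sketch applies the same idea to a tiny string instead of the full book. The names toy_text, toy_chars, toy_char2int, and toy_int2char are assumptions introduced only for this illustration, and sorted() is used here (unlike in the code above) so that the mapping is reproducible across Python sessions:

```python
import numpy as np

# Toy corpus (an assumption for illustration; the chapter uses Hamlet).
toy_text = "Hello world"

# sorted() gives a reproducible character order; a plain set has no
# guaranteed order across Python sessions.
toy_chars = sorted(set(toy_text))
toy_char2int = {ch: i for i, ch in enumerate(toy_chars)}
toy_int2char = dict(enumerate(toy_chars))

# Encode the text as integers, then decode it again.
toy_ints = np.array([toy_char2int[ch] for ch in toy_text], dtype=np.int32)
decoded = ''.join(toy_int2char[int(i)] for i in toy_ints)

print(toy_chars)
print(toy_ints)
print(decoded)
```

Decoding the integer array with the reverse dictionary should recover the original string, which is a quick sanity check worth running on the real text as well.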
Now, we should reshape the data into batches of sequences, the most important step in preparing data. As we know, the goal is to predict the next character based on the sequence of characters that we have observed so far. Therefore, we shift the input (x) and output (y) of the neural network by one character. The following figure shows the preprocessing steps, starting from a text corpus to generating data arrays for x and y:
As you can see in this figure, the training arrays x and y have the same shape, where the number of rows is equal to the batch size and the number of columns is num_batches × num_steps.
Given the input array data that contains the integers that correspond to the characters in the text corpus, the following function will generate x and y with the same structure shown in the previous figure:
>>> def reshape_data(sequence, batch_size, num_steps):
...     mini_batch_length = batch_size * num_steps
...     num_batches = int(len(sequence) / mini_batch_length)
...     if num_batches*mini_batch_length + 1 > len(sequence):
...         num_batches = num_batches - 1
...     ## Truncate the sequence at the end to get rid of
...     ## remaining characters that do not make a full batch
...     x = sequence[0: num_batches*mini_batch_length]
...     y = sequence[1: num_batches*mini_batch_length + 1]
...     ## Split x & y into a list of batches of sequences:
...     x_batch_splits = np.split(x, batch_size)
...     y_batch_splits = np.split(y, batch_size)
...     ## Stack the batches together
...     ## batch_size x (num_batches*num_steps)
...     x = np.stack(x_batch_splits)
...     y = np.stack(y_batch_splits)
...
...     return x, y
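The one-character shift between x and y is easiest to see on a toy sequence. The sketch below uses a hypothetical compact helper, toy_reshape, which is intended to follow the same truncate-and-split logic but is not the function from the text; with an input of consecutive integers, every entry of y should be exactly one larger than the matching entry of x:

```python
import numpy as np

def toy_reshape(sequence, batch_size, num_steps):
    """Hypothetical compact variant of the reshaping logic above."""
    mini_batch_length = batch_size * num_steps
    # Reserve one extra element so that y can be shifted by one.
    num_batches = (len(sequence) - 1) // mini_batch_length
    n = num_batches * mini_batch_length
    x = sequence[:n].reshape(batch_size, -1)
    y = sequence[1:n + 1].reshape(batch_size, -1)
    return x, y

seq = np.arange(50)  # stand-in for the integer-encoded text
x, y = toy_reshape(seq, batch_size=2, num_steps=4)
print(x.shape, y.shape)
print(x[0, :5], y[0, :5])
```

Here, 50 elements with batch_size=2 and num_steps=4 leave room for six full mini-batches, so both arrays have 2 rows and 6 × 4 = 24 columns.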
The next step is to split the arrays x and y into mini-batches where each row is a sequence with length equal to the number of steps. The process of splitting the data array x is shown in the following figure:
In the following code, we define a function named create_batch_generator that splits the data arrays x and y, as shown in the previous figure, and outputs a batch generator. Later, we will use this generator to iterate through the mini-batches during the training of our network:
>>> def create_batch_generator(data_x, data_y, num_steps):
...     batch_size, tot_batch_length = data_x.shape
...     num_batches = int(tot_batch_length/num_steps)
...     for b in range(num_batches):
...         yield (data_x[:, b*num_steps:(b+1)*num_steps],
...                data_y[:, b*num_steps:(b+1)*num_steps])
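As a rough illustration of what the generator yields, the following sketch runs a hypothetical stand-in, toy_batch_generator, on a small 2 × 12 array. The array contents and the helper name are assumptions made for this example; the point is only that each mini-batch is a window of num_steps consecutive columns taken from every row:

```python
import numpy as np

def toy_batch_generator(data_x, data_y, num_steps):
    """Hypothetical stand-in that yields column windows of width num_steps."""
    num_batches = data_x.shape[1] // num_steps
    for b in range(num_batches):
        cols = slice(b * num_steps, (b + 1) * num_steps)
        yield data_x[:, cols], data_y[:, cols]

toy_x = np.arange(24).reshape(2, 12)
toy_y = toy_x + 1  # targets shifted by one, as in the text
batches = list(toy_batch_generator(toy_x, toy_y, num_steps=4))
for bx, by in batches:
    print(bx.shape, bx[0], by[0])
```

Because consecutive mini-batches continue each row where the previous one stopped, the final LSTM state of one mini-batch can be carried over as the initial state of the next one during training.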
At this point, we've completed the data preprocessing steps, and we have the data in the proper format. In the next section, we'll implement the RNN model for character-level language modeling.
To build a character-level neural network, we'll implement a class called CharRNN that constructs the graph of the RNN in order to predict the next character, after observing a given sequence of characters. From the classification perspective, the number of classes is the total number of unique characters that exist in the text corpus. The CharRNN class has four methods, as follows:
A constructor that sets up the learning parameters, creates a computation graph, and calls the build method to construct the graph based on the sampling mode versus the training mode.
A build method that defines the placeholders for feeding the data, constructs the RNN using LSTM cells, and defines the output of the network, the cost function, and the optimizer.
A train method to iterate through the mini-batches and train the network for the specified number of epochs.
A sample method to start from a given string, calculate the probabilities for the next character, and choose a character randomly according to these probabilities. This process will be repeated, and the sampled characters will be concatenated together to form a string. Once the size of this string reaches the specified length, it will return the string.
We'll break these four methods into separate code sections and explain each one. Note that implementing the RNN part of this model is very similar to the implementation in the Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs section. So, we'll skip the description of building the RNN components here.
In contrast to our previous implementation for sentiment analysis, where the same computation graph was used for both training and prediction modes, this time our computation graph is going to be different for the training versus the sampling mode.
Therefore, we need to add a new Boolean argument to the constructor to determine whether we're building the model for the training mode or the sampling mode. The following code shows the implementation of the constructor enclosed in the class definition:
import tensorflow as tf
import os

class CharRNN(object):
    def __init__(self, num_classes, batch_size=64,
                 num_steps=100, lstm_size=128,
                 num_layers=1, learning_rate=0.001,
                 keep_prob=0.5, grad_clip=5,
                 sampling=False):
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.num_steps = num_steps
        self.lstm_size = lstm_size
        self.num_layers = num_layers
        self.learning_rate = learning_rate
        self.keep_prob = keep_prob
        self.grad_clip = grad_clip

        self.g = tf.Graph()
        with self.g.as_default():
            tf.set_random_seed(123)

            self.build(sampling=sampling)
            self.saver = tf.train.Saver()
            self.init_op = tf.global_variables_initializer()
As we planned earlier, the Boolean sampling argument is used to determine whether the instance of CharRNN is for building the graph in the training mode (sampling=False) or the sampling mode (sampling=True).
In addition to the sampling argument, we've introduced a new argument called grad_clip, which is used for clipping the gradients to avoid the exploding gradient problem that we mentioned earlier.
Then, similar to the previous implementation, the constructor creates a computation graph, sets the graph-level random seed for consistent output, and builds the graph by calling the build method.
The next method of the CharRNN class is build, which is very similar to the build method in the Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs section, except for some minor differences. The build method first defines two local variables, batch_size and num_steps, based on the mode, as follows:
In the sampling mode: batch_size = 1 and num_steps = 1
In the training mode: batch_size = self.batch_size and num_steps = self.num_steps
Recall that in the sentiment analysis implementation, we used an embedding layer to create a salient representation for the unique words in the dataset. In contrast, here we are using the one-hot encoding scheme for both x and y with depth=num_classes, where num_classes is in fact the total number of unique characters in the text corpus.
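The effect of one-hot encoding on the array shapes can be sketched with plain NumPy. This is only an analogy to the one-hot step described above, not the TensorFlow call itself, and the vocabulary size and ID values are toy assumptions:

```python
import numpy as np

num_classes = 5                      # toy vocabulary size (assumption)
ids = np.array([[0, 3, 1],
                [4, 4, 2]])          # shape: (batch_size=2, num_steps=3)

# Row i of the identity matrix is the one-hot vector for class i, so
# indexing with the id array appends a trailing axis of size num_classes.
onehot = np.eye(num_classes, dtype=np.float32)[ids]

print(onehot.shape)
print(onehot[0, 1])
```

Each integer ID becomes a vector of length num_classes containing a single 1, so an input of shape (batch_size, num_steps) turns into one of shape (batch_size, num_steps, num_classes).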
Building a multilayer RNN component of the model is exactly the same as in our sentiment analysis implementation, using the tf.nn.dynamic_rnn function. However, outputs from the tf.nn.dynamic_rnn function is a three-dimensional tensor with the shape (batch_size, num_steps, lstm_size). Next, this tensor will be reshaped into a two-dimensional tensor with the shape (batch_size*num_steps, lstm_size), which is passed to the tf.layers.dense function to make a fully connected layer and obtain logits (net inputs). Finally, the probabilities for the next batch of characters are obtained and the cost function is defined. In addition, here, we apply gradient clipping using the tf.clip_by_global_norm function to avoid the exploding gradient problem.
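To give an intuition for clipping by the global norm, here is a NumPy sketch under the assumption that all gradients are rescaled jointly by the factor clip_norm / max(global_norm, clip_norm). The helper clip_by_global_norm_np and the toy gradient values are hypothetical and are not part of the model code:

```python
import numpy as np

def clip_by_global_norm_np(grads, clip_norm):
    """NumPy sketch of global-norm clipping (hypothetical helper)."""
    # Global norm: L2 norm over all gradient arrays taken together.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The scale is 1 when the norm is already within the threshold.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

toy_grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm_np(toy_grads, clip_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Since every gradient is multiplied by the same factor, the overall update direction is preserved and only its magnitude is capped at the threshold.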
The following code shows the implementation of what we've just described for our new build method:
    def build(self, sampling):
        if sampling == True:
            batch_size, num_steps = 1, 1
        else:
            batch_size = self.batch_size
            num_steps = self.num_steps

        tf_x = tf.placeholder(tf.int32,
                              shape=[batch_size, num_steps],
                              name='tf_x')
        tf_y = tf.placeholder(tf.int32,
                              shape=[batch_size, num_steps],
                              name='tf_y')
        tf_keepprob = tf.placeholder(tf.float32,
                                     name='tf_keepprob')

        # One-hot encoding:
        x_onehot = tf.one_hot(tf_x, depth=self.num_classes)
        y_onehot = tf.one_hot(tf_y, depth=self.num_classes)

        ### Build the multi-layer RNN cells
        cells = tf.contrib.rnn.MultiRNNCell(
            [tf.contrib.rnn.DropoutWrapper(
                tf.contrib.rnn.BasicLSTMCell(self.lstm_size),
                output_keep_prob=tf_keepprob)
             for _ in range(self.num_layers)])

        ## Define the initial state
        self.initial_state = cells.zero_state(
            batch_size, tf.float32)

        ## Run each sequence step through the RNN
        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(
            cells, x_onehot,
            initial_state=self.initial_state)

        print(' << lstm_outputs >>', lstm_outputs)

        seq_output_reshaped = tf.reshape(
            lstm_outputs,
            shape=[-1, self.lstm_size],
            name='seq_output_reshaped')

        logits = tf.layers.dense(
            inputs=seq_output_reshaped,
            units=self.num_classes,
            activation=None,
            name='logits')

        proba = tf.nn.softmax(
            logits,
            name='probabilities')

        y_reshaped = tf.reshape(
            y_onehot,
            shape=[-1, self.num_classes],
            name='y_reshaped')
        cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits=logits,
                labels=y_reshaped),
            name='cost')

        # Gradient clipping to avoid "exploding gradients"
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(
            tf.gradients(cost, tvars),
            self.grad_clip)
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            name='train_op')
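The shape bookkeeping in the build method (three-dimensional RNN outputs, two-dimensional logits, per-character probabilities, and a scalar cost) can be traced with a small NumPy sketch. The random arrays below merely stand in for the LSTM outputs and the dense-layer weights; all sizes and names are toy assumptions rather than values from the model:

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, num_steps, lstm_size, num_classes = 2, 3, 4, 5  # toy sizes

# Stand-ins for the LSTM outputs and the dense-layer parameters.
lstm_outputs = rng.randn(batch_size, num_steps, lstm_size)
W = rng.randn(lstm_size, num_classes)
b = np.zeros(num_classes)
targets = rng.randint(num_classes, size=(batch_size, num_steps))

# Flatten time into the batch axis, then apply the dense layer.
flat = lstm_outputs.reshape(-1, lstm_size)
logits = flat @ W + b

# Numerically stable softmax over the class axis.
shifted = logits - logits.max(axis=1, keepdims=True)
proba = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Mean cross-entropy against the flattened integer targets.
flat_targets = targets.reshape(-1)
cost = -np.mean(np.log(proba[np.arange(len(flat_targets)), flat_targets]))

print(flat.shape, logits.shape, proba.shape, cost)
```

With one-hot labels, the cross-entropy of each row reduces to the negative log-probability assigned to the correct next character, which is why the sketch can index proba directly with the integer targets.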
The next method of the CharRNN class is the train method, which is very similar to the train method described in the Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs section. Here is the train method code, which will look very familiar from the sentiment analysis version we built earlier in this chapter:
    def train(self, train_x, train_y,
              num_epochs, ckpt_dir='./model/'):
        ## Create the checkpoint directory
        ## if it does not exist
        if not os.path.exists(ckpt_dir):
            os.mkdir(ckpt_dir)

        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)

            n_batches = int(train_x.shape[1]/self.num_steps)
            iterations = n_batches * num_epochs
            for epoch in range(num_epochs):

                # Train network
                new_state = sess.run(self.initial_state)
                loss = 0
                ## Mini-batch generator:
                bgen = create_batch_generator(
                    train_x, train_y, self.num_steps)
                for b, (batch_x, batch_y) in enumerate(bgen, 1):
                    iteration = epoch*n_batches + b

                    feed = {'tf_x:0': batch_x,
                            'tf_y:0': batch_y,
                            'tf_keepprob:0' : self.keep_prob,
                            self.initial_state : new_state}
                    batch_cost, _, new_state = sess.run(
                        ['cost:0', 'train_op',
                         self.final_state],
                        feed_dict=feed)
                    if iteration % 10 == 0:
                        print('Epoch %d/%d Iteration %d'
                              '| Training loss: %.4f' % (
                              epoch + 1, num_epochs,
                              iteration, batch_cost))

                ## Save the trained model
                self.saver.save(
                    sess, os.path.join(
                        ckpt_dir, 'language_modeling.ckpt'))
The final method in our CharRNN class is the sample method. The behavior of this sample method is similar to that of the predict method that we implemented in the Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs section. However, the difference here is that we calculate the probabilities for the next character from an observed sequence, observed_seq. Then, these probabilities are passed to a function named get_top_char, which randomly selects one character according to the obtained probabilities.
Initially, the observed sequence starts from starter_seq, which is provided as an argument. When new characters are sampled according to their predicted probabilities, they are appended to the observed sequence, and the new observed sequence is used for predicting the next character.
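The sample-and-append loop described above can be illustrated without a trained network. In the following sketch, toy_next_char_proba is a hypothetical, hand-written stand-in for the model that looks only at the last observed character; the vocabulary and the probabilities are assumptions chosen for the example:

```python
import numpy as np

rng = np.random.RandomState(123)
vocab = list("abc ")                      # toy vocabulary (assumption)

def toy_next_char_proba(observed_seq):
    """Hypothetical stand-in for the trained network: favors repeating
    the pattern 'abc ' based only on the last observed character."""
    last = vocab.index(observed_seq[-1])
    proba = np.full(len(vocab), 0.1)
    proba[(last + 1) % len(vocab)] = 0.7
    return proba

observed_seq = list("a")                  # plays the role of starter_seq
for _ in range(10):
    proba = toy_next_char_proba(observed_seq)
    ch_id = rng.choice(len(vocab), p=proba)
    observed_seq.append(vocab[ch_id])     # feed the sample back in

print(''.join(observed_seq))
```

The real model differs in that it carries the LSTM state forward between steps rather than inspecting the sequence again, but the loop structure of predicting, sampling, and appending is the same.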
The implementation of the sample method is as follows:
    def sample(self, output_length,
               ckpt_dir, starter_seq="The "):
        observed_seq = [ch for ch in starter_seq]
        with tf.Session(graph=self.g) as sess:
            self.saver.restore(
                sess,
                tf.train.latest_checkpoint(ckpt_dir))
            ## 1: run the model using the starter sequence
            new_state = sess.run(self.initial_state)
            for ch in starter_seq:
                x = np.zeros((1, 1))
                x[0, 0] = char2int[ch]
                feed = {'tf_x:0': x,
                        'tf_keepprob:0': 1.0,
                        self.initial_state: new_state}
                proba, new_state = sess.run(
                    ['probabilities:0', self.final_state],
                    feed_dict=feed)

            ch_id = get_top_char(proba, len(chars))
            observed_seq.append(int2char[ch_id])

            ## 2: run the model using the updated observed_seq
            for i in range(output_length):
                x[0,0] = ch_id
                feed = {'tf_x:0': x,
                        'tf_keepprob:0': 1.0,
                        self.initial_state: new_state}
                proba, new_state = sess.run(
                    ['probabilities:0', self.final_state],
                    feed_dict=feed)

                ch_id = get_top_char(proba, len(chars))
                observed_seq.append(int2char[ch_id])

        return ''.join(observed_seq)
So here, the sample method calls the get_top_char function to choose a character ID randomly (ch_id) according to the obtained probabilities.
In this get_top_char function, the probabilities are first sorted, then all but the top_n largest probabilities are set to zero, and the renormalized probabilities are passed to the numpy.random.choice function to randomly select one out of these top probabilities. The implementation of the get_top_char function is as follows:
def get_top_char(probas, char_size, top_n=5):
    p = np.squeeze(probas)
    p[np.argsort(p)[:-top_n]] = 0.0
    p = p / np.sum(p)
    ch_id = np.random.choice(char_size, 1, p=p)[0]
    return ch_id
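To see what the top-n restriction does to a probability vector, consider the following sketch. The helper sample_top_n is a hypothetical variant written for this illustration (it works on a copy and also returns the renormalized vector), and the toy probabilities are assumptions:

```python
import numpy as np

def sample_top_n(probas, top_n=2, rng=np.random):
    """Hypothetical variant of top-n sampling that works on a copy."""
    p = np.array(probas, dtype=np.float64).ravel()
    # argsort is ascending, so everything except the last top_n
    # indices belongs to the less likely classes.
    p[np.argsort(p)[:-top_n]] = 0.0
    p = p / p.sum()                 # renormalize to a distribution
    return rng.choice(len(p), p=p), p

toy_proba = [0.05, 0.50, 0.10, 0.30, 0.05]
rng = np.random.RandomState(123)
ch_id, renorm = sample_top_n(toy_proba, top_n=2, rng=rng)
print(renorm)
print(ch_id)
```

With top_n=2, only the two most likely classes (indices 1 and 3) keep a nonzero probability, so the sampled ID can only be one of those two. Restricting the choice in this way trades some diversity for fewer implausible characters.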
Note that this function must be defined before the sample method is called; in the accompanying code, it is defined before the CharRNN class. We've presented it in this order here so that we can explain the concepts in order. Browse through the code notebook that accompanies this chapter to get a better overview of the order in which the functions are defined.
Now we're ready to create an instance of the CharRNN class to build the RNN model, and to train it with the following configurations:
>>> batch_size = 64
>>> num_steps = 100
>>> train_x, train_y = reshape_data(text_ints,
...                                 batch_size,
...                                 num_steps)
>>>
>>> rnn = CharRNN(num_classes=len(chars), batch_size=batch_size)
>>> rnn.train(train_x, train_y,
...           num_epochs=100,
...           ckpt_dir='./model-100/')
The trained model will be saved in a directory called ./model-100/ so that we can reload it later for prediction or for continuing the training.
Next up, we can create a new instance of the CharRNN class in the sampling mode by specifying that sampling=True. We'll call the sample method to load the saved model in the ./model-100/ folder, and generate a sequence of 500 characters:
>>> del rnn
>>>
>>> np.random.seed(123)
>>> rnn = CharRNN(len(chars), sampling=True)
>>> print(rnn.sample(ckpt_dir='./model-100/',
...                  output_length=500))
The generated text will look like the following:
You can see in the resulting output that some English words are mostly preserved. It's also important to note that this is from a text written in an older form of English (Early Modern English); therefore, some words in the original text may be unfamiliar. To get a better result, we would need to train the model for a higher number of epochs. Feel free to repeat this with a much larger document and train the model for more epochs.