Project two – implementing an RNN for character-level language modeling in TensorFlow

Language modeling is a fascinating application that enables machines to perform human-language-related tasks, such as generating English sentences. One of the interesting efforts in this area is the work done by Sutskever, Martens, and Hinton (Generating Text with Recurrent Neural Networks, Ilya Sutskever, James Martens, and Geoffrey E. Hinton, Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011).

In the model that we'll build now, the input is a text document, and our goal is to develop a model that can generate new text similar to the input document. Examples of such an input can be a book or a computer program in a specific programming language.

In character-level language modeling, the input is broken down into a sequence of characters that are fed into our network one character at a time. The network will process each new character in conjunction with the memory of the previously seen characters to predict the next character. The following figure shows an example of character-level language modeling:
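As a small illustration of this setup, each position in a text yields one training pair: the characters seen so far, and the character that follows. The toy string and variable names below are assumptions made for this example and are not part of the project code.

```python
# Toy illustration: every prefix of the text is paired with the
# character that follows it. `toy_text` is a made-up example string.
toy_text = "hello"

pairs = [(toy_text[:i], toy_text[i]) for i in range(1, len(toy_text))]

for context, target in pairs:
    print(repr(context), '->', repr(target))
```

In the actual model, the context is not passed in as a string; it is summarized by the recurrent state of the network, which is updated one character at a time.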

We can break this implementation down into three separate steps—preparing the data, building the RNN model, and performing next-character prediction and sampling to generate new text.

If you recall from the previous sections of this chapter, we mentioned the exploding gradient problem. In this application, we'll also get a chance to play with a gradient clipping technique to avoid this exploding gradient problem.

Preparing the data

In this section, we prepare the data for character-level language modeling.

To get the input data, visit the Project Gutenberg website at, which provides thousands of free e-books. For our example, we can get the book The Tragedie of Hamlet by William Shakespeare in plain text format from

Note that this link will directly take you to the download page. If you are using macOS or a Linux operating system, you can download the file with the following command in the Terminal:

curl > pg2265.txt

If this resource becomes unavailable in future, a copy of this text is also included in this chapter's code directory in the book's code repository at

Once we have some data, we can read it into a Python session as plain text. In the following code, the Python variable chars represents the set of unique characters observed in this text. We then create a dictionary that maps each character to an integer, char2int, and a dictionary that performs the reverse mapping, that is, mapping integers to those unique characters, int2char. Using the char2int dictionary, we convert the text into a NumPy array of integers. The following figure shows an example of converting characters into integers and the reverse for the words "Hello" and "world":

This code reads the text from the downloaded file, removes the beginning portion of the text that contains some legal description of the Gutenberg project, and then constructs the dictionaries based on the text:

>>> import numpy as np
>>> ## Reading and processing text
>>> with open('pg2265.txt', 'r', encoding='utf-8') as f:
...     text = f.read()
>>> text = text[15858:]
>>> chars = set(text)
>>> char2int = {ch:i for i,ch in enumerate(chars)}
>>> int2char = dict(enumerate(chars))
>>> text_ints = np.array([char2int[ch] for ch in text],
...                      dtype=np.int32)
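The same mapping can be tried on a short string to see the round trip from characters to integers and back. This is a minimal sketch; the toy corpus and the `toy_`-prefixed names are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy corpus; the chapter's code uses the Hamlet text instead.
toy_text = "Hello world"

# sorted() is used here only to make the mapping reproducible;
# iterating over a plain set gives an arbitrary order.
toy_chars = sorted(set(toy_text))
toy_char2int = {ch: i for i, ch in enumerate(toy_chars)}
toy_int2char = dict(enumerate(toy_chars))

# Encode the text as integers, then decode it back to a string.
encoded = np.array([toy_char2int[ch] for ch in toy_text], dtype=np.int32)
decoded = ''.join(toy_int2char[int(i)] for i in encoded)

print(encoded)
print(decoded)
```

Because the iteration order of a set of strings can differ between Python runs, sorting the characters is one way to keep the mapping stable if training and sampling happen in separate sessions.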

Now, we should reshape the data into batches of sequences, the most important step in preparing data. As we know, the goal is to predict the next character based on the sequence of characters that we have observed so far. Therefore, we shift the input (x) and output (y) of the neural network by one character. The following figure shows the preprocessing steps, starting from a text corpus to generating data arrays for x and y:


As you can see in this figure, the training arrays x and y have the same shapes or dimensions, where the number of rows is equal to the batch size and the number of columns is num_batches × num_steps.

Given the input array data that contains the integers that correspond to the characters in the text corpus, the following function will generate x and y with the same structure shown in the previous figure:

>>> def reshape_data(sequence, batch_size, num_steps):
...     mini_batch_length = batch_size * num_steps
...     num_batches = int(len(sequence) / mini_batch_length)
...     if num_batches*mini_batch_length + 1 > len(sequence):
...         num_batches = num_batches - 1
...     ## Truncate the sequence at the end to get rid of
...     ## remaining characters that do not make a full batch
...     x = sequence[0: num_batches*mini_batch_length]
...     y = sequence[1: num_batches*mini_batch_length + 1]
...     ## Split x & y into a list of batches of sequences:
...     x_batch_splits = np.split(x, batch_size)
...     y_batch_splits = np.split(y, batch_size)
...     ## Stack the batches together
...     ## batch_size x (num_batches*num_steps)
...     x = np.stack(x_batch_splits)
...     y = np.stack(y_batch_splits)
...     return x, y
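To check the shapes and the one-character shift, the function can be run on a toy integer sequence. The sketch below repeats the logic of the function above in condensed form so that it is self-contained; `toy_sequence` is a stand-in for text_ints and is an assumption of this example.

```python
import numpy as np

def reshape_data(sequence, batch_size, num_steps):
    # Same logic as the function above, condensed for the demo.
    mini_batch_length = batch_size * num_steps
    num_batches = len(sequence) // mini_batch_length
    if num_batches * mini_batch_length + 1 > len(sequence):
        num_batches -= 1
    # Truncate, then shift y by one position relative to x.
    x = sequence[0: num_batches * mini_batch_length]
    y = sequence[1: num_batches * mini_batch_length + 1]
    return (np.stack(np.split(x, batch_size)),
            np.stack(np.split(y, batch_size)))

toy_sequence = np.arange(27)   # stand-in for text_ints
x, y = reshape_data(toy_sequence, batch_size=2, num_steps=3)

print(x.shape, y.shape)        # both have batch_size rows
print(x[0, :5], y[0, :5])      # y is x shifted by one position
```

With 27 elements, a batch size of 2, and 3 steps, four full mini-batches fit, so each array has 2 rows and 4 × 3 = 12 columns.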

The next step is to split the arrays x and y into mini-batches where each row is a sequence with length equal to the number of steps. The process of splitting the data array x is shown in the following figure:


In the following code, we define a function named create_batch_generator that splits the data arrays x and y, as shown in the previous figure, and outputs a batch generator. Later, we will use this generator to iterate through the mini-batches during the training of our network:

>>> def create_batch_generator(data_x, data_y, num_steps):
...     batch_size, tot_batch_length = data_x.shape
...     num_batches = int(tot_batch_length/num_steps)
...     for b in range(num_batches):
...         yield (data_x[:, b*num_steps:(b+1)*num_steps],
...                data_y[:, b*num_steps:(b+1)*num_steps])
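A quick way to see what the generator yields is to run it over small arrays. This is a sketch under assumed toy data; the generator body repeats the definition above so the example runs on its own.

```python
import numpy as np

def create_batch_generator(data_x, data_y, num_steps):
    # Same logic as the generator defined above.
    batch_size, tot_batch_length = data_x.shape
    num_batches = tot_batch_length // num_steps
    for b in range(num_batches):
        yield (data_x[:, b*num_steps:(b+1)*num_steps],
               data_y[:, b*num_steps:(b+1)*num_steps])

# Toy arrays with 2 rows (the batch size) and 12 columns.
toy_x = np.arange(24).reshape(2, 12)
toy_y = toy_x + 1

batches = list(create_batch_generator(toy_x, toy_y, num_steps=3))
print(len(batches))       # number of mini-batches
print(batches[0][0])      # first mini-batch of x
```

Each mini-batch keeps all rows and takes a window of num_steps consecutive columns, so consecutive mini-batches continue the same row-wise sequences.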

At this point, we've completed the data preprocessing steps, and we have the data in the proper format. In the next section, we'll implement the RNN model for character-level language modeling.

Building a character-level RNN model

To build a character-level neural network, we'll implement a class called CharRNN that constructs the graph of the RNN in order to predict the next character, after observing a given sequence of characters. From the classification perspective, the number of classes is the total number of unique characters that exist in the text corpus. The CharRNN class has four methods, as follows:

  • A constructor that sets up the learning parameters, creates a computation graph, and calls the build method to construct the graph based on the sampling mode versus the training mode.
  • A build method that defines the placeholders for feeding the data, constructs the RNN using LSTM cells, and defines the output of the network, the cost function, and the optimizer.
  • A train method to iterate through the mini-batches and train the network for the specified number of epochs.
  • A sample method to start from a given string, calculate the probabilities for the next character, and choose a character randomly according to these probabilities. This process will be repeated, and the sampled characters will be concatenated together to form a string. Once the size of this string reaches the specified length, it will return the string.

We'll break these four methods into separate code sections and explain each one. Note that implementing the RNN part of this model is very similar to the implementation in the Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs section. So, we'll skip the description of building the RNN components here.

The constructor

In contrast to our previous implementation for sentiment analysis, where the same computation graph was used for both training and prediction modes, this time our computation graph is going to be different for the training versus the sampling mode.

Therefore we need to add a new Boolean type argument to the constructor, to determine whether we're building the model for the training mode or the sampling mode. The following code shows the implementation of the constructor enclosed in the class definition:

import tensorflow as tf
import os

class CharRNN(object):
    def __init__(self, num_classes, batch_size=64,
                 num_steps=100, lstm_size=128,
                 num_layers=1, learning_rate=0.001,
                 keep_prob=0.5, grad_clip=5,
                 sampling=False):
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.num_steps = num_steps
        self.lstm_size = lstm_size
        self.num_layers = num_layers
        self.learning_rate = learning_rate
        self.keep_prob = keep_prob
        self.grad_clip = grad_clip
        self.g = tf.Graph()
        with self.g.as_default():
            # Graph-level random seed for consistent output
            tf.set_random_seed(123)

            # Build the graph for the training or the sampling mode
            self.build(sampling=sampling)
            self.saver = tf.train.Saver()

            self.init_op = tf.global_variables_initializer()

As we planned earlier, the Boolean sampling argument is used to determine whether the instance of CharRNN is for building the graph in the training mode (sampling=False) or the sampling mode (sampling=True).

In addition to the sampling argument, we've introduced a new argument called grad_clip, which is used for clipping the gradients to avoid the exploding gradient problem that we mentioned earlier.
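The idea behind clipping by global norm can be sketched in NumPy: if the combined L2 norm of all gradients exceeds the threshold, every gradient is rescaled by the same factor so that the combined norm equals the threshold. The function below is an illustrative stand-in written for this explanation, not TensorFlow's implementation, and the toy gradient values are assumptions.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # Global norm: L2 norm over all gradient arrays taken together.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The scale is 1 when the norm is within the limit, < 1 otherwise.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Toy gradients for two hypothetical variables; global norm is 13.
toy_grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(toy_grads, clip_norm=5.0)

clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, clipped_norm)
```

Because all gradients share one scaling factor, the direction of the overall update is preserved and only its magnitude is limited.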

Then, similar to the previous implementation, the constructor creates a computation graph, sets the graph-level random seed for consistent output, and builds the graph by calling the build method.

The build method

The next method of the CharRNN class is build, which is very similar to the build method in the Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs section, except for some minor differences. The build method first defines two local variables, batch_size and num_steps, based on the mode, as follows:

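In the sampling mode, batch_size and num_steps are both set to 1; in the training mode, they are taken from self.batch_size and self.num_steps.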

Recall that in the sentiment analysis implementation, we used an embedding layer to create a salient representation for the unique words in the dataset. In contrast, here we are using the one-hot encoding scheme for both x and y with depth=num_classes, where num_classes is in fact the total number of unique characters in the text corpus.
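The effect of one-hot encoding on the array shapes can be illustrated with NumPy. This is a sketch with an assumed toy vocabulary size and toy integer IDs; it mirrors what a one-hot operation does but is not the TensorFlow call itself.

```python
import numpy as np

num_classes = 5                        # hypothetical vocabulary size
toy_ids = np.array([[0, 2], [4, 1]])   # shape: (batch_size=2, num_steps=2)

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the ID array adds a trailing axis of size num_classes.
onehot = np.eye(num_classes, dtype=np.float32)[toy_ids]

print(onehot.shape)
print(onehot[0, 1])
```

Each integer ID becomes a vector of length num_classes with a single 1, so an input of shape (batch_size, num_steps) becomes (batch_size, num_steps, num_classes).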

Building the multilayer RNN component of the model is exactly the same as in our sentiment analysis implementation, using the tf.nn.dynamic_rnn function. However, the output of the tf.nn.dynamic_rnn function is a three-dimensional tensor with the shape (batch_size, num_steps, lstm_size). Next, this tensor is reshaped into a two-dimensional tensor with the shape (batch_size*num_steps, lstm_size), which is passed to the tf.layers.dense function to make a fully connected layer and obtain logits (net inputs). Finally, the probabilities for the next batch of characters are obtained and the cost function is defined. In addition, here, we apply gradient clipping using the tf.clip_by_global_norm function to avoid the exploding gradient problem.
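The shape bookkeeping in this paragraph can be traced with a NumPy sketch. Random arrays stand in for the RNN outputs and the dense-layer weights, and all sizes are assumptions chosen for the example; the cross-entropy shown here is one common choice of cost for a softmax output.

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, num_steps, lstm_size, num_classes = 2, 3, 4, 5

# Stand-ins for the RNN outputs and the dense-layer parameters.
lstm_outputs = rng.randn(batch_size, num_steps, lstm_size)
W = rng.randn(lstm_size, num_classes)
b = np.zeros(num_classes)
targets = rng.randint(num_classes, size=(batch_size, num_steps))

# (batch_size, num_steps, lstm_size) -> (batch_size*num_steps, lstm_size)
flat = lstm_outputs.reshape(-1, lstm_size)
logits = flat @ W + b

# Softmax over the class axis (shifted for numerical stability).
shifted = logits - logits.max(axis=1, keepdims=True)
proba = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Mean cross-entropy against the flattened integer targets.
flat_targets = targets.reshape(-1)
cost = -np.mean(np.log(proba[np.arange(flat.shape[0]), flat_targets]))

print(flat.shape, logits.shape, proba.shape)
```

Flattening the batch and time axes lets a single dense layer score every time step of every sequence at once; the targets are flattened the same way so that rows line up.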

The following code shows the implementation of what we've just described for our new build method:

    def build(self, sampling):
        if sampling:
            batch_size, num_steps = 1, 1
        else:
            batch_size = self.batch_size
            num_steps = self.num_steps

        tf_x = tf.placeholder(tf.int32,
                              shape=[batch_size, num_steps],
                              name='tf_x')
        tf_y = tf.placeholder(tf.int32,
                              shape=[batch_size, num_steps],
                              name='tf_y')
        tf_keepprob = tf.placeholder(tf.float32,
                                     name='tf_keepprob')

        # One-hot encoding:
        x_onehot = tf.one_hot(tf_x, depth=self.num_classes)
        y_onehot = tf.one_hot(tf_y, depth=self.num_classes)

        ### Build the multi-layer RNN cells
        ### (LSTM cells wrapped with dropout on their outputs)
        cells = tf.contrib.rnn.MultiRNNCell(
            [tf.contrib.rnn.DropoutWrapper(
                tf.contrib.rnn.BasicLSTMCell(self.lstm_size),
                output_keep_prob=tf_keepprob)
            for _ in range(self.num_layers)])
        ## Define the initial state
        self.initial_state = cells.zero_state(
                    batch_size, tf.float32)

        ## Run each sequence step through the RNN
        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(
                    cells, x_onehot,
                    initial_state=self.initial_state)
        print('  << lstm_outputs  >>', lstm_outputs)

        ## Flatten the batch and time axes
        seq_output_reshaped = tf.reshape(
                    lstm_outputs,
                    shape=[-1, self.lstm_size],
                    name='seq_output_reshaped')

        logits = tf.layers.dense(
                    inputs=seq_output_reshaped,
                    units=self.num_classes,
                    activation=None,
                    name='logits')

        proba = tf.nn.softmax(
                    logits,
                    name='probabilities')

        y_reshaped = tf.reshape(
                    y_onehot,
                    shape=[-1, self.num_classes],
                    name='y_reshaped')
        cost = tf.reduce_mean(
                    tf.nn.softmax_cross_entropy_with_logits(
                        logits=logits,
                        labels=y_reshaped),
                    name='cost')

        # Gradient clipping to avoid "exploding gradients"
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(
                    tf.gradients(cost, tvars),
                    self.grad_clip)
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.apply_gradients(
                    zip(grads, tvars),
                    name='train_op')

The train method

The next method of the CharRNN class is the train method, which is very similar to the train method described in the Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs section. Here is the train method code, which will look very familiar to the sentiment analysis version we built earlier in this chapter:

    def train(self, train_x, train_y,
              num_epochs, ckpt_dir='./model/'):
        ## Create the checkpoint directory
        ## if it does not exist
        if not os.path.exists(ckpt_dir):
            os.mkdir(ckpt_dir)

        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)

            n_batches = int(train_x.shape[1]/self.num_steps)
            iterations = n_batches * num_epochs
            for epoch in range(num_epochs):

                # Train network
                ## Start each epoch from the zero state
                new_state = sess.run(self.initial_state)
                loss = 0
                ## Mini-batch generator:
                bgen = create_batch_generator(
                        train_x, train_y, self.num_steps)
                for b, (batch_x, batch_y) in enumerate(bgen, 1):
                    iteration = epoch*n_batches + b
                    ## Carry the final state of the previous
                    ## mini-batch over as the next initial state
                    feed = {'tf_x:0': batch_x,
                            'tf_y:0': batch_y,
                            'tf_keepprob:0' : self.keep_prob,
                            self.initial_state : new_state}
                    batch_cost, _, new_state = sess.run(
                            ['cost:0', 'train_op',
                                self.final_state],
                            feed_dict=feed)
                    if iteration % 10 == 0:
                        print('Epoch %d/%d Iteration %d'
                              '| Training loss: %.4f' % (
                              epoch + 1, num_epochs,
                              iteration, batch_cost))

                ## Save the trained model
                self.saver.save(
                        sess, os.path.join(
                            ckpt_dir, 'language_modeling.ckpt'))

The sample method

The final method in our CharRNN class is the sample method. The behavior of this sample method is similar to that of the predict method that we implemented in the Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs section. However, the difference here is that we calculate the probabilities for the next character from an observed sequence—observed_seq. Then, these probabilities are passed to a function named get_top_char, which randomly selects one character according to the obtained probabilities.

Initially, the observed sequence starts from starter_seq, which is provided as an argument. When new characters are sampled according to their predicted probabilities, they are appended to the observed sequence, and the new observed sequence is used for predicting the next character.
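Before the TensorFlow implementation, the loop just described can be sketched with a toy stand-in for the trained network. Everything here is hypothetical and chosen for illustration: the three-character vocabulary, the fixed `transition` table, and the `toy_sample` function. Unlike an RNN, this stand-in only looks at the most recent character rather than carrying a hidden state.

```python
import numpy as np

vocab = ['a', 'b', 'c']
toy_char2int = {ch: i for i, ch in enumerate(vocab)}

# Hypothetical stand-in for the trained network: row i holds the
# next-character probabilities after seeing character i.
transition = np.array([[0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8],
                       [0.8, 0.1, 0.1]])

def toy_sample(starter_seq, output_length, seed=123):
    rng = np.random.RandomState(seed)
    observed_seq = list(starter_seq)
    # 1: run through the starter sequence; only the probabilities
    # after its last character are used for the first sample.
    for ch in starter_seq:
        proba = transition[toy_char2int[ch]]
    # 2: sample a character, append it, and feed it back in.
    for _ in range(output_length):
        ch_id = rng.choice(len(vocab), p=proba)
        observed_seq.append(vocab[ch_id])
        proba = transition[ch_id]
    return ''.join(observed_seq)

result = toy_sample('ab', output_length=10)
print(result)
```

The two phases mirror the description above: the starter sequence conditions the model, and each sampled character then becomes the next input.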

The implementation of the sample method is as follows:

    def sample(self, output_length,
               ckpt_dir, starter_seq="The "):
        observed_seq = [ch for ch in starter_seq]
        with tf.Session(graph=self.g) as sess:
            ## Load the saved model from the checkpoint directory
            self.saver.restore(
                sess,
                tf.train.latest_checkpoint(ckpt_dir))
            ## 1: run the model using the starter sequence
            new_state = sess.run(self.initial_state)
            for ch in starter_seq:
                x = np.zeros((1, 1))
                x[0, 0] = char2int[ch]
                feed = {'tf_x:0': x,
                        'tf_keepprob:0': 1.0,
                        self.initial_state: new_state}
                proba, new_state = sess.run(
                        ['probabilities:0', self.final_state],
                        feed_dict=feed)

            ch_id = get_top_char(proba, len(chars))
            observed_seq.append(int2char[ch_id])
            ## 2: run the model using the updated observed_seq
            for i in range(output_length):
                x[0,0] = ch_id
                feed = {'tf_x:0': x,
                        'tf_keepprob:0': 1.0,
                        self.initial_state: new_state}
                proba, new_state = sess.run(
                        ['probabilities:0', self.final_state],
                        feed_dict=feed)

                ch_id = get_top_char(proba, len(chars))
                observed_seq.append(int2char[ch_id])

        return ''.join(observed_seq)

So here, the sample method calls the get_top_char function to choose a character ID randomly (ch_id) according to the obtained probabilities.

In this get_top_char function, the probabilities are first sorted, then the top_n probabilities are passed to the numpy.random.choice function to randomly select one out of these top probabilities. The implementation of the get_top_char function is as follows:

def get_top_char(probas, char_size, top_n=5):
    p = np.squeeze(probas)
    p[np.argsort(p)[:-top_n]] = 0.0
    p = p / np.sum(p)
    ch_id = np.random.choice(char_size, 1, p=p)[0]
    return ch_id
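To see the effect of restricting the choice to the top_n candidates, the function can be called repeatedly on a toy probability vector. This is a minimal sketch: the probabilities are made up, and the function copies its input first, a small deviation from the version above so that the demo array is not modified in place.

```python
import numpy as np

def get_top_char(probas, char_size, top_n=5):
    # Copy first so the caller's array is left untouched
    # (a small deviation from the version above, made for the demo).
    p = np.squeeze(probas).copy()
    p[np.argsort(p)[:-top_n]] = 0.0   # zero out all but the top_n entries
    p = p / np.sum(p)                 # renormalize so they sum to 1
    return np.random.choice(char_size, 1, p=p)[0]

np.random.seed(123)
# Hypothetical output of the network for a 5-character vocabulary.
toy_proba = np.array([[0.05, 0.40, 0.05, 0.30, 0.20]])

samples = [get_top_char(toy_proba, char_size=5, top_n=2)
           for _ in range(200)]
print(sorted(set(samples)))
```

With top_n=2, only the two most probable characters (indices 1 and 3 here) can ever be drawn, which keeps very unlikely characters out of the generated text while still leaving some randomness.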

Note, of course, that this function should be defined before the definition of the CharRNN class; we've explained it in this order here so that we can explain the concepts in order. Browse through the code notebook that accompanies this chapter to get a better overview of the order in which the functions are defined.

Creating and training the CharRNN Model

Now we're ready to create an instance of the CharRNN class to build the RNN model, and to train it with the following configurations:

>>> batch_size = 64
>>> num_steps = 100
>>> train_x, train_y = reshape_data(text_ints,
...                                 batch_size,
...                                 num_steps)
>>> rnn = CharRNN(num_classes=len(chars), batch_size=batch_size)
>>> rnn.train(train_x, train_y,
...           num_epochs=100,
...           ckpt_dir='./model-100/')

The trained model will be saved in a directory called ./model-100/ so that we can reload it later for prediction or for continuing the training.

The CharRNN model in the sampling mode

Next up, we can create a new instance of the CharRNN class in the sampling mode by specifying that sampling=True. We'll call the sample method to load the saved model in the ./model-100/ folder, and generate a sequence of 500 characters:

>>> del rnn
>>> np.random.seed(123)
>>> rnn = CharRNN(len(chars), sampling=True)
>>> print(rnn.sample(ckpt_dir='./model-100/',
...                  output_length=500))

The generated text will look like the following:


You can see in the resulting output that some English words are mostly preserved. It's also important to note that this is from an old English text; therefore, some words in the original text may be unfamiliar. To get a better result, we would need to train the model for a higher number of epochs. Feel free to repeat this with a much larger document and train the model for more epochs.

