Implementation of the language model

In this section, we'll build a language model that operates over characters. For this implementation, we will train on the novel Anna Karenina and see how the network learns the structure and style of the text:

Figure 10: General architecture for the character-level RNN (source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

This network is based on Andrej Karpathy's post on RNNs (link: http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and his implementation in Torch (link: https://github.com/karpathy/char-rnn). There's also some useful information at r2rt (link: http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html) and from Sherjil Ozair (link: https://github.com/sherjilozair/char-rnn-tensorflow) on GitHub. Figure 10 shows the general architecture of the character-wise RNN.

We'll build a character-level RNN trained on the Anna Karenina novel (link: https://en.wikipedia.org/wiki/Anna_Karenina). It'll be able to generate new text based on the text from the book. You will find the .txt file included with the assets of this implementation.

Let’s start by importing the necessary libraries for this character-level implementation:

import numpy as np
import tensorflow as tf

from collections import namedtuple

To start off, we need to prepare the dataset by loading the text and encoding its characters as integers. This makes the characters straightforward to use as input variables for the model:

# Reading the Anna Karenina novel text file
with open('Anna_Karenina.txt', 'r') as f:
    textlines = f.read()

# Building the vocab and encoding the characters as integers
language_vocab = set(textlines)
vocab_to_integer = {char: j for j, char in enumerate(language_vocab)}
integer_to_vocab = dict(enumerate(language_vocab))
encoded_vocab = np.array([vocab_to_integer[char] for char in textlines], dtype=np.int32)
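As a quick sanity check, the mapping is reversible: decoding the encoded integers should reproduce the original text. The following is a minimal, self-contained sketch of this check (the sample string and variable names here are illustrative, not part of the book's code):

```python
# Illustrative sanity check: the character/integer mapping is reversible.
sample_text = "Happy families are all alike"

# Build the vocabulary and the two lookup tables, as done above
vocab = set(sample_text)
char_to_int = {char: j for j, char in enumerate(vocab)}
int_to_char = dict(enumerate(vocab))

# Encode to integers, then decode back to characters
encoded = [char_to_int[ch] for ch in sample_text]
decoded = ''.join(int_to_char[i] for i in encoded)
print(decoded == sample_text)  # True
```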

So, let's have a look at the first 200 characters from the Anna Karenina text:

textlines[:200]
Output:
"Chapter 1 Happy families are all alike; every unhappy family is unhappy in its own way. Everything was in confusion in the Oblonskys' house. The wife had discovered that the husband was carrying on"

We have also converted the characters into integers, which is a convenient form for the network. So, let's have a look at the encoded version of the characters:

encoded_vocab[:200]
Output:
array([70, 34, 54, 29, 24, 19, 76, 45, 2, 79, 79, 79, 69, 54, 29, 29, 49,
45, 66, 54, 39, 15, 44, 15, 19, 12, 45, 54, 76, 19, 45, 54, 44, 44,
45, 54, 44, 15, 27, 19, 58, 45, 19, 30, 19, 76, 49, 45, 59, 56, 34,
54, 29, 29, 49, 45, 66, 54, 39, 15, 44, 49, 45, 15, 12, 45, 59, 56,
34, 54, 29, 29, 49, 45, 15, 56, 45, 15, 24, 12, 45, 11, 35, 56, 79,
35, 54, 49, 53, 79, 79, 36, 30, 19, 76, 49, 24, 34, 15, 56, 16, 45,
35, 54, 12, 45, 15, 56, 45, 31, 11, 56, 66, 59, 12, 15, 11, 56, 45,
15, 56, 45, 24, 34, 19, 45, 1, 82, 44, 11, 56, 12, 27, 49, 12, 37,
45, 34, 11, 59, 12, 19, 53, 45, 21, 34, 19, 45, 35, 15, 66, 19, 45,
34, 54, 64, 79, 64, 15, 12, 31, 11, 30, 19, 76, 19, 64, 45, 24, 34,
54, 24, 45, 24, 34, 19, 45, 34, 59, 12, 82, 54, 56, 64, 45, 35, 54,
12, 45, 31, 54, 76, 76, 49, 15, 56, 16, 45, 11, 56], dtype=int32)

Since the network works with individual characters, the task is similar to a classification problem in which we try to predict the next character from the previous text.

We will feed the model one character at a time, and the model will predict the next character by producing a probability distribution over the possible characters that could come next (the vocabulary). This distribution's size is equivalent to the number of classes the network needs to pick from:

len(language_vocab)
Output:
83
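Before the characters are fed to the network, each integer is typically expanded into a one-hot vector whose length equals the vocabulary size, so that each character becomes one of the 83 classes. The following is a minimal sketch of such a helper (the function name `one_hot_encode` and the toy array are assumptions for illustration, not necessarily the book's code):

```python
import numpy as np

def one_hot_encode(arr, n_labels):
    # Map each integer in arr to a one-hot row vector of length n_labels
    # (the vocabulary size). Illustrative sketch, not the book's code.
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    one_hot[np.arange(arr.size), arr.flatten()] = 1.0
    # Restore the original batch shape, with the one-hot axis appended
    return one_hot.reshape(*arr.shape, n_labels)

# Toy example: a 2x2 batch of character IDs from a 3-character vocabulary
example = np.array([[0, 2], [1, 0]])
print(one_hot_encode(example, 3).shape)  # (2, 2, 3)
```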

Since we'll be using stochastic gradient descent to train our model, we need to convert our data into training batches.
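A common way to build such batches (a sketch under the assumption that each target sequence is the input shifted one character ahead; `get_batches` and its parameters are illustrative names, not necessarily the book's exact code) is:

```python
import numpy as np

def get_batches(arr, batch_size, seq_length):
    # Split the encoded text into mini-batches: each batch holds
    # batch_size sequences of seq_length characters each.
    chars_per_batch = batch_size * seq_length
    n_batches = len(arr) // chars_per_batch

    # Keep only enough characters to form full batches
    arr = arr[:n_batches * chars_per_batch]
    arr = arr.reshape((batch_size, -1))

    for n in range(0, arr.shape[1], seq_length):
        x = arr[:, n:n + seq_length]
        # Targets are the inputs shifted by one character;
        # wrap the last position around to the window's first character
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y
```

For example, with `batch_size=10` and `seq_length=50`, each step yields a `(10, 50)` input array and a matching target array.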
