Character-level language models

Language modeling is an essential task for many applications such as speech recognition and machine translation. In this section, we'll mimic the training process of RNNs to get a deeper understanding of how these networks work. We'll build a language model that operates over characters: we will feed our network a chunk of text and train it to build a probability distribution of the next character given the previous ones, which will then allow us to generate text similar to the text we fed in during training.

For example, suppose we have a language with only four letters in its vocabulary: helo.

The task is to train a recurrent neural network on a specific input sequence of characters such as hello. In this specific example, we have four training samples (see the short sketch after this list):

  1. The probability of the character e should be calculated given the context of the first input character h,
  2. The probability of the character l should be calculated given the context of he,
  3. The probability of the character l should be calculated given the context of hel, and
  4. Finally, the probability of the character o should be calculated given the context of hell.
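
For concreteness, the following short Python sketch (the variable names are illustrative, not from the text) derives these four (input, target) pairs from the training sequence hello:

    # Pair each character of "hello" with the character that should follow it.
    text = 'hello'
    pairs = [(text[i], text[i + 1]) for i in range(len(text) - 1)]
    print(pairs)  # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]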

As we learned in previous chapters, machine learning techniques in general, of which deep learning is a part, only accept real-valued numbers as input. So, we need to somehow convert or encode our input characters into a numerical form. To do this, we will use one-hot-vector encoding, which represents each character as a vector of zeros except for a single one at the index of that character in the vocabulary of the language we are trying to model (in this case helo). After encoding our training samples, we will feed them to the RNN-type model one at a time. For each input character, the output of the RNN-type model will be a 4-dimensional vector (the size of the vector corresponds to the size of the vocabulary) representing the probability of each character in the vocabulary being the next one after the given input. Figure 4 clarifies this process:

Figure 4: Example of an RNN-type network with one-hot-vector encoded characters as input; the output is a distribution over the vocabulary representing the most likely character after the current one (source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
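
To make the encoding concrete, here is a minimal Python sketch of one-hot-vector encoding for our helo vocabulary; the names vocab, char_to_ix, and one_hot are illustrative choices, not part of the original text:

    import numpy as np

    # One-hot encoding for the toy vocabulary "helo": a vector of zeros with
    # a single 1 at the character's index in the vocabulary.
    vocab = ['h', 'e', 'l', 'o']
    char_to_ix = {ch: i for i, ch in enumerate(vocab)}

    def one_hot(ch):
        v = np.zeros(len(vocab))
        v[char_to_ix[ch]] = 1.0
        return v

    print(one_hot('h'))  # [1. 0. 0. 0.]
    print(one_hot('l'))  # [0. 0. 1. 0.]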

As shown in Figure 4, we fed the first character of our input sequence, h, to the model, and the output was a 4-dimensional vector representing the model's confidence about the next character: a confidence of 1.0 that h is the next character after the input h, a confidence of 2.2 for e, a confidence of -3.0 for l, and finally a confidence of 4.1 for o. In this specific example, we know the correct next character is e, based on our training sequence hello. So our primary goal while training this RNN-type network is to increase the confidence of e being the next character and decrease the confidence of the other characters. To perform this optimization, we will use the gradient descent and backpropagation algorithms to update the weights and push the network to produce a higher confidence for our correct next character, e, and so on for the other three training examples.
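
To give a sense of where such a 4-dimensional output vector comes from, here is a minimal sketch of a single forward step of a vanilla RNN, in the spirit of the min-char-rnn code that accompanies the source of Figure 4; the weight names and the hidden size are illustrative assumptions, not the text's exact model:

    import numpy as np

    vocab_size, hidden_size = 4, 8          # illustrative sizes for "helo"
    rng = np.random.default_rng(0)
    Wxh = rng.standard_normal((hidden_size, vocab_size)) * 0.01   # input -> hidden
    Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01  # hidden -> hidden
    Why = rng.standard_normal((vocab_size, hidden_size)) * 0.01   # hidden -> output
    bh, by = np.zeros(hidden_size), np.zeros(vocab_size)

    h = np.zeros(hidden_size)               # initial hidden state
    x = np.array([1.0, 0.0, 0.0, 0.0])      # one-hot encoding of 'h'

    h = np.tanh(Wxh @ x + Whh @ h + bh)     # update the hidden state
    logits = Why @ h + by                   # 4-dim vector of confidences
    print(logits)

During training, backpropagation adjusts these weight matrices so that the logit of the correct next character rises while the others fall.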

As you can see, the RNN-type network outputs a confidence distribution over all the characters of the vocabulary being the next one. We can turn this confidence distribution into a probability distribution, such that increasing one character's probability of being the next one decreases the others' probabilities, because the probabilities need to sum to 1. To make this modification, we can apply a standard softmax layer to every output vector.
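
Applied to the confidences from Figure 4, a softmax looks like the following sketch; subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result:

    import numpy as np

    logits = np.array([1.0, 2.2, -3.0, 4.1])  # confidences for h, e, l, o
    exp = np.exp(logits - np.max(logits))      # stabilized exponentials
    probs = exp / exp.sum()                    # probabilities that sum to 1
    print(probs, probs.sum())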

To generate text from this kind of network, we can feed an initial character to the model and get a probability distribution over the characters that are likely to come next; we then sample a character from this distribution and feed it back as an input to the model. By repeating this process over and over, we can generate a sequence of characters of any desired length.
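
A minimal sketch of this generation loop is shown below; the step function stands in for whatever forward pass a trained model exposes (here a hypothetical placeholder), taking a character index and a hidden state and returning a probability distribution over the vocabulary plus the new state:

    import numpy as np

    vocab = ['h', 'e', 'l', 'o']

    def generate(step, seed_ix, h, length):
        ix, out = seed_ix, [vocab[seed_ix]]
        for _ in range(length):
            probs, h = step(ix, h)                      # distribution over vocab
            ix = np.random.choice(len(vocab), p=probs)  # sample the next character
            out.append(vocab[ix])                       # feed the sample back in
        return ''.join(out)

    # Dummy step for demonstration only: a uniform, stateless model.
    demo_step = lambda ix, h: (np.ones(len(vocab)) / len(vocab), h)
    print(generate(demo_step, 0, None, 10))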
