Understanding language models

In the English language, the character a appears much more often in words and sentences than the character x. Similarly, the word is occurs more frequently than the word specimen. It is possible to learn the probability distributions of characters and words by examining large volumes of text. The following chart shows the probability distribution of letters in a given corpus (text dataset):

Probability distribution of letters in a corpus

We can observe that the probability distribution of characters is non-uniform. This means that we can recover the characters in a word even if some of them are lost to noise. If a particular character is missing from a word, it can be reconstructed from the characters surrounding it: not by guessing randomly, but by picking the character with the highest probability of occurrence given the surrounding characters. Technically speaking, the statistical structure of words in a sentence, or of characters in words, has an entropy well below the maximal (uniform) entropy, and it is this redundancy that makes such reconstruction possible.
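The non-uniform distribution is easy to verify. The following is a minimal Python sketch (the sample string, the variable names, and the choice to count only alphabetic characters are assumptions made for illustration) that estimates P(character) from a small piece of text:

from collections import Counter

# A small sample string; any larger English corpus (an assumption for
# this sketch) would show the same non-uniform pattern more clearly.
text = (
    "In the English language, some characters appear far more often "
    "than others, and a language model exploits this structure."
)

# Count only alphabetic characters, case-folded.
letters = [ch for ch in text.lower() if ch.isalpha()]
counts = Counter(letters)
total = sum(counts.values())

# Relative frequency of each letter, an estimate of P(character).
for ch, n in counts.most_common(5):
    print(f"P({ch!r}) = {n / total:.3f}")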

A language model exploits the statistical structure of a language to express the following:

  • Given the words w_1, w_2, w_3, ..., w_N in a sentence, a language model assigns a probability to the whole sentence, P(w_1, w_2, w_3, ..., w_N).
  • It also assigns a probability to an upcoming word (w_4 in this case) as P(w_4 | w_1, w_2, w_3).

Language models enable a number of applications to be developed in NLP, and some of them are listed as follows:

  • Machine translation: P(enormous cyclone tonight) > P(gain typhoon this evening)
  • Spelling correction: P(satellite constellation) > P(satelitte constellation)
  • Speech recognition: P(I saw a van) > P(eyes awe of an)
  • Typing prediction: autocompletion in Google search, typing-assistance apps

Let's now look at how the probabilities are calculated for the words. Consider a simple sentence, Decembers are cold. The probability of this sentence is expressed as follows:

P("Decembers are cold") = P("December") * P ("are" | "Decembers") * P("cold" | "Decembers are")

Mathematically, the probability computation of words in a sentence (or letters in a word) can be expressed using the chain rule of probability:

P(w_1, w_2, ..., w_N) = P(w_1) * P(w_2 | w_1) * P(w_3 | w_1, w_2) * ... * P(w_N | w_1, w_2, ..., w_(N-1))

Andrey Markov, a Russian mathematician, described stochastic processes that have a property now called the Markov property (or Markov assumption). It states that one can make predictions about the future of the process based solely on its present state, just as well as one could by knowing the process's full history; in other words, given the present, the future is independent of that history.

Based on Markov's assumption, we can rewrite the conditional probability of cold as follows:

P("cold" | "Decembers are") is congruent to P("cold" | "are")

Mathematically, Markov's assumption can be expressed as follows:

P(w_i | w_1, w_2, ..., w_(i-1)) ≈ P(w_i | w_(i-1))

While this mathematical formulation represents the bigram model (two words taken into consideration at a time), it can easily be extended to an n-gram model. In an n-gram model, the conditional probability of a word depends on the previous n-1 words.

Mathematically, an n-gram model is expressed as follows:

P(w_i | w_1, w_2, ..., w_(i-1)) ≈ P(w_i | w_(i-n+1), ..., w_(i-1))
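As a quick illustration, here is a small Python sketch (the helper name ngram_contexts and the padding with <BOS> and <EOS> tags are assumptions made for this example) showing how an n-gram model truncates the conditioning context to the previous n-1 words:

def ngram_contexts(tokens, n):
    """Yield (context, word) pairs where the context is restricted to
    the previous n-1 words, as the Markov assumption prescribes."""
    padded = ["<BOS>"] * (n - 1) + tokens + ["<EOS>"]
    for i in range(n - 1, len(padded)):
        yield tuple(padded[i - (n - 1):i]), padded[i]

tokens = "Decembers are cold".split()

# Bigram view (n=2): each word is conditioned on the single previous word.
for context, word in ngram_contexts(tokens, 2):
    print(f"P({word!r} | {context!r})")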

Consider the famous poem A Girl by Ezra Pound as our corpus for building a bigram model. The following is the text corpus:

The tree has entered my hands,
The sap has ascended my arms,
The tree has grown in my breast-Downward,
The branches grow out of me, like arms.
Tree you are,
Moss you are,
You are violets with wind above them.
A child - so high - you are,
And all this is folly to the world.

We are already aware that in a bigram model, the conditional probability is computed based just on the previous word. So, the probability of a word can be computed as follows:

P(w_i | w_(i-1)) = count(w_(i-1), w_i) / count(w_(i-1))

If we were to compute the probability of the word arms given the word my in the poem, it is computed as the number of times the words arms and my appear together in the poem, divided by the number of times the word my appears in the poem.

We see that the words my arms appeared in the poem only once (in the sentence The sap has ascended my arms). However, the word my appeared in the poem three times (in the sentences The tree has entered my hands, The sap has ascended my arms, and The tree has grown in my breast-Downward).

Therefore, the conditional probability of the word arms given my is 1/3, formally represented as follows:

P("arms" | "my") = P("arms", "my") / P("my") = 1 / 3

To calculate the probabilities of the first and last words, the special tags <BOS> and <EOS> are added at the start and end of each sentence, respectively. Similarly, the probability of a sentence or sequence of words can be calculated using the same approach, by multiplying all of its bigram probabilities.
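The following Python sketch puts these pieces together for the poem corpus; the tokenization (lowercasing and stripping simple punctuation) and the helper names are assumptions made for illustration. It reproduces P("arms" | "my") = 1/3 and scores a short sentence by multiplying its bigram probabilities:

from collections import Counter

# The poem A Girl used as the corpus, with each line treated as a sentence.
poem = [
    "The tree has entered my hands",
    "The sap has ascended my arms",
    "The tree has grown in my breast-Downward",
    "The branches grow out of me, like arms",
    "Tree you are",
    "Moss you are",
    "You are violets with wind above them",
    "A child - so high - you are",
    "And all this is folly to the world",
]

unigram_counts = Counter()
bigram_counts = Counter()

for line in poem:
    # Lowercase and strip simple punctuation -- an illustrative tokenizer.
    words = [w.strip(".,-") for w in line.lower().split()]
    tokens = ["<BOS>"] + [w for w in words if w] + ["<EOS>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("my", "arms"))        # 1/3, as computed by hand above

def sentence_prob(sentence):
    # Multiply the bigram probabilities, including the <BOS> and <EOS> tags.
    tokens = ["<BOS>"] + sentence.lower().split() + ["<EOS>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob("tree you are"))    # (1/9) * (1/3) * 1 * (3/4)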

As language modeling involves predicting the next word in a sequence given the words already present, we can use a trained language model to generate subsequent words from a given starting sequence.
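A minimal sketch of such generation with the bigram model follows, again assuming the poem as the corpus and the same illustrative tokenization. It repeatedly samples the next word from P(word | previous word) until <EOS> is produced:

import random
from collections import Counter, defaultdict

# The same poem corpus and illustrative tokenization as above.
poem = """The tree has entered my hands
The sap has ascended my arms
The tree has grown in my breast-Downward
The branches grow out of me, like arms
Tree you are
Moss you are
You are violets with wind above them
A child - so high - you are
And all this is folly to the world""".splitlines()

# For every word, count which words follow it.
successors = defaultdict(Counter)
for line in poem:
    words = [w.strip(".,-") for w in line.lower().split()]
    tokens = ["<BOS>"] + [w for w in words if w] + ["<EOS>"]
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev][word] += 1

def generate(max_words=10):
    # Sample one word at a time from P(word | previous word).
    word, generated = "<BOS>", []
    for _ in range(max_words):
        candidates = successors[word]
        word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == "<EOS>":
            break
        generated.append(word)
    return " ".join(generated)

print(generate())   # e.g. "the tree has grown in my arms"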
