Chapter 2. Statistical Language Modeling

Computational linguistics is an emerging field that is widely used in analytics, software applications, and contexts where people communicate with machines. Computational linguistics may be defined as a subfield of artificial intelligence. Applications of computational linguistics include machine translation, speech recognition, intelligent Web searching, information retrieval, and intelligent spelling checkers. It is important to understand the preprocessing tasks, or the computations, that can be performed on natural language text. In this chapter, we will discuss ways to calculate word frequencies, the Maximum Likelihood Estimation (MLE) model, interpolation on data, and so on. But first, let's go through the various topics that we will cover in this chapter. They are as follows:

  • Calculating word frequencies (1-gram, 2-gram, 3-gram)
  • Developing MLE for a given text
  • Applying smoothing on the MLE model
  • Developing a back-off mechanism for MLE
  • Applying interpolation on data to get a mix and match
  • Evaluating a language model through perplexity
  • Applying Metropolis-Hastings in modeling languages
  • Applying Gibbs sampling in language processing

Understanding word frequency

Collocations may be defined as combinations of two or more tokens that tend to occur together; for example, the United States, the United Kingdom, Union of Soviet Socialist Republics, and so on.

A unigram represents a single token. The following code will be used to generate unigrams from the Alpino corpus:

>>> import nltk
>>> from nltk.util import ngrams
>>> from nltk.corpus import alpino
>>> alpino.words()
['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
>>> unigrams=ngrams(alpino.words(),1)
>>> for i in unigrams:
    print(i)

Consider another example of generating quadgrams or fourgrams from the alpino corpus:

>>> import nltk
>>> from nltk.util import ngrams
>>> from nltk.corpus import alpino
>>> alpino.words()
['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
>>> quadgrams=ngrams(alpino.words(),4)
>>> for i in quadgrams:
    print(i)

A bigram refers to a pair of tokens. To find bigrams in a text, the words are first lowercased, a BigramCollocationFinder is built from the resulting list of lowercased words, and the association measures provided by BigramAssocMeasures in the nltk.metrics package are used to score and rank the bigrams:

>>> import nltk
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.corpus import webtext
>>> from nltk.metrics import BigramAssocMeasures
>>> tokens=[t.lower() for t in webtext.words('grail.txt')]
>>> words=BigramCollocationFinder.from_words(tokens)
>>> words.nbest(BigramAssocMeasures.likelihood_ratio, 10)
[("'", 's'), ('arthur', ':'), ('#', '1'), ("'", 't'), ('villager', '#'), ('#', '2'), (']', '['), ('1', ':'), ('oh', ','), ('black', 'knight')]

In the preceding code, we can add a word filter that can be used to eliminate stopwords and punctuation:

>>> from nltk.corpus import stopwords
>>> from nltk.corpus import webtext
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> stopset = set(stopwords.words('english'))
>>> stops_filter = lambda w: len(w) < 3 or w in stopset
>>> tokens=[t.lower() for t in webtext.words('grail.txt')]
>>> words=BigramCollocationFinder.from_words(tokens)
>>> words.apply_word_filter(stops_filter)
>>> words.nbest(BigramAssocMeasures.likelihood_ratio, 10)
[('black', 'knight'), ('clop', 'clop'), ('head', 'knight'), ('mumble', 'mumble'), ('squeak', 'squeak'), ('saw', 'saw'), ('holy', 'grail'), ('run', 'away'), ('french', 'guard'), ('cartoon', 'character')]

Here, we can change the number of best-scoring bigrams returned by nbest() from 10 to any other number.

Another way of generating bigrams from a text is using collocation finders. This is given in the following code:

>>> import nltk
>>> from nltk.collocation import *
>>> text1="Hardwork is the key to success. Never give up!"
>>> word = nltk.wordpunct_tokenize(text1)
>>> finder = BigramCollocationFinder.from_words(word)
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> value = finder.score_ngrams(bigram_measures.raw_freq)
>>> sorted(bigram for bigram, score in value)
[('.', 'Never'), ('Hardwork', 'is'), ('Never', 'give'), ('give', 'up'), ('is', 'the'), ('key', 'to'), ('success', '.'), ('the', 'key'), ('to', 'success'), ('up', '!')]

We will now see another piece of code for generating bigrams from the alpino corpus:

>>> import nltk
>>> from nltk.util import ngrams
>>> from nltk.corpus import alpino
>>> alpino.words()
['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
>>> bigrams_tokens=ngrams(alpino.words(),2)
>>> for i in bigrams_tokens:
    print(i)

This code will generate bigrams from the alpino corpus.

We will now see the code for generating trigrams:

>>> import nltk
>>> from nltk.util import ngrams
>>> from nltk.corpus import alpino
>>> alpino.words()
['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
>>> trigrams_tokens=ngrams(alpino.words(),3)
>>> for i in trigrams_tokens:
    print(i)

To generate fourgrams and compute their frequencies, the following code is used:

>>> import nltk
>>> from nltk.collocations import *
>>> text="Hello how are you doing ? I hope you find the book interesting"
>>> tokens=nltk.wordpunct_tokenize(text)
>>> fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
>>> for fourgram, freq in fourgrams.ngram_fd.items():
    print(fourgram,freq)

('hope', 'you', 'find', 'the') 1
('Hello', 'how', 'are', 'you') 1
('you', 'doing', '?', 'I') 1
('are', 'you', 'doing', '?') 1
('how', 'are', 'you', 'doing') 1
('?', 'I', 'hope', 'you') 1
('doing', '?', 'I', 'hope') 1
('find', 'the', 'book', 'interesting') 1
('you', 'find', 'the', 'book') 1
('I', 'hope', 'you', 'find') 1

We will now see the code for generating ngrams for a given sentence:

>>> import nltk
>>> from nltk.util import ngrams
>>> sent=" Hello , please read the book thoroughly . If you have any queries , then don't hesitate to ask . There is no shortcut to success ."
>>> n=5
>>> fivegrams=ngrams(sent.split(),n)
>>> for grams in fivegrams:
	print(grams)


('Hello', ',', 'please', 'read', 'the')
(',', 'please', 'read', 'the', 'book')
('please', 'read', 'the', 'book', 'thoroughly')
('read', 'the', 'book', 'thoroughly', '.')
('the', 'book', 'thoroughly', '.', 'If')
('book', 'thoroughly', '.', 'If', 'you')
('thoroughly', '.', 'If', 'you', 'have')
('.', 'If', 'you', 'have', 'any')
('If', 'you', 'have', 'any', 'queries')
('you', 'have', 'any', 'queries', ',')
('have', 'any', 'queries', ',', 'then')
('any', 'queries', ',', 'then', "don't")
('queries', ',', 'then', "don't", 'hesitate')
(',', 'then', "don't", 'hesitate', 'to')
('then', "don't", 'hesitate', 'to', 'ask')
("don't", 'hesitate', 'to', 'ask', '.')
('hesitate', 'to', 'ask', '.', 'There')
('to', 'ask', '.', 'There', 'is')
('ask', '.', 'There', 'is', 'no')
('.', 'There', 'is', 'no', 'shortcut')
('There', 'is', 'no', 'shortcut', 'to')
('is', 'no', 'shortcut', 'to', 'success')
('no', 'shortcut', 'to', 'success', '.')

Developing MLE for a given text

The Maximum Entropy model, also referred to as multinomial logistic regression or a conditional exponential classifier, is an essential model in the field of NLP. It was introduced to NLP in 1996 by Berger and Della Pietra. Maximum Entropy is defined in NLTK in the nltk.classify.maxent module. In this module, of all the probability distributions that are in accordance with the training data, the one with the highest entropy is chosen. The model refers to two kinds of features, namely input features and joint features. An input feature may be called a feature of unlabeled words, whereas a joint feature may be called a feature of labeled words. Maximum Likelihood Estimation (MLE), on the other hand, is used to generate a probability distribution for the occurrences in a text from a given freqdist; the freqdist parameter is the frequency distribution on which the probability estimates are based.

We'll now see the code for the Maximum Entropy Model in NLTK:

from __future__ import print_function, unicode_literals
__docformat__ = 'epytext en'

try:
    import numpy
except ImportError:
    pass

import tempfile
import os
from collections import defaultdict
from nltk import compat
from nltk.data import gzip_open_unicode
from nltk.util import OrderedDict
from nltk.probability import DictionaryProbDist
from nltk.classify.api import ClassifierI
from nltk.classify.util import CutoffChecker, accuracy, log_likelihood
from nltk.classify.megam import (call_megam,
                                 write_megam_file, parse_megam_weights)
from nltk.classify.tadm import call_tadm, write_tadm_file, parse_tadm_weights

In the preceding code, nltk.probability consists of the FreqDist class that can be used to determine the frequency of the occurrence of individual tokens in a text.

The ProbDistI is used to determine the probability distribution of individual occurrences in a text. There are basically two kinds of probability distributions: derived probability distributions and analytic probability distributions. Derived probability distributions are obtained from frequency distributions. Analytic probability distributions are obtained from parameters, such as variance.
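
For instance, a minimal sketch of FreqDist in use (the sample sentence here is just an illustration) is as follows:

>>> from nltk.probability import FreqDist
>>> fdist = FreqDist("the cat sat on the mat".split())
>>> fdist['the']
2
>>> fdist.N()
6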

In order to obtain the frequency distribution, the maximum likelihood estimate is used. It computes the probability of every occurrence on the basis of its frequency in the frequency distribution:

class MLEProbDist(ProbDistI):

    def __init__(self, freqdist, bins=None):
        self._freqdist = freqdist

    def freqdist(self):
        """
        Return the frequency distribution that this probability
        distribution is based on.
        """
        return self._freqdist

    def prob(self, sample):
        return self._freqdist.freq(sample)

    def max(self):
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<MLEProbDist based on %d samples>' % self._freqdist.N()
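
A minimal usage sketch of MLEProbDist (the tiny frequency distribution here is just an illustration) is as follows:

>>> from nltk.probability import FreqDist, MLEProbDist
>>> fdist = FreqDist(['the', 'cat', 'sat', 'on', 'the', 'mat'])
>>> prob_dist = MLEProbDist(fdist)
>>> prob_dist.prob('the')   # 2 occurrences out of 6 tokens
0.3333333333333333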


class LidstoneProbDist(ProbDistI):
    """
    It is used to obtain a smoothed frequency distribution. It is
    parameterized by a real number, Gamma, whose range typically lies
    between 0 and 1. The LidstoneProbDist calculates the probability of
    a given sample with count c, outcomes N, and bins B as follows:
    (c + Gamma) / (N + B * Gamma).

    This means that Gamma is added to the count of each bin and the
    maximum likelihood estimate is computed from the resulting
    frequency distribution.
    """
    SUM_TO_ONE = False

    def __init__(self, freqdist, gamma, bins=None):
        """
        Use the Lidstone estimate to create a probability distribution
        for freqdist.

        param freqdist: the frequency distribution on which the
        probability estimates are based.

        param bins: the number of sample values that can be obtained
        from the probability distribution. The sum of the probabilities
        is equal to one.
        """
        if (bins == 0) or (bins is None and freqdist.N() == 0):
            name = self.__class__.__name__[:-8]
            raise ValueError('A %s probability distribution ' % name +
                             'must have at least one bin.')
        if (bins is not None) and (bins < freqdist.B()):
            name = self.__class__.__name__[:-8]
            raise ValueError('The number of bins in a %s distribution ' % name +
                             '(%d) must be greater than or equal to ' % bins +
                             'the number of bins in the FreqDist used ' +
                             'to create it (%d).' % freqdist.B())

        self._freqdist = freqdist
        self._gamma = float(gamma)
        self._N = self._freqdist.N()

        if bins is None:
            bins = freqdist.B()
        self._bins = bins

        self._divisor = self._N + bins * gamma
        if self._divisor == 0.0:
            # In extreme cases we force the probability to be 0,
            # which it will be, since the count will be 0:
            self._gamma = 0
            self._divisor = 1

    def freqdist(self):
        """
        Return the frequency distribution on which this probability
        distribution is based.
        """
        return self._freqdist

    def prob(self, sample):
        c = self._freqdist[sample]
        return (c + self._gamma) / self._divisor

    def max(self):
        # To obtain the most probable sample, choose the one
        # that occurs most frequently.
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def discount(self):
        gb = self._gamma * self._bins
        return gb / (self._N + gb)

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<LidstoneProbDist based on %d samples>' % self._freqdist.N()
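
As a quick illustration of the Lidstone formula (c + Gamma) / (N + B * Gamma), the following sketch smooths a tiny, made-up frequency distribution with Gamma = 0.1:

>>> from nltk.probability import FreqDist, LidstoneProbDist
>>> fdist = FreqDist(['a', 'b', 'a'])
>>> lid = LidstoneProbDist(fdist, 0.1)
>>> lid.prob('a')   # (2 + 0.1) / (3 + 2 * 0.1)
0.65625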


class LaplaceProbDist(LidstoneProbDist):
    """
    It is used to obtain a smoothed frequency distribution. It
    calculates the probability of a sample with count c, outcomes N,
    and bins B as follows:

    (c + 1) / (N + B)

    This means that 1 is added to the count of every bin, and the
    maximum likelihood estimate is computed from the resulting
    frequency distribution.
    """
    def __init__(self, freqdist, bins=None):
        """
        Use the Laplace estimate to create a probability distribution
        for freqdist.

        param freqdist: the frequency distribution on which the
        probability estimates are based.

        param bins: the number of sample values that can be generated.
        The sum of the probabilities must be 1.
        """
        LidstoneProbDist.__init__(self, freqdist, 1, bins)

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<LaplaceProbDist based on %d samples>' % self._freqdist.N()
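
Similarly, a sketch of LaplaceProbDist applying (c + 1) / (N + B) to the same made-up distribution, with three bins, is as follows:

>>> from nltk.probability import FreqDist, LaplaceProbDist
>>> fdist = FreqDist(['a', 'b', 'a'])
>>> lap = LaplaceProbDist(fdist, bins=3)
>>> lap.prob('a')   # (2 + 1) / (3 + 3)
0.5
>>> lap.prob('c')   # unseen sample: (0 + 1) / (3 + 3)
0.16666666666666666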

class ELEProbDist(LidstoneProbDist):
    """
    It is used to obtain a smoothed frequency distribution. It
    calculates the probability of a sample with count c, outcomes N,
    and bins B as follows:

    (c + 0.5) / (N + B / 2)

    This means that 0.5 is added to the count of every bin, and the
    maximum likelihood estimate is computed from the resulting
    frequency distribution.
    """
    def __init__(self, freqdist, bins=None):
        """
        Use the expected likelihood estimate to create a probability
        distribution for freqdist.

        param freqdist: the frequency distribution on which the
        probability estimates are based.

        param bins: the number of sample values that can be generated.
        The sum of the probabilities must be 1.
        """
        LidstoneProbDist.__init__(self, freqdist, 0.5, bins)

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<ELEProbDist based on %d samples>' % self._freqdist.N()
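
A sketch of ELEProbDist applying (c + 0.5) / (N + B / 2) to the same made-up distribution is as follows:

>>> from nltk.probability import FreqDist, ELEProbDist
>>> fdist = FreqDist(['a', 'b', 'a'])
>>> ele = ELEProbDist(fdist)
>>> ele.prob('a')   # (2 + 0.5) / (3 + 2 * 0.5)
0.625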



class WittenBellProbDist(ProbDistI):
    """
    The WittenBellProbDist is used to obtain a probability
    distribution. It assigns a uniform probability mass to previously
    unseen samples on the basis of the number of sample types seen
    before. The probability mass for an unseen sample is given as
    follows:

    T / (N + T)

    Here, T is the number of sample types observed and N is the total
    number of events observed. This is equal to the maximum likelihood
    estimate of a new type occurring. The sum of all the probabilities
    is equal to one:

          p = T / (Z * (N + T)), if count = 0
          p = c / (N + T), otherwise
    """

    def __init__(self, freqdist, bins=None):
        """
        Create the probability distribution that assigns a uniform
        probability mass to unseen samples:

          p = T / (Z * (N + T)), if count = 0
          p = c / (N + T), otherwise

        Z is the normalizing factor that is calculated using these
        values and the bins value.

        param freqdist: the frequency counts from which the probability
        distribution is estimated.

        param bins: the number of possible types of samples.
        """
        assert bins is None or bins >= freqdist.B(), \
            'bins parameter must not be less than %d=freqdist.B()' % freqdist.B()
        if bins is None:
            bins = freqdist.B()
        self._freqdist = freqdist
        self._T = self._freqdist.B()
        self._Z = bins - self._freqdist.B()
        self._N = self._freqdist.N()
        # self._P0 is P(0), precalculated for efficiency:
        if self._N == 0:
            # if freqdist is empty, we approximate P(0) by a UniformProbDist:
            self._P0 = 1.0 / self._Z
        else:
            self._P0 = self._T / float(self._Z * (self._N + self._T))

    def prob(self, sample):
        # inherit docs from ProbDistI
        c = self._freqdist[sample]
        return (c / float(self._N + self._T) if c != 0 else self._P0)

    def max(self):
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def freqdist(self):
        return self._freqdist

    def discount(self):
        raise NotImplementedError()

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<WittenBellProbDist based on %d samples>' % self._freqdist.N()
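
A sketch of WittenBellProbDist on the same made-up distribution, with five possible sample types, is as follows (here T = 2 seen types, N = 3 events, and Z = bins - T = 3):

>>> from nltk.probability import FreqDist, WittenBellProbDist
>>> fdist = FreqDist(['a', 'b', 'a'])
>>> wb = WittenBellProbDist(fdist, bins=5)
>>> wb.prob('a')   # seen sample: c / (N + T) = 2 / (3 + 2)
0.4
>>> wb.prob('c')   # unseen sample: T / (Z * (N + T)) = 2 / (3 * 5)
0.13333333333333333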

We can now test these estimators. The following session compares the tagging accuracy obtained with the different estimators; it relies on the train_and_test() helper and the tagged corpus that are set up in the next section on HMM estimation:

>>> import nltk
>>> from nltk.probability import *
>>> train_and_test(mle)
28.76%
>>> train_and_test(LaplaceProbDist)
69.16%
>>> train_and_test(ELEProbDist)
76.38%
>>> def lidstone(gamma):
    return lambda fd, bins: LidstoneProbDist(fd, gamma, bins)

>>> train_and_test(lidstone(0.1))
86.17%
>>> train_and_test(lidstone(0.5))
76.38%
>>> train_and_test(lidstone(1.0))
69.16%
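
Note that mle is not defined by NLTK itself; it stands for an estimator function built on MLEProbDist, mirroring NLTK's HMM demo code. A minimal sketch of such an estimator (the name mle is an assumption) is as follows:

>>> # Hypothetical mle estimator: builds an MLEProbDist from a frequency
>>> # distribution; the bins argument supplied by the HMM trainer is ignored.
>>> mle = lambda fd, bins: MLEProbDist(fd)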

Hidden Markov Model estimation

A Hidden Markov Model (HMM) comprises observed states and the latent states that help determine them. Consider the following diagrammatic description of an HMM, where x represents the latent states and y represents the observed states:

[Figure: a Hidden Markov Model, with latent states x and observed states y]

We can perform testing using HMM estimation. Let's consider the Brown Corpus and the code given here:

>>> import nltk
>>> corpus = nltk.corpus.brown.tagged_sents(categories='adventure')[:700]
>>> print(len(corpus))
700
>>> from nltk.util import unique_list
>>> tag_set = unique_list(tag for sent in corpus for (word,tag) in sent)
>>> print(len(tag_set))
104
>>> symbols = unique_list(word for sent in corpus for (word,tag) in sent)
>>> print(len(symbols))
1908
>>> trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
>>> train_corpus = []
>>> test_corpus = []
>>> for i in range(len(corpus)):
    if i % 10:
        train_corpus += [corpus[i]]
    else:
        test_corpus += [corpus[i]]


>>> print(len(train_corpus))
630
>>> print(len(test_corpus))
70
>>> def train_and_test(est):
    hmm = trainer.train_supervised(train_corpus, estimator=est)
    print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))

In the preceding code, we have split the corpus into 90% training and 10% testing sentences, and defined a function that trains an HMM tagger with a given estimator and reports its accuracy on the test set.
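
With this helper in place, the other estimators described earlier can be evaluated in the same way; for example, a hypothetical call using the Witten-Bell estimator (the reported accuracy will depend on the corpus split) looks as follows:

>>> # assumes: from nltk.probability import WittenBellProbDist
>>> train_and_test(lambda fd, bins: WittenBellProbDist(fd, bins))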
