Smoothing is used to handle the words that have not occurred previously. So, the probability of unknown words is 0. To solve this problem, smoothing is used.
In the 18th century, Laplace invented add-one smoothing. In add-one smoothing, 1 is added to the count of each word. Instead of 1, any other value can also be added to the count of unknown words so that unknown words can be handled and their probability is non-zero. Pseudo count is the value (that is, either 1 or nonzero) that is added to the counts of unknown words to make their probability nonzero.
Let's consider the following code for add-one smoothing in NLTK:
>>> import nltk >>> corpus=u"<s> hello how are you doing ? Hope you find the book interesting. </s>".split() >>> sentence=u"<s>how are you doing</s>".split() >>> vocabulary=set(corpus) >>> len(vocabulary) 13 >>> cfd = nltk.ConditionalFreqDist(nltk.bigrams(corpus)) >>> # The corpus counts of each bigram in the sentence: >>> [cfd[a][b] for (a,b) in nltk.bigrams(sentence)] [0, 1, 0] >>> # The counts for each word in the sentence: >>> [cfd[a].N() for (a,b) in nltk.bigrams(sentence)] [0, 1, 2] >>> # There is already a FreqDist method for MLE probability: >>> [cfd[a].freq(b) for (a,b) in nltk.bigrams(sentence)] [0, 1.0, 0.0] >>> # Laplace smoothing of each bigram count: >>> [1 + cfd[a][b] for (a,b) in nltk.bigrams(sentence)] [1, 2, 1] >>> # We need to normalise the counts for each word: >>> [len(vocabulary) + cfd[a].N() for (a,b) in nltk.bigrams(sentence)] [13, 14, 15] >>> # The smoothed Laplace probability for each bigram: >>> [1.0 * (1+cfd[a][b]) / (len(vocabulary)+cfd[a].N()) for (a,b) in nltk.bigrams(sentence)] [0.07692307692307693, 0.14285714285714285, 0.06666666666666667]
Consider another way of performing Add-one smoothing or generating a Laplace probability distribution:
>>> # MLEProbDist is the unsmoothed probability distribution: >>> cpd_mle = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist, bins=len(vocabulary)) >>> # Now we can get the MLE probabilities by using the .prob method: >>> [cpd_mle[a].prob(b) for (a,b) in nltk.bigrams(sentence)] [0, 1.0, 0.0] >>> # LaplaceProbDist is the add-one smoothed ProbDist: >>> cpd_laplace = nltk.ConditionalProbDist(cfd, nltk.LaplaceProbDist, bins=len(vocabulary)) >>> # Getting the Laplace probabilities is the same as for MLE: >>> [cpd_laplace[a].prob(b) for (a,b) in nltk.bigrams(sentence)] [0.07692307692307693, 0.14285714285714285, 0.06666666666666667]
Good Turing was introduced by Alan Turing along with his statistical assistant I.J. Good. It is an efficient smoothing method that increases the performance of statistical techniques performed for linguistic tasks, such as word sense disambiguation (WSD), named entity recognition (NER), spelling correction, machine translation, and so on. This method helps to predict the probability of unseen objects. In this method, binomial distribution is exhibited by our objects of interest. This method is used to compute the mass probability for zero or low count samples on the basis of higher count samples . Simple Good Turing performs approximation from frequency to frequency by linear regression into a linear line in log space. If c is the adjusted count, it will compute the following:
c = (c + 1) N(c + 1) / N(c) for c >= 1
The samples with zero frequency in training = N(1) for c == 0.
Here, c is the original count and N(i) is the number of event types observed with count i
.
Bill Gale and Geoffrey Sampson have presented Simple Good Turing:
class SimpleGoodTuringProbDist(ProbDistI): """ Given a pair (pi, qi), where pi refers to the frequency and qi refers to the frequency of frequency, our aim is to minimize the square variation. E(p) and E(q) is the mean of pi and qi. - slope, b = sigma ((pi-E(p)(qi-E(q))) / sigma ((pi-E(p))(pi-E(p))) - intercept: a = E(q) - b.E(p) """ SUM_TO_ONE = False def __init__(self, freqdist, bins=None): """ param freqdist refers to the count of frequency from which probability distribution is estimated. Param bins is used to estimate the possible number of samples. """ assert bins is None or bins > freqdist.B(), 'bins parameter must not be less than %d=freqdist.B()+1' % (freqdist.B()+1) if bins is None: bins = freqdist.B() + 1 self._freqdist = freqdist self._bins = bins r, nr = self._r_Nr() self.find_best_fit(r, nr) self._switch(r, nr) self._renormalize(r, nr) def _r_Nr_non_zero(self): r_Nr = self._freqdist.r_Nr() del r_Nr[0] return r_Nr def _r_Nr(self): """ Split the frequency distribution in two list (r, Nr), where Nr(r) > 0 """ nonzero = self._r_Nr_non_zero() if not nonzero: return [], [] return zip(*sorted(nonzero.items())) def find_best_fit(self, r, nr): """ Use simple linear regression to tune parameters self._slope and self._intercept in the log-log space based on count and Nr(count) (Work in log space to avoid floating point underflow.) """ # For higher sample frequencies the data points becomes horizontal # along line Nr=1. To create a more evident linear model in log-log # space, we average positive Nr values with the surrounding zero # values. (Church and Gale, 1991) if not r or not nr: # Empty r or nr? return zr = [] for j in range(len(r)): i = (r[j-1] if j > 0 else 0) k = (2 * r[j] - i if j == len(r) - 1 else r[j+1]) zr_ = 2.0 * nr[j] / (k - i) zr.append(zr_) log_r = [math.log(i) for i in r] log_zr = [math.log(i) for i in zr] xy_cov = x_var = 0.0 x_mean = 1.0 * sum(log_r) / len(log_r) y_mean = 1.0 * sum(log_zr) / len(log_zr) for (x, y) in zip(log_r, log_zr): xy_cov += (x - x_mean) * (y - y_mean) x_var += (x - x_mean)**2 self._slope = (xy_cov / x_var if x_var != 0 else 0.0) if self._slope >= -1: warnings.warn('SimpleGoodTuring did not find a proper best fit ' 'line for smoothing probabilities of occurrences. ' 'The probability estimates are likely to be ' 'unreliable.') self._intercept = y_mean - self._slope * x_mean def _switch(self, r, nr): """ Calculate the r frontier where we must switch from Nr to Sr when estimating E[Nr]. """ for i, r_ in enumerate(r): if len(r) == i + 1 or r[i+1] != r_ + 1: # We are at the end of r, or there is a gap in r self._switch_at = r_ break Sr = self.smoothedNr smooth_r_star = (r_ + 1) * Sr(r_+1) / Sr(r_) unsmooth_r_star = 1.0 * (r_ + 1) * nr[i+1] / nr[i] std = math.sqrt(self._variance(r_, nr[i], nr[i+1])) if abs(unsmooth_r_star-smooth_r_star) <= 1.96 * std: self._switch_at = r_ break def _variance(self, r, nr, nr_1): r = float(r) nr = float(nr) nr_1 = float(nr_1) return (r + 1.0)**2 * (nr_1 / nr**2) * (1.0 + nr_1 / nr) def _renormalize(self, r, nr): """
Renormalization is very crucial to ensure that the proper distribution of probability is obtained. It can be obtained by making the probability estimate of an unseen sample N(1)/N and then, renormalizing all the previously seen sample probabilities:
""" prob_cov = 0.0 for r_, nr_ in zip(r, nr): prob_cov += nr_ * self._prob_measure(r_) if prob_cov: self._renormal = (1 - self._prob_measure(0)) / prob_cov def smoothedNr(self, r): """ Return the number of samples with count r. """ # Nr = a*r^b (with b < -1 to give the appropriate hyperbolic # relationship) # Estimate a and b by simple linear regression technique on # the logarithmic form of the equation: log Nr = a + b*log(r) return math.exp(self._intercept + self._slope * math.log(r)) def prob(self, sample): """ Return the sample's probability. """ count = self._freqdist[sample] p = self._prob_measure(count) if count == 0: if self._bins == self._freqdist.B(): p = 0.0 else: p = p / (1.0 * self._bins - self._freqdist.B()) else: p = p * self._renormal return p def _prob_measure(self, count): if count == 0 and self._freqdist.N() == 0 : return 1.0 elif count == 0 and self._freqdist.N() != 0: return 1.0 * self._freqdist.Nr(1) / self._freqdist.N() if self._switch_at > count: Er_1 = 1.0 * self._freqdist.Nr(count+1) Er = 1.0 * self._freqdist.Nr(count) else: Er_1 = self.smoothedNr(count+1) Er = self.smoothedNr(count) r_star = (count + 1) * Er_1 / Er return r_star / self._freqdist.N() def check(self): prob_sum = 0.0 for i in range(0, len(self._Nr)): prob_sum += self._Nr[i] * self._prob_measure(i) / self._renormal print("Probability Sum:", prob_sum) #assert prob_sum != 1.0, "probability sum should be one!" def discount(self): """ It is used to provide the total probability transfers from the seen events to the unseen events. """ return 1.0 * self.smoothedNr(1) / self._freqdist.N() def max(self): return self._freqdist.max() def samples(self): return self._freqdist.keys() def freqdist(self): return self._freqdist def __repr__(self): """ It obtains the string representation of ProbDist. """ return '<SimpleGoodTuringProbDist based on %d samples>' % self._freqdist.N()
Let's see the code for Simple Good Turing in NLTK:
>>> gt = lambda fd, bins: SimpleGoodTuringProbDist(fd, bins=1e5) >>> train_and_test(gt) 5.17%
Kneser Ney is used with trigrams. Let's see the following code in NLTK for the Kneser Ney estimation:
>>> import nltk >>> corpus = [[((x[0],y[0],z[0]),(x[1],y[1],z[1])) for x, y, z in nltk.trigrams(sent)] for sent in corpus[:100]] >>> tag_set = unique_list(tag for sent in corpus for (word,tag) in sent) >>> len(tag_set) 906 >>> symbols = unique_list(word for sent in corpus for (word,tag) in sent) >>> len(symbols) 1341 >>> trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols) >>> train_corpus = [] >>> test_corpus = [] >>> for i in range(len(corpus)): if i % 10: train_corpus += [corpus[i]] else: test_corpus += [corpus[i]] >>> len(train_corpus) 90 >>> len(test_corpus) 10 >>> kn = lambda fd, bins: KneserNeyProbDist(fd) >>> train_and_test(kn) 0.86%