Accounting for unseen words and other oddities

When we calculated the probabilities earlier, we actually cheated ourselves. We were not calculating the real probabilities, but only rough approximations by means of fractions. We assumed that the training corpus would tell us the whole truth about the real probabilities. It did not. A corpus of only six tweets obviously cannot give us all the information about every tweet that has ever been written. For example, there certainly are tweets containing the word "text"; it is only that we have never seen them. Clearly, our approximation is very rough, and we should account for that. In practice, this is often done with so-called add-one smoothing.

Add-one smoothing is sometimes also referred to as additive smoothing or Laplace smoothing. Note that Laplace smoothing has nothing to do with Laplacian smoothing, which is related to the smoothing of polygon meshes. If we smooth not by 1 but by an adjustable parameter alpha > 0, it is called Lidstone smoothing.

It is a very simple technique that adds 1 to all feature occurrence counts. The underlying assumption is that even if we have not seen a given word in the whole corpus, there is still a chance that it exists and our sample of tweets simply happened not to include it. So, with add-one smoothing, we pretend that we have seen every occurrence once more than we actually did. If we write c(awesome) for the number of times awesome occurs in the positive tweets and N for the total number of word occurrences in the positive tweets, that means that instead of calculating

P(awesome | pos) = c(awesome) / N

we now calculate

P(awesome | pos) = (c(awesome) + 1) / (N + 2)
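
To make this concrete, here is a minimal Python sketch of additive smoothing over our two-word vocabulary (the counts below are made up for illustration, not taken from the tweet corpus). With alpha=1 we get add-one (Laplace) smoothing; any other alpha > 0 gives the Lidstone variant mentioned earlier:

    def smoothed_prob(count, total, vocab_size, alpha=1.0):
        # Additive smoothing: alpha=1 is add-one (Laplace) smoothing,
        # any other alpha > 0 is Lidstone smoothing.
        return (count + alpha) / (total + alpha * vocab_size)

    # Hypothetical counts for the positive class over the vocabulary
    # {awesome, crazy}: "awesome" seen 5 times, "crazy" never seen.
    p_awesome = smoothed_prob(5, 5, vocab_size=2)  # (5 + 1) / (5 + 2) = 6/7
    p_crazy = smoothed_prob(0, 5, vocab_size=2)    # (0 + 1) / (5 + 2) = 1/7, not zero
    print(p_awesome, p_crazy)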

Why do we add 2 in the denominator? Because we have two features: the occurrence of awesome and the occurrence of crazy. Since we add 1 for each feature, we have to make sure that the end result is still a probability. And indeed, we get 1 as the total probability:

P(awesome | pos) + P(crazy | pos) = (c(awesome) + 1) / (N + 2) + (c(crazy) + 1) / (N + 2) = (N + 2) / (N + 2) = 1

because c(awesome) + c(crazy) = N.
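
As a quick sanity check, the following sketch (again with made-up counts) verifies numerically that the smoothed probabilities of our two words still sum to 1:

    def smoothed_prob(count, total, vocab_size, alpha=1.0):
        return (count + alpha) / (total + alpha * vocab_size)

    counts = {"awesome": 5, "crazy": 0}  # hypothetical word counts for one class
    total = sum(counts.values())
    probs = {word: smoothed_prob(c, total, len(counts)) for word, c in counts.items()}
    print(sum(probs.values()))  # 1.0

If you later use scikit-learn, this is the same additive smoothing that the alpha parameter of its Naive Bayes classifiers (for example, MultinomialNB) controls.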
