Creating a model of likely word tags

As previously mentioned in the Training a unigram part-of-speech tagger recipe, using a custom model with a UnigramTagger class should only be done if you know exactly what you're doing. In this recipe, we're going to create a model for the most common words, most of which always have the same tag no matter what.

How to do it...

To find the most common words, we can use nltk.probability.FreqDist to count word frequencies in the treebank corpus. Then, we can create a ConditionalFreqDist class for tagged words, where we count the frequency of every tag for every word. Using these counts, we can construct a model whose keys are the 200 most frequent words, with each word's most frequent tag as its value. Here's the model creation function, defined in tag_util.py:

from nltk.probability import FreqDist, ConditionalFreqDist

def word_tag_model(words, tagged_words, limit=200):
    # Count the frequency of each word
    fd = FreqDist(words)
    # For each word, count the frequency of each of its tags
    cfd = ConditionalFreqDist(tagged_words)
    # Take the limit most frequent words
    most_freq = (word for word, count in fd.most_common(limit))
    # Map each word to its single most frequent tag
    return dict((word, cfd[word].max()) for word in most_freq)

And to use it with a UnigramTagger class, we can do the following (as in the earlier recipes, train_sents and test_sents are treebank.tagged_sents()[:3000] and treebank.tagged_sents()[3000:], respectively):

>>> from nltk.tag import UnigramTagger
>>> from nltk.corpus import treebank
>>> from tag_util import word_tag_model
>>> model = word_tag_model(treebank.words(), treebank.tagged_words())
>>> tagger = UnigramTagger(model=model)
>>> tagger.evaluate(test_sents)
0.559680552557738

An accuracy of almost 56% is OK, but nowhere near as good as the trained UnigramTagger. Let's try adding it to our backoff chain:

>>> from nltk.tag import DefaultTagger, BigramTagger, TrigramTagger
>>> from tag_util import backoff_tagger
>>> default_tagger = DefaultTagger('NN')
>>> likely_tagger = UnigramTagger(model=model, backoff=default_tagger)
>>> tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=likely_tagger)
>>> tagger.evaluate(test_sents)
0.8806820634578028

The final accuracy is exactly the same as without the likely_tagger. This is because the frequency counting we did to create the model is essentially what happens internally when we train a UnigramTagger class, so the likely_tagger contributes no new information.
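You can verify this overlap with a quick check. The following snippet is a sketch, not part of the original recipe; it assumes the model and train_sents from the preceding code:

from nltk.tag import UnigramTagger

# Train a unigram tagger the usual way, then compare its tag choices
# against our hand-built model for the same 200 common words
trained = UnigramTagger(train_sents)
common_words = list(model.keys())

# A unigram tagger ignores context, so tagging the words as one list is fine
trained_tags = dict(trained.tag(common_words))

agree = sum(1 for w in common_words if trained_tags[w] == model[w])
print(agree, '/', len(common_words))  # agreement should be close to 200/200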

How it works...

The word_tag_model() function takes a list of all words, a list of all tagged words, and the maximum number of words to use for the model (200 by default). We give the list of words to a FreqDist class, which counts the frequency of each word. Then, we get the top 200 words by calling fd.most_common(limit), which returns a list of (word, count) tuples ordered from the most common word to the least. The FreqDist class is actually a subclass of collections.Counter, which provides the most_common() method.

Next, we give the list of tagged words to ConditionalFreqDist, which creates a FreqDist class of tags for each word, with the word as the condition. Finally, we return a dict of the top 200 words mapped to their most likely tag.
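To make these two distributions concrete, here is a small sketch (not part of the original recipe) that inspects them directly; the printed counts depend on the treebank data:

from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.corpus import treebank

fd = FreqDist(treebank.words())
cfd = ConditionalFreqDist(treebank.tagged_words())

# most_common() returns (word, count) pairs, most frequent first
print(fd.most_common(3))

# cfd[word] is a FreqDist of that word's tags; max() returns the most
# frequent tag, which for 'the' should be the determiner tag 'DT'
print(cfd['the'].max())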

Note

In the previous edition of this book, we used the keys() method of the FreqDist class, because in NLTK 2 the keys were returned in sorted order, from the most frequent to the least. But in NLTK 3, FreqDist inherits from collections.Counter, and the keys() method has no guaranteed ordering, so we call most_common() instead.

There's more...

It may seem pointless to include this tagger, since it does not change the accuracy. But the point of this recipe is to demonstrate how to construct a useful model for a UnigramTagger class. Constructing a custom model is a way to manually override trained taggers, which are otherwise black boxes. And by putting the likely_tagger at the front of the chain, we can actually improve accuracy a little bit:

>>> tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=default_tagger)
>>> likely_tagger = UnigramTagger(model=model, backoff=tagger)
>>> likely_tagger.evaluate(test_sents)
0.8824088063889488

Putting custom model taggers at the front of the backoff chain gives you complete control over how specific words are tagged, while letting the trained taggers handle everything else.
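For example, here is a minimal sketch of such an override. The words and tags in the overrides dict are hypothetical, chosen purely for illustration, and tagger is the trained backoff chain from the previous code:

from nltk.tag import UnigramTagger

# A hand-built model that forces chosen words to specific tags
overrides = {'cool': 'JJ', 'rock': 'VB'}

override_tagger = UnigramTagger(model=overrides, backoff=tagger)

# 'cool' and 'rock' always get our tags; every other word falls
# through to the trained backoff chain
print(override_tagger.tag(['the', 'rock', 'is', 'cool']))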

See also

The Training a unigram part-of-speech tagger recipe has details on the UnigramTagger class and a simple custom model example. See the earlier recipes Combining taggers with backoff tagging and Training and combining ngram taggers for details on backoff tagging.
