As previously mentioned in the Training a unigram part-of-speech tagger recipe, using a custom model with a UnigramTagger
class should only be done if you know exactly what you're doing. In this recipe, we're going to create a model for the most common words, most of which always have the same tag no matter what.
To find the most common words, we can use nltk.probability.FreqDist to count word frequencies in the treebank corpus. Then, we can create a ConditionalFreqDist for the tagged words, where we count the frequency of every tag for every word. Using these counts, we can construct a model of the 200 most frequent words as keys, with the most frequent tag for each word as the value. Here's the model creation function defined in tag_util.py:
from nltk.probability import FreqDist, ConditionalFreqDist

def word_tag_model(words, tagged_words, limit=200):
    fd = FreqDist(words)
    cfd = ConditionalFreqDist(tagged_words)
    most_freq = (word for word, count in fd.most_common(limit))
    return dict((word, cfd[word].max()) for word in most_freq)
And to use it with a UnigramTagger
class, we can do the following:
>>> from tag_util import word_tag_model
>>> from nltk.corpus import treebank
>>> model = word_tag_model(treebank.words(), treebank.tagged_words())
>>> tagger = UnigramTagger(model=model)
>>> tagger.evaluate(test_sents)
0.559680552557738
An accuracy of almost 56% is reasonable, but nowhere near as good as the trained UnigramTagger. Let's try adding it to our backoff chain:
>>> default_tagger = DefaultTagger('NN')
>>> likely_tagger = UnigramTagger(model=model, backoff=default_tagger)
>>> tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=likely_tagger)
>>> tagger.evaluate(test_sents)
0.8806820634578028
The final accuracy is exactly the same as without the likely_tagger
. This is because the frequency calculations we did to create the model are almost exactly the same as what happens when we train a UnigramTagger
class.
The word_tag_model() function takes a list of all words, a list of all tagged words, and the maximum number of words we want to use for our model. We give the list of words to a FreqDist class, which counts the frequency of each word. Then, we get the top 200 words from the FreqDist by calling fd.most_common(), which returns a list of the most common words along with their counts. The FreqDist class is actually a subclass of collections.Counter, which provides the most_common() method.
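Because FreqDist inherits its counting behavior from collections.Counter, the same API can be demonstrated with a plain Counter; this sketch uses a few made-up words rather than the treebank:

```python
from collections import Counter

# Counter provides the same most_common() API that FreqDist inherits.
words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
fd = Counter(words)

# The two most frequent words, as (word, count) pairs, highest count first.
print(fd.most_common(2))  # [('the', 3), ('cat', 2)]
```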
Next, we give the list of tagged words to ConditionalFreqDist, which creates a FreqDist of tags for each word, with the word as the condition. Finally, we return a dict of the top 200 words mapped to their most likely tag.
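The conditional counting can be sketched without NLTK: a ConditionalFreqDist behaves much like a dict mapping each condition (here, the word) to a Counter of tags, and cfd[word].max() returns the highest-count tag. Here's a minimal stand-in using a few invented tagged words:

```python
from collections import defaultdict, Counter

# A minimal stand-in for ConditionalFreqDist: word -> Counter of tags.
tagged_words = [('the', 'DT'), ('dog', 'NN'), ('the', 'DT'),
                ('runs', 'VBZ'), ('dog', 'NN'), ('dog', 'VB')]
cfd = defaultdict(Counter)
for word, tag in tagged_words:
    cfd[word][tag] += 1

# The equivalent of NLTK's cfd['dog'].max(): the most frequent tag.
print(cfd['dog'].most_common(1)[0][0])  # NN
```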
In the previous edition of this book, we used the keys() method of the FreqDist class, because in NLTK2 the keys were returned in sorted order, from most frequent to least. But in NLTK3, FreqDist inherits from collections.Counter, and its keys() method does not return keys in any frequency order, which is why we call most_common() instead.
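The difference is easy to see with a plain collections.Counter (which FreqDist inherits from): keys() reflects insertion order rather than frequency order, while most_common() sorts by descending count:

```python
from collections import Counter

c = Counter('abbccc')  # counts: a -> 1, b -> 2, c -> 3

# keys() follows insertion order, not frequency order.
print(list(c.keys()))   # ['a', 'b', 'c']

# most_common() is sorted by descending count.
print(c.most_common())  # [('c', 3), ('b', 2), ('a', 1)]
```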
It may seem useless to include this tagger as it does not change the accuracy. But the point of this recipe is to demonstrate how to construct a useful model for a UnigramTagger
class. Custom model construction is a way to create a manual override of trained taggers that are otherwise black boxes. And by putting the likely_tagger
at the front of the chain, we can actually improve accuracy a little bit:
>>> tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=default_tagger)
>>> likely_tagger = UnigramTagger(model=model, backoff=tagger)
>>> likely_tagger.evaluate(test_sents)
0.8824088063889488
Putting custom model taggers at the front of the backoff chain gives you complete control over how specific words are tagged, while letting the trained taggers handle everything else.
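The override idea can be sketched outside NLTK: a model-based tagger answers only for words in its dict and defers everything else to its backoff, so entries in the custom model always win when it sits at the front of the chain. This is an illustrative toy, not NLTK's implementation:

```python
# Toy backoff chain: each tagger answers only what it knows,
# deferring unknown words to its backoff (illustrative, not NLTK code).
class ModelTagger:
    def __init__(self, model, backoff=None):
        self.model = model
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.model:
            return self.model[word]             # custom model wins
        if self.backoff is not None:
            return self.backoff.tag_word(word)  # defer down the chain
        return None

trained = ModelTagger({'dog': 'NN', 'bank': 'NN'})
likely = ModelTagger({'bank': 'VB'}, backoff=trained)  # front of the chain

print(likely.tag_word('bank'))  # VB - the custom model overrides
print(likely.tag_word('dog'))   # NN - falls back to the trained tagger
```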