In addition to UnigramTagger, there are two more NgramTagger subclasses: BigramTagger and TrigramTagger. The BigramTagger subclass uses the previous tag as part of its context, while the TrigramTagger subclass uses the previous two tags. An ngram is a subsequence of n items, so the BigramTagger subclass looks at two items (the previous tagged word and the current word), and the TrigramTagger subclass looks at three items.
These two taggers are good at handling words whose part-of-speech tag is context-dependent. Many words have a different part of speech depending on how they are used. For example, we've been talking about taggers that tag words. In this case, tag is used as a verb. But the result of tagging is a part-of-speech tag, so tag can also be a noun. The idea with the NgramTagger subclasses is that by looking at the previous words and part-of-speech tags, we can better guess the part-of-speech tag for the current word. Internally, each tagger maintains a context dictionary (implemented in the ContextTagger parent class) that is used to guess the tag based on the context. In the case of NgramTagger subclasses, the context is some number of previous tagged words.
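If you're curious how much context a trained tagger has learned, ContextTagger provides a size() method that returns the number of distinct contexts in this dictionary. A quick sketch, assuming the train_sents from the earlier recipes (the actual number will depend on the corpus):

>>> from nltk.tag import BigramTagger
>>> bitagger = BigramTagger(train_sents)
>>> bitagger.size()  # number of distinct contexts learned; value depends on the corpus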
Refer to the first two recipes of this chapter for details on constructing train_sents and test_sents.
By themselves, BigramTagger and TrigramTagger perform quite poorly. This is partly because they cannot learn context from the first word(s) in a sentence. Since a UnigramTagger class doesn't care about the previous context, it is able to achieve a higher baseline accuracy by simply guessing the most common tag for each word.
>>> from nltk.tag import BigramTagger, TrigramTagger
>>> bitagger = BigramTagger(train_sents)
>>> bitagger.evaluate(test_sents)
0.11310166199007123
>>> tritagger = TrigramTagger(train_sents)
>>> tritagger.evaluate(test_sents)
0.0688107058061731
Where BigramTagger and TrigramTagger can make a contribution is when we combine them with backoff tagging. This time, instead of creating each tagger individually, we'll create a function that takes train_sents, a list of SequentialBackoffTagger classes, and an optional final backoff tagger, then trains each tagger with the previous tagger as a backoff. Here's the code from tag_util.py:
def backoff_tagger(train_sents, tagger_classes, backoff=None):
    # Train each tagger class in order, passing the previously
    # trained tagger in as its backoff.
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff
And to use it, we can do the following:
>>> from tag_util import backoff_tagger
>>> backoff = DefaultTagger('NN')
>>> tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=backoff)
>>> tagger.evaluate(test_sents)
0.8806820634578028
So, we've gained almost 1% accuracy by including the BigramTagger and TrigramTagger subclasses in the backoff chain. For corpora other than treebank, the accuracy gain may be more or less significant, depending on the nature of the text.
The backoff_tagger function creates an instance of each tagger class in the list, giving it train_sents and the previous tagger as a backoff. The order of the list of tagger classes is quite important: the first class in the list (UnigramTagger) will be trained first and given the initial backoff tagger (the DefaultTagger). This tagger will then become the backoff tagger for the next tagger class in the list. The final tagger returned will be an instance of the last tagger class in the list (TrigramTagger). Here's some code to clarify this chain:
>>> tagger._taggers[-1] == backoff
True
>>> isinstance(tagger._taggers[0], TrigramTagger)
True
>>> isinstance(tagger._taggers[1], BigramTagger)
True
So, we get a TrigramTagger, whose first backoff is a BigramTagger. Then, the next backoff will be a UnigramTagger, whose backoff is the DefaultTagger.
The backoff_tagger function doesn't just work with NgramTagger classes; it can also be used to construct a chain containing any subclasses of SequentialBackoffTagger.
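For example, here's a sketch (assuming the same train_sents and backoff as before) that drops an AffixTagger, another SequentialBackoffTagger subclass, into the chain; any class whose constructor accepts train_sents and a backoff keyword argument will work:

>>> from nltk.tag import AffixTagger
>>> from tag_util import backoff_tagger
>>> tagger = backoff_tagger(train_sents, [AffixTagger, UnigramTagger, BigramTagger, TrigramTagger], backoff=backoff)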
Because they are subclasses of NgramTagger and ContextTagger, BigramTagger and TrigramTagger can also take model and cutoff arguments, just like UnigramTagger. But unlike with UnigramTagger, the context keys of the model must be 2-tuples, where the first element is a section of the history and the second element is the current token. For BigramTagger, an appropriate context key looks like ((prevtag,), word), and for TrigramTagger, it looks like ((prevtag1, prevtag2), word).
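To make this concrete, here's a minimal sketch of passing a hand-built model to BigramTagger; the single model entry and the DefaultTagger tag are invented purely for illustration:

>>> from nltk.tag import BigramTagger, DefaultTagger
>>> # Hypothetical model: after a determiner (DT), tag the word 'tag' as a noun
>>> model = {(('DT',), 'tag'): 'NN'}
>>> bitagger = BigramTagger(model=model, backoff=DefaultTagger('DT'))
>>> bitagger.tag(['a', 'tag'])
[('a', 'DT'), ('tag', 'NN')]

The first word falls through to the backoff tagger because its context key, ((), 'a'), isn't in the model; the resulting 'DT' tag then becomes the history for the second word's lookup.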
The NgramTagger class can be used by itself to create a tagger that uses ngrams of more than three items for its context key.
>>> from nltk.tag import NgramTagger
>>> quadtagger = NgramTagger(4, train_sents)
>>> quadtagger.evaluate(test_sents)
0.058234405352903085
It's even worse than the TrigramTagger! Here's an alternative implementation, a QuadgramTagger class that we can include in a list passed to backoff_tagger. This code can be found in taggers.py:
from nltk.tag import NgramTagger

class QuadgramTagger(NgramTagger):
    def __init__(self, *args, **kwargs):
        NgramTagger.__init__(self, 4, *args, **kwargs)
This is essentially how BigramTagger and TrigramTagger are implemented: simple subclasses of NgramTagger that pass in a fixed value of n, which controls how many previous tags the context() method looks at in its history argument.
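For reference, here's a sketch of what those two subclasses look like, following the same pattern as the QuadgramTagger class above:

class BigramTagger(NgramTagger):
    def __init__(self, *args, **kwargs):
        NgramTagger.__init__(self, 2, *args, **kwargs)

class TrigramTagger(NgramTagger):
    def __init__(self, *args, **kwargs):
        NgramTagger.__init__(self, 3, *args, **kwargs)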
Now, let's see how it does as part of a backoff chain.
>>> from taggers import QuadgramTagger
>>> quadtagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger, QuadgramTagger], backoff=backoff)
>>> quadtagger.evaluate(test_sents)
0.8806388948845241
It's actually slightly worse than before, when we stopped with the TrigramTagger. So, the lesson is that too much context can have a negative effect on accuracy.