Training and combining ngram taggers

In addition to UnigramTagger, there are two more NgramTagger subclasses: BigramTagger and TrigramTagger. The BigramTagger subclass uses the previous tag as part of its context, while the TrigramTagger subclass uses the previous two tags. An ngram is a subsequence of n items, so the BigramTagger subclass looks at two items (the previous tagged word and the current word), and the TrigramTagger subclass looks at three items.

These two taggers are good at handling words whose part-of-speech tag is context-dependent. Many words have a different part of speech depending on how they are used. For example, we've been talking about taggers that tag words. In this case, tag is used as a verb. But the result of tagging is a part-of-speech tag, so tag can also be a noun. The idea with the NgramTagger subclasses is that by looking at the previous words and part-of-speech tags, we can better guess the part-of-speech tag for the current word. Internally, each tagger maintains a context dictionary (implemented in the ContextTagger parent class) that is used to guess the tag based on the context. In the case of NgramTagger subclasses, the context is the current word together with the tags of some number of previous words.
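To make the idea of a context dictionary concrete, here is a minimal pure-Python sketch of how a bigram-style table could be built. It is not NLTK's implementation: the helper name build_bigram_contexts and the toy sentences are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def build_bigram_contexts(tagged_sents):
    """Map (previous tag, current word) to the tag seen most often in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # there is no previous tag at the start of a sentence
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: tags.most_common(1)[0][0] for context, tags in counts.items()}

# Toy training data (hypothetical): "tag" appears as both a verb and a noun.
toy_sents = [
    [('we', 'PRP'), ('tag', 'VBP'), ('words', 'NNS')],
    [('the', 'DT'), ('tag', 'NN'), ('is', 'VBZ'), ('wrong', 'JJ')],
]

contexts = build_bigram_contexts(toy_sents)
print(contexts[('PRP', 'tag')])  # VBP
print(contexts[('DT', 'tag')])   # NN
```

The same word gets a different tag depending on the tag that precedes it, which is the behavior the context dictionary is meant to capture.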

Getting ready

How to do it...

By themselves, BigramTagger and TrigramTagger perform quite poorly. This is partly because they cannot learn context from the first word(s) in a sentence. Since a UnigramTagger class doesn't care about the previous context, it is able to have higher baseline accuracy by simply guessing the most common tag for each word.

>>> from nltk.tag import BigramTagger, TrigramTagger
>>> bitagger = BigramTagger(train_sents)
>>> bitagger.evaluate(test_sents)
>>> tritagger = TrigramTagger(train_sents)
>>> tritagger.evaluate(test_sents)
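One way to see this weakness in isolation is to train on a tiny hand-made corpus and tag a sentence containing an unseen word. The sketch below assumes NLTK is installed; the toy sentence and variable names are made up for illustration and are far too small for real use.

```python
from nltk.tag import BigramTagger, UnigramTagger

# Toy training data (hypothetical).
toy_train = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]]
sentence = ['the', 'cat', 'barks']

# The unigram tagger only misses the unseen word "cat".
unigram_tags = UnigramTagger(toy_train).tag(sentence)
print(unigram_tags)  # [('the', 'DT'), ('cat', None), ('barks', 'VBZ')]

# The bigram tagger misses "cat" and then also "barks", because the
# previous tag in its context is now None, a context it never saw in training.
bigram_tags = BigramTagger(toy_train).tag(sentence)
print(bigram_tags)  # [('the', 'DT'), ('cat', None), ('barks', None)]
```

A single unknown context is enough to derail the rest of the sentence for the bigram tagger, while the unigram tagger recovers on the next known word.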

Where BigramTagger and TrigramTagger can make a contribution is when we combine them with backoff tagging. This time, instead of creating each tagger individually, we'll create a function that will take train_sents, a list of SequentialBackoffTagger classes, and an optional final backoff tagger, then train each tagger with the previous tagger as a backoff. Here's the code from

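def backoff_tagger(train_sents, tagger_classes, backoff=None):
  # Train each class in order; each new tagger backs off to the previous one.
  for cls in tagger_classes:
    backoff = cls(train_sents, backoff=backoff)
  # The last tagger trained is the head of the backoff chain.
  return backoff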

And to use it, we can do the following:

>>> from nltk.tag import DefaultTagger, UnigramTagger
>>> from tag_util import backoff_tagger
>>> backoff = DefaultTagger('NN')
>>> tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=backoff)
>>> tagger.evaluate(test_sents)

So, we've gained almost 1% accuracy by including the BigramTagger and TrigramTagger subclasses in the backoff chain. For corpora other than treebank, the accuracy gain may be more or less significant, depending on the nature of the text.

How it works...

The backoff_tagger function creates an instance of each tagger class in the list, giving it train_sents and the previous tagger as a backoff. The order of the list of tagger classes is quite important: the first class in the list (UnigramTagger) will be trained first and given the initial backoff tagger (the DefaultTagger). This tagger will then become the backoff tagger for the next tagger class in the list. The final tagger returned will be an instance of the last tagger class in the list (TrigramTagger). Here's some code to clarify this chain:

>>> tagger._taggers[-1] == backoff
>>> isinstance(tagger._taggers[0], TrigramTagger)
>>> isinstance(tagger._taggers[1], BigramTagger)

So, we get a TrigramTagger, whose first backoff is a BigramTagger. Then, the next backoff will be a UnigramTagger, whose backoff is the DefaultTagger.
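The loop in backoff_tagger is equivalent to nesting the constructors by hand. The following sketch, which assumes NLTK is installed and uses a hypothetical one-sentence training set, builds the chain both ways and compares the class names. Note that _taggers is a private attribute, so it may change between NLTK versions.

```python
from nltk.tag import BigramTagger, DefaultTagger, TrigramTagger, UnigramTagger

def backoff_tagger(train_sents, tagger_classes, backoff=None):
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff

# Toy training data (hypothetical).
toy_train = [[('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]]
default = DefaultTagger('NN')

looped = backoff_tagger(
    toy_train, [UnigramTagger, BigramTagger, TrigramTagger], backoff=default)

# The same chain written out by hand: the first class in the list ends up innermost.
nested = TrigramTagger(
    toy_train,
    backoff=BigramTagger(
        toy_train,
        backoff=UnigramTagger(toy_train, backoff=default)))

# _taggers lists the chain from the head down to the final backoff.
looped_names = [type(t).__name__ for t in looped._taggers]
nested_names = [type(t).__name__ for t in nested._taggers]
print(looped_names)
# ['TrigramTagger', 'BigramTagger', 'UnigramTagger', 'DefaultTagger']
```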

There's more...

The backoff_tagger function doesn't just work with NgramTagger classes; it can be used to construct a chain from any SequentialBackoffTagger subclasses whose constructors accept the training sentences as the first argument and a backoff keyword argument.
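As an illustration, NLTK's AffixTagger has a compatible constructor, so it can sit in the same list. This is a sketch under stated assumptions: NLTK is installed, the two training sentences are invented, and AffixTagger is left at its default settings (a three-character suffix).

```python
from nltk.tag import AffixTagger, BigramTagger, DefaultTagger, UnigramTagger

def backoff_tagger(train_sents, tagger_classes, backoff=None):
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff

# Toy training data (hypothetical).
toy_train = [
    [('the', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('running', 'VBG')],
    [('a', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sleeping', 'VBG')],
]

# AffixTagger is trained first, so it is consulted just before the DefaultTagger.
mixed = backoff_tagger(
    toy_train, [AffixTagger, UnigramTagger, BigramTagger],
    backoff=DefaultTagger('NN'))

# "jumping" never appears in training, but its "ing" suffix does.
mixed_tags = mixed.tag(['the', 'cat', 'is', 'jumping'])
print(mixed_tags)
# [('the', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('jumping', 'VBG')]
```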

BigramTagger and TrigramTagger, because they are subclasses of NgramTagger and ContextTagger, can also take model and cutoff arguments, just like the UnigramTagger. But unlike for UnigramTagger, the context keys of the model must be 2-tuples, where the first element is a section of the history (a tuple of previous tags) and the second element is the current token. For the BigramTagger, an appropriate context key looks like ((prevtag,), word), and for TrigramTagger, it looks like ((prevtag1, prevtag2), word).
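Here is a small sketch of a hand-written bigram model using keys of that shape. It assumes NLTK is installed; the model entries are hypothetical and chosen only to show the word tag receiving two different tags.

```python
from nltk.tag import BigramTagger, UnigramTagger

# Hand-written models (hypothetical entries, for illustration only).
unigram = UnigramTagger(model={'the': 'DT', 'to': 'TO'})
bigram = BigramTagger(
    model={
        (('DT',), 'tag'): 'NN',  # "the tag": noun after a determiner
        (('TO',), 'tag'): 'VB',  # "to tag": verb after "to"
    },
    backoff=unigram)

noun_use = bigram.tag(['the', 'tag'])
verb_use = bigram.tag(['to', 'tag'])
print(noun_use)  # [('the', 'DT'), ('tag', 'NN')]
print(verb_use)  # [('to', 'TO'), ('tag', 'VB')]
```

The unigram backoff supplies the tag for the first word, and that tag then becomes the history part of the bigram context key for the second word.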

Quadgram tagger

The NgramTagger class can be used by itself to create a tagger with an n greater than 3, such as a quadgram tagger with n=4.

>>> from nltk.tag import NgramTagger
>>> quadtagger = NgramTagger(4, train_sents)
>>> quadtagger.evaluate(test_sents)

It's even worse than the TrigramTagger! Here's an alternative implementation of a QuadgramTagger class that we can include in a list to backoff_tagger. This code can be found in

from nltk.tag import NgramTagger

class QuadgramTagger(NgramTagger):
  def __init__(self, *args, **kwargs):
    NgramTagger.__init__(self, 4, *args, **kwargs)

This is essentially how BigramTagger and TrigramTagger are implemented: simple subclasses of NgramTagger that pass in n. The context() method then takes the previous n-1 tags from its history argument and pairs them with the current word.
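The effect of n on the context key can be sketched in plain Python. The function below is modeled on the behavior described here rather than copied from NLTK; the name ngram_context and the sample tokens are assumptions for illustration.

```python
def ngram_context(n, tokens, index, history):
    """Return (tags of up to n-1 previous words, current word) as a context key."""
    prev_tags = tuple(history[max(0, index - n + 1):index])
    return (prev_tags, tokens[index])

tokens = ['we', 'tag', 'words', 'quickly']
history = ['PRP', 'VBP', 'NNS']  # tags already assigned to the first three words

print(ngram_context(2, tokens, 3, history))  # (('NNS',), 'quickly')
print(ngram_context(3, tokens, 3, history))  # (('VBP', 'NNS'), 'quickly')
print(ngram_context(4, tokens, 3, history))  # (('PRP', 'VBP', 'NNS'), 'quickly')

# Near the start of a sentence, fewer previous tags are available.
print(ngram_context(4, tokens, 1, history[:1]))  # (('PRP',), 'tag')
```

Each increase in n makes the key more specific, so a given key is seen less often in training.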

Now, let's see how it does as part of a backoff chain.

>>> from taggers import QuadgramTagger
>>> quadtagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger, QuadgramTagger], backoff=backoff)
>>> quadtagger.evaluate(test_sents)

It's actually slightly worse than before, when we stopped with the TrigramTagger. So, the lesson is that too much context can have a negative effect on accuracy.

See also

