Affix tagging

The AffixTagger class is another ContextTagger subclass, but this time the context is either the prefix or the suffix of a word. This means the AffixTagger class is able to learn tags based on fixed-length substrings at the beginning or end of a word.

How to do it...

The default arguments for the AffixTagger class specify three-character suffixes, and that words must be at least five characters long. If a word is shorter than five characters, then None is returned as the tag.

>>> from nltk.tag import AffixTagger
>>> tagger = AffixTagger(train_sents)
>>> tagger.evaluate(test_sents)
0.27558817181092166

So, with the default arguments, it tags about 27.6% of the test words correctly on its own. Let's try it by specifying three-character prefixes.

>>> prefix_tagger = AffixTagger(train_sents, affix_length=3)
>>> prefix_tagger.evaluate(test_sents)
0.23587308439456076

To learn on two-character suffixes, the code will look like this:

>>> suffix_tagger = AffixTagger(train_sents, affix_length=-2)
>>> suffix_tagger.evaluate(test_sents)
0.31940427368875457

How it works...

A positive value for affix_length means that the AffixTagger class will learn word prefixes, essentially word[:affix_length]. If affix_length is negative, then suffixes are learned using word[affix_length:].
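The slicing rule can be sketched in plain Python. This is only an illustration of the behavior described in this recipe, not NLTK's actual source; the affix_context function name is made up for this example, and the case of affix_length=0 is ignored here.

```python
def affix_context(word, affix_length=-3, min_stem_length=2):
    """Return the affix that would serve as context, or None if the word is too short.

    Hypothetical helper that mirrors the rule described in the text.
    """
    # Words shorter than stem + affix give no context at all.
    if len(word) < min_stem_length + abs(affix_length):
        return None
    if affix_length > 0:
        return word[:affix_length]   # prefix
    return word[affix_length:]       # suffix


print(affix_context("running"))                  # ing
print(affix_context("running", affix_length=3))  # run
print(affix_context("runs"))                     # None
```

With the defaults, "runs" is only four characters long, which is below the five-character minimum, so no context is produced for it.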

There's more...

You can combine multiple affix taggers in a backoff chain if you want to learn on affixes of multiple lengths. Here's an example of four AffixTagger instances learning on two- and three-character prefixes and suffixes:

>>> pre3_tagger = AffixTagger(train_sents, affix_length=3)
>>> pre3_tagger.evaluate(test_sents)
0.23587308439456076
>>> pre2_tagger = AffixTagger(train_sents, affix_length=2, backoff=pre3_tagger)
>>> pre2_tagger.evaluate(test_sents)
0.29786315562270665
>>> suf2_tagger = AffixTagger(train_sents, affix_length=-2, backoff=pre2_tagger)
>>> suf2_tagger.evaluate(test_sents)
0.32467083962875026
>>> suf3_tagger = AffixTagger(train_sents, affix_length=-3, backoff=suf2_tagger)
>>> suf3_tagger.evaluate(test_sents)
0.3590761925318368

As you can see, the accuracy goes up each time.

Note

The ordering in the previous block of code is not the best, nor is it the worst. I'll leave it to you to explore the possibilities and discover the best backoff chain of values for AffixTagger and affix_length.
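If you'd rather not try every ordering by hand, the search can be automated. The following is a rough sketch, not part of the original recipe: build_chain and rank_orderings are hypothetical helper names, and the tagger class is passed in as an argument so that nothing beyond the affix_length and backoff keywords and the evaluate() method used above is assumed.

```python
from itertools import permutations


def build_chain(tagger_cls, train_sents, affix_lengths):
    """Build a backoff chain; the first length becomes the last-resort tagger."""
    tagger = None
    for length in affix_lengths:
        # Each new tagger falls back on the chain built so far.
        tagger = tagger_cls(train_sents, affix_length=length, backoff=tagger)
    return tagger


def rank_orderings(tagger_cls, train_sents, test_sents,
                   affix_lengths=(3, 2, -2, -3)):
    """Return (accuracy, ordering) pairs for every ordering, best first."""
    results = []
    for order in permutations(affix_lengths):
        tagger = build_chain(tagger_cls, train_sents, order)
        results.append((tagger.evaluate(test_sents), order))
    return sorted(results, reverse=True)
```

Assuming train_sents and test_sents are defined as in the earlier recipes, calling rank_orderings(AffixTagger, train_sents, test_sents)[0] should give the best-scoring ordering and its accuracy. Four affix lengths yield 24 orderings, each of which trains four taggers, so expect it to take a little while.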

Working with min_stem_length

The AffixTagger class also takes a min_stem_length keyword argument, with a default value of 2. If the word length is less than min_stem_length plus the absolute value of affix_length, then None is returned by the context() method. Increasing min_stem_length forces the AffixTagger class to learn only on longer words, while decreasing min_stem_length allows it to learn on shorter words. Of course, for shorter words, the absolute value of affix_length could be equal to or greater than the word length, in which case the affix is the entire word and AffixTagger would essentially be acting like a UnigramTagger class.
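To get a feel for the effect of min_stem_length, here is a small plain-Python sketch of the length check described above. The learnable_words function is a hypothetical name used only for this illustration; it is not part of NLTK.

```python
def learnable_words(words, affix_length=-3, min_stem_length=2):
    """Keep only the words long enough to produce an affix context."""
    threshold = min_stem_length + abs(affix_length)
    return [word for word in words if len(word) >= threshold]


words = ["a", "the", "cats", "taggers"]

print(learnable_words(words))                     # ['taggers']
print(learnable_words(words, min_stem_length=0))  # ['the', 'cats', 'taggers']
print(learnable_words(words, min_stem_length=4))  # ['taggers']
```

With min_stem_length=0 and three-character suffixes, the word "the" is its own suffix, which is the unigram-like behavior mentioned above.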

See also

You can manually specify prefixes and suffixes using regular expressions, as shown in the previous recipe. The Training a unigram part-of-speech tagger and Training and combining ngram taggers recipes have details on NgramTagger subclasses, which are also subclasses of ContextTagger.
