Using WordNet for tagging

As described in the Looking up Synsets for a word in WordNet recipe in Chapter 1, Tokenizing Text and WordNet Basics, each WordNet Synset specifies a part-of-speech tag. The set of possible tags is very restricted, and many words have multiple Synsets with different part-of-speech tags, but this information can still be useful for tagging unknown words. WordNet is essentially a giant dictionary, and it's likely to contain many words that are not in your training data.

Getting ready

First, we need to decide how to map WordNet part-of-speech tags to the Penn Treebank part-of-speech tags we've been using. The following is a table mapping one to the other. See the Looking up Synsets for a word in WordNet recipe in Chapter 1, Tokenizing Text and WordNet Basics, for more details. The s tag, which was not shown before, marks a satellite adjective; for tagging purposes, it is just another kind of adjective.

WordNet tag    Treebank tag
n              NN
a              JJ
s              JJ
r              RB
v              VB

How to do it...

Now we can create a class that will look up words in WordNet, and then choose the most common tag from the Synsets it finds. The WordNetTagger class defined in the following code can be found in taggers.py:

from nltk.tag import SequentialBackoffTagger
from nltk.corpus import wordnet
from nltk.probability import FreqDist

class WordNetTagger(SequentialBackoffTagger):
  '''
  >>> wt = WordNetTagger()
  >>> wt.tag(['food', 'is', 'great'])
  [('food', 'NN'), ('is', 'VB'), ('great', 'JJ')]
  '''
  def __init__(self, *args, **kwargs):
    SequentialBackoffTagger.__init__(self, *args, **kwargs)

    self.wordnet_tag_map = {
      'n': 'NN',
      's': 'JJ',
      'a': 'JJ',
      'r': 'RB',
      'v': 'VB'
    }

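  def choose_tag(self, tokens, index, history):
    word = tokens[index]
    fd = FreqDist()

    # Count how often each WordNet part-of-speech tag occurs among the Synsets.
    for synset in wordnet.synsets(word):
      fd[synset.pos()] += 1

    # A word with no Synsets leaves fd empty, and fd.max() raises ValueError on
    # an empty FreqDist, so return None to defer to any backoff tagger.
    if not fd:
      return None

    return self.wordnet_tag_map.get(fd.max())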

Tip

Another way the FreqDist API has changed between NLTK2 and NLTK3 is that the inc() method has been removed. Instead, you must use fd[key] += 1. Since FreqDist inherits from collections.Counter, this works even if fd[key] doesn't exist the first time you increment it.
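
The counting step can be tried without NLTK or the WordNet corpus installed. The following is a minimal sketch that uses collections.Counter in place of FreqDist and hand-written lists of WordNet tags in place of real Synset lookups; the majority_treebank_tag function and the sample lists are hypothetical names and data introduced only for illustration.

```python
from collections import Counter

# Same mapping as WordNetTagger.wordnet_tag_map.
WORDNET_TAG_MAP = {'n': 'NN', 's': 'JJ', 'a': 'JJ', 'r': 'RB', 'v': 'VB'}

def majority_treebank_tag(synset_pos_tags):
    """Return the Treebank tag for the most frequent WordNet tag, or None."""
    counts = Counter()
    for pos in synset_pos_tags:
        counts[pos] += 1  # a missing key starts at 0, so no setup is needed
    if not counts:
        return None  # no Synsets means no evidence for any tag
    most_common_pos, _ = counts.most_common(1)[0]
    return WORDNET_TAG_MAP.get(most_common_pos)

# Hypothetical tag lists, written by hand rather than looked up in WordNet.
print(majority_treebank_tag(['n', 'n', 'v']))       # NN
print(majority_treebank_tag(['s', 's', 'a', 'r']))  # JJ
print(majority_treebank_tag([]))                    # None
```

Note that the s and a tags are counted separately and only mapped to JJ after the most common tag has been chosen, which mirrors what the class does.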

How it works...

The WordNetTagger class simply counts the number of each part-of-speech tag found in the Synsets for a word. The most common tag is then mapped to a Treebank tag using an internal mapping. Here's some sample usage code:

>>> from taggers import WordNetTagger
>>> wn_tagger = WordNetTagger()
>>> wn_tagger.evaluate(train_sents)
0.17914876598160262

So, it's not too accurate, but that's to be expected. We only have enough information to produce four different kinds of tags (NN, JJ, RB, and VB), while there are 36 possible tags in the treebank. Many words can also take different part-of-speech tags depending on their context, which a WordNet lookup ignores. But if we put the WordNetTagger class at the end of an NgramTagger backoff chain, then we can improve accuracy over the DefaultTagger class.

>>> from tag_util import backoff_tagger
>>> from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
>>> tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=wn_tagger)
>>> tagger.evaluate(test_sents)
0.8848262464925534
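
To make the backoff behavior concrete, here is a toy sketch in plain Python. It is not how NLTK implements SequentialBackoffTagger (which passes tokens, index, and history to each tagger); it only illustrates the idea that each tagger in the chain is tried in turn and the first non-None answer wins. The tag_with_backoff function and both lookup dictionaries are hypothetical stand-ins for a trained tagger and a dictionary-based tagger.

```python
def tag_with_backoff(word, taggers, default=None):
    """Try each tagging function in order; return the first non-None tag."""
    for choose_tag in taggers:
        tag = choose_tag(word)
        if tag is not None:
            return tag
    return default

# Hypothetical stand-ins: a tiny "trained" lookup and a dictionary lookup.
trained_tags = {'the': 'DT', 'dog': 'NN'}
dictionary_tags = {'dog': 'NN', 'quickly': 'RB'}

# The trained lookup is consulted first; the dictionary is the last resort.
chain = [trained_tags.get, dictionary_tags.get]

print(tag_with_backoff('the', chain))      # DT
print(tag_with_backoff('quickly', chain))  # RB
print(tag_with_backoff('zzz', chain))      # None
```

Placing the dictionary-style lookup last means it is only consulted for words the trained taggers cannot handle, which is the role the WordNetTagger class plays in the chain above.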

See also

The Looking up Synsets for a word in WordNet recipe in Chapter 1, Tokenizing Text and WordNet Basics, details how to use the wordnet corpus and what kinds of part-of-speech tags it knows about. The Combining taggers with backoff tagging and Training and combining ngram taggers recipes cover backoff tagging with ngram taggers.
