Using WordNet for tagging

If you remember from the Looking up Synsets for a word in WordNet recipe in Chapter 1, Tokenizing Text and WordNet Basics, WordNet Synsets specify a part-of-speech tag. It's a very restricted set of possible tags, and many words have multiple Synsets with different part-of-speech tags, but this information can be useful for tagging unknown words. WordNet is essentially a giant dictionary, and it's likely to contain many words that are not in your training data.

Getting ready

First, we need to decide how to map WordNet part-of-speech tags to the Penn Treebank part-of-speech tags we've been using. The following table maps one to the other. See the Looking up Synsets for a word in WordNet recipe in Chapter 1, Tokenizing Text and WordNet Basics, for more details. The s tag, which was not shown before, marks a satellite adjective; for tagging purposes, it is just another kind of adjective.

WordNet tag	Treebank tag
n	NN
a	JJ
s	JJ
r	RB
v	VB

How to do it...

Now we can create a class that will look up words in WordNet, and then choose the most common tag from the Synsets it finds. The WordNetTagger class defined in the following code can be found in taggers.py:

from nltk.tag import SequentialBackoffTagger
from nltk.corpus import wordnet
from nltk.probability import FreqDist

class WordNetTagger(SequentialBackoffTagger):
  '''
  >>> wt = WordNetTagger()
  >>> wt.tag(['food', 'is', 'great'])
  [('food', 'NN'), ('is', 'VB'), ('great', 'JJ')]
  '''
  def __init__(self, *args, **kwargs):
    SequentialBackoffTagger.__init__(self, *args, **kwargs)

    self.wordnet_tag_map = {
      'n': 'NN',
      's': 'JJ',
      'a': 'JJ',
      'r': 'RB',
      'v': 'VB'
    }

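  def choose_tag(self, tokens, index, history):
    word = tokens[index]
    fd = FreqDist()

    # Count how many of the word's Synsets carry each WordNet tag.
    for synset in wordnet.synsets(word):
      fd[synset.pos()] += 1

    # Words with no Synsets get no tag, so a backoff tagger can handle them;
    # calling fd.max() on an empty FreqDist raises ValueError.
    if not fd:
      return None

    return self.wordnet_tag_map.get(fd.max())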

Tip

Another way the FreqDist API has changed between NLTK2 and NLTK3 is that the inc() method has been removed. Instead, you must use fd[key] += 1. Since FreqDist inherits from collections.Counter, it's fine if fd[key] doesn't exist the first time you increment it; a missing key counts as 0.
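The counting behavior described in the tip can be tried without NLTK or the WordNet corpus installed. The following is a minimal sketch that uses collections.Counter directly, with a hand-written list of WordNet-style tags standing in for the result of a Synset lookup. The function name most_common_treebank_tag and the sample tag lists are assumptions made for illustration; they are not part of taggers.py.

```python
from collections import Counter

# Same WordNet-to-Treebank mapping as the table in "Getting ready".
WORDNET_TAG_MAP = {'n': 'NN', 'a': 'JJ', 's': 'JJ', 'r': 'RB', 'v': 'VB'}

def most_common_treebank_tag(wordnet_tags):
  """Return the Treebank tag for the most frequent WordNet tag, or None.

  wordnet_tags is a hypothetical stand-in for the pos() values of a
  word's Synsets.
  """
  counts = Counter()
  for tag in wordnet_tags:
    # No KeyError on the first increment: a missing key counts as 0.
    counts[tag] += 1
  if not counts:
    # Nothing to vote on, so return no tag at all.
    return None
  best_tag, _ = counts.most_common(1)[0]
  return WORDNET_TAG_MAP.get(best_tag)

# Invented tag lists, not real WordNet output.
print(most_common_treebank_tag(['n', 'n', 'v']))       # NN
print(most_common_treebank_tag(['s', 's', 'a', 'n']))  # JJ
print(most_common_treebank_tag([]))                    # None
```

Returning None for the empty case is a design choice in this sketch: in a backoff chain, a tagger that has no answer should say so rather than guess.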

How it works...

The WordNetTagger class simply counts the number of each part-of-speech tag found in the Synsets for a word. The most common tag is then mapped to a treebank tag using an internal mapping. Here's some sample usage code:

>>> from taggers import WordNetTagger
>>> wn_tagger = WordNetTagger()
>>> wn_tagger.evaluate(train_sents)
0.17914876598160262

So, it's not too accurate, but that's to be expected. We only have enough information to produce four different kinds of tags, while there are 36 possible tags in treebank. There are many words that can have different part-of-speech tags depending on their context. But if we put the WordNetTagger class at the end of an NgramTagger backoff chain, then we can improve accuracy over the DefaultTagger class.

>>> from tag_util import backoff_tagger
>>> from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
>>> tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=wn_tagger)
>>> tagger.evaluate(test_sents)
0.8848262464925534
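To make the chaining concrete, here is a rough sketch of how a helper like backoff_tagger could build the chain. The real helper lives in tag_util.py and is not shown in this recipe, so the function body below is an assumption. ToyLookupTagger and ToyFallback are hypothetical stand-ins for the NLTK tagger classes and for wn_tagger, so the example runs without NLTK or a corpus.

```python
class ToyLookupTagger:
  """Hypothetical stand-in for an NLTK tagger: a word-to-tag table plus a backoff."""

  def __init__(self, train_sents, backoff=None):
    self.backoff = backoff
    self.table = {}
    for sent in train_sents:
      for word, tag in sent:
        # Last tag seen wins here; real taggers count frequencies.
        self.table[word] = tag

  def tag_word(self, word):
    tag = self.table.get(word)
    if tag is None and self.backoff is not None:
      # Unknown word: defer to the next tagger in the chain.
      return self.backoff.tag_word(word)
    return tag


class ToyFallback:
  """Hypothetical last-resort tagger, playing the role of wn_tagger."""
  backoff = None

  def tag_word(self, word):
    return 'NN'


def backoff_tagger(train_sents, tagger_classes, backoff=None):
  # Assumed implementation: each new tagger wraps the previous one as its
  # backoff, so the last class in the list is consulted first.
  for cls in tagger_classes:
    backoff = cls(train_sents, backoff=backoff)
  return backoff


train = [[('food', 'NN'), ('is', 'VBZ')]]
chain = backoff_tagger(train, [ToyLookupTagger, ToyLookupTagger],
                       backoff=ToyFallback())
print(chain.tag_word('is'))     # VBZ, found in the first table consulted
print(chain.tag_word('pizza'))  # NN, falls through to the last-resort tagger
```

The point of the sketch is the ordering: the tagger passed as backoff sits at the very end of the chain, so it only sees words that every trained tagger failed to tag.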
