If you remember from the Looking up Synsets for a word in WordNet recipe in Chapter 1, Tokenizing Text and WordNet Basics, WordNet Synsets specify a part-of-speech tag. It's a very restricted set of possible tags, and many words have multiple Synsets with different part-of-speech tags, but this information can be useful for tagging unknown words. WordNet is essentially a giant dictionary, and it's likely to contain many words that are not in your training data.
First, we need to decide how to map WordNet part-of-speech tags to the Penn Treebank part-of-speech tags we've been using. The following is a table mapping one to the other. See the Looking up Synsets for a word in WordNet recipe in Chapter 1, Tokenizing Text and WordNet Basics, for more details. The s tag, which was not shown before, is just another kind of adjective, at least for tagging purposes.

| WordNet tag | Treebank tag |
|---|---|
| n | NN |
| a | JJ |
| s | JJ |
| r | RB |
| v | VB |

Now we can create a class that will look up words in WordNet, and then choose the most common tag from the Synsets it finds. The WordNetTagger class defined in the following code can be found in taggers.py:

    from nltk.tag import SequentialBackoffTagger
    from nltk.corpus import wordnet
    from nltk.probability import FreqDist

    class WordNetTagger(SequentialBackoffTagger):
        '''
        >>> wt = WordNetTagger()
        >>> wt.tag(['food', 'is', 'great'])
        [('food', 'NN'), ('is', 'VB'), ('great', 'JJ')]
        '''
        def __init__(self, *args, **kwargs):
            SequentialBackoffTagger.__init__(self, *args, **kwargs)
            # Map WordNet part-of-speech letters to Penn Treebank tags.
            self.wordnet_tag_map = {
                'n': 'NN',
                's': 'JJ',
                'a': 'JJ',
                'r': 'RB',
                'v': 'VB'
            }

        def choose_tag(self, tokens, index, history):
            word = tokens[index]
            fd = FreqDist()
            # Count how many Synsets of the word carry each WordNet tag.
            for synset in wordnet.synsets(word):
                fd[synset.pos()] += 1
            # A word with no Synsets leaves fd empty, and fd.max() raises
            # ValueError on an empty FreqDist, so return None instead.
            if not fd:
                return None
            return self.wordnet_tag_map.get(fd.max())

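To look at the voting step of choose_tag in isolation, the following minimal sketch replaces the WordNet lookup with a hand-written list of part-of-speech letters, one per imaginary Synset. The function name most_common_treebank_tag and the list toy_pos are hypothetical and exist only for this illustration; it uses collections.Counter rather than FreqDist so that it runs without NLTK or the WordNet data.

```python
from collections import Counter

# Same mapping as the table above: WordNet POS letter -> Treebank tag.
WORDNET_TAG_MAP = {'n': 'NN', 's': 'JJ', 'a': 'JJ', 'r': 'RB', 'v': 'VB'}


def most_common_treebank_tag(pos_letters):
    """Return the Treebank tag for the most frequent WordNet POS letter,
    or None when there is nothing to count (an unknown word)."""
    counts = Counter(pos_letters)
    if not counts:
        return None
    letter, _ = counts.most_common(1)[0]
    return WORDNET_TAG_MAP.get(letter)


# Hypothetical POS letters, one per Synset, for an imaginary word.
toy_pos = ['n', 'n', 'v', 'n', 'a']
print(most_common_treebank_tag(toy_pos))  # NN
print(most_common_treebank_tag([]))       # None
```

Note that the vote is taken over the WordNet letters before mapping, so a and s are counted separately even though both map to JJ.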
The WordNetTagger class simply counts the number of each part-of-speech tag found in the Synsets for a word. The most common tag is then mapped to a treebank tag using an internal mapping. Here's some sample usage code:

    >>> from taggers import WordNetTagger
    >>> wn_tagger = WordNetTagger()
    >>> wn_tagger.evaluate(train_sents)
    0.17914876598160262

So, it's not too accurate, but that's to be expected. We only have enough information to produce four different kinds of tags, while there are 36 possible tags in treebank, and many words can have different part-of-speech tags depending on their context. But if we put the WordNetTagger class at the end of an NgramTagger backoff chain, then we can improve accuracy over the DefaultTagger class:

    >>> from tag_util import backoff_tagger
    >>> from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
    >>> tagger = backoff_tagger(train_sents, [UnigramTagger, BigramTagger, TrigramTagger], backoff=wn_tagger)
    >>> tagger.evaluate(test_sents)
    0.8848262464925534

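The idea behind placing a dictionary-style tagger last in a backoff chain can be sketched without NLTK. In the toy chain below, each tagger is just a function that returns a tag or None, and the first answer that is not None wins. The names tag_with_backoff, seen_in_training, and dictionary_guess, along with their tiny hand-made dictionaries, are assumptions made up for this illustration; they are not part of NLTK or tag_util, and the real taggers consider context as well as the word itself.

```python
def tag_with_backoff(word, taggers, default=None):
    """Try each tagger in order and return the first tag that is not None."""
    for choose in taggers:
        tag = choose(word)
        if tag is not None:
            return tag
    return default


# Hypothetical stand-in for a tagger trained on a small corpus.
seen_in_training = {'the': 'DT', 'dog': 'NN', 'runs': 'VBZ'}
# Hypothetical stand-in for a WordNet-based lookup.
dictionary_guess = {'dog': 'NN', 'aardvark': 'NN', 'quickly': 'RB'}

# The trained tagger is asked first; the dictionary is the last resort.
chain = [seen_in_training.get, dictionary_guess.get]

words = ['the', 'aardvark', 'runs', 'quickly', 'zzz']
tagged = [(w, tag_with_backoff(w, chain)) for w in words]
print(tagged)
# [('the', 'DT'), ('aardvark', 'NN'), ('runs', 'VBZ'), ('quickly', 'RB'), ('zzz', None)]
```

Words the trained lookup has never seen (aardvark, quickly) still get a coarse tag from the dictionary, while a word unknown to both stays untagged.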
The Looking up Synsets for a word in WordNet recipe in Chapter 1, Tokenizing Text and WordNet Basics, details how to use the wordnet corpus and what kinds of part-of-speech tags it knows about. And in the Combining taggers with backoff tagging and Training and combining ngram taggers recipes, we went over backoff tagging with ngram taggers.