Using the included names corpus, we can create a simple tagger that tags names as proper nouns. The NamesTagger class is a subclass of SequentialBackoffTagger, as it's probably only useful near the end of a backoff chain. At initialization, we create a set of all names in the names corpus, lowercasing each name to make lookup easier. Then, we implement the choose_tag() method, which simply checks whether the lowercased current word is in the name_set set. If it is, we return the NNP tag (the tag for proper nouns). If it isn't, we return None, so the next tagger in the chain can tag the word. The following code can be found in taggers.py:
from nltk.tag import SequentialBackoffTagger
from nltk.corpus import names

class NamesTagger(SequentialBackoffTagger):
    def __init__(self, *args, **kwargs):
        SequentialBackoffTagger.__init__(self, *args, **kwargs)
        self.name_set = set([n.lower() for n in names.words()])

    def choose_tag(self, tokens, index, history):
        word = tokens[index]
        if word.lower() in self.name_set:
            return 'NNP'
        else:
            return None
The NamesTagger class should be pretty self-explanatory. The usage is also simple.
>>> from taggers import NamesTagger
>>> nt = NamesTagger()
>>> nt.tag(['Jacob'])
[('Jacob', 'NNP')]
It's probably best to use the NamesTagger class right before a DefaultTagger, so it sits at the end of a backoff chain. It could probably go anywhere in the chain, though, since it's unlikely to mistag a word.
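The chaining described above might look like the following. This is only a minimal sketch, assuming NLTK is installed. To keep it runnable without downloading the names corpus, it swaps names.words() for a small made-up set (SAMPLE_NAMES), and the SampleNamesTagger class is a hypothetical stand-in for NamesTagger, not part of taggers.py.

```python
from nltk.tag import DefaultTagger, SequentialBackoffTagger

# Hypothetical stand-in for names.words(), so the sketch runs
# without downloading the names corpus.
SAMPLE_NAMES = {'jacob', 'alice', 'maria'}


class SampleNamesTagger(SequentialBackoffTagger):
    """Same lookup logic as NamesTagger, backed by SAMPLE_NAMES."""

    def __init__(self, *args, **kwargs):
        SequentialBackoffTagger.__init__(self, *args, **kwargs)
        self.name_set = SAMPLE_NAMES

    def choose_tag(self, tokens, index, history):
        # Returning None hands the word to the next tagger in the chain.
        if tokens[index].lower() in self.name_set:
            return 'NNP'
        return None


# DefaultTagger is the last resort: it tags every word it sees as 'NN'.
default = DefaultTagger('NN')
chain = SampleNamesTagger(backoff=default)

tagged = chain.tag(['Jacob', 'writes', 'code'])
print(tagged)  # [('Jacob', 'NNP'), ('writes', 'NN'), ('code', 'NN')]
```

Without a backoff tagger, words that are not in the name set come back tagged as None, which is why a DefaultTagger (or some other tagger) usually follows a lookup tagger like this one.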