You can use regular expression matching to tag words. For example, you can match numbers with \d+ to assign the tag CD (which refers to a cardinal number). Or you could match on known word patterns, such as the suffix "ing". There's a lot of flexibility here, but be careful not to over-specify: language is naturally inexact, and there are always exceptions to the rule.
For this recipe to make sense, you should be familiar with regular expression syntax and Python's re module.
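As a quick illustration of both the flexibility and the risk, the following sketch applies a number pattern and an "ing" suffix pattern to a handful of words using only the re module. The word list is made up for this example; note that "king" and "thing" match the suffix pattern even though neither is a gerund.

```python
import re

# Illustrative words (not from any corpus); "king" and "thing" are nouns
# that happen to end in "ing".
words = ["42", "3rd", "running", "king", "thing"]

number = re.compile(r"^\d+$")        # the whole word must consist of digits
ing_suffix = re.compile(r".*ing$")   # any word ending in "ing"

number_hits = [w for w in words if number.match(w)]
ing_hits = [w for w in words if ing_suffix.match(w)]

print(number_hits)  # ['42']
print(ing_hits)     # ['running', 'king', 'thing']
```

A suffix pattern alone cannot tell a gerund from a noun, which is why a regular expression tagger works best as one link in a chain rather than on its own.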
The RegexpTagger class expects a list of 2-tuples, where the first element of each tuple is a regular expression and the second element is the tag. The patterns shown in the following code can be found in tag_util.py:
patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),  # gerunds, e.g. wondering
    (r'.*ment$', 'NN'),  # e.g. wonderment
    (r'.*ful$', 'JJ'),   # e.g. wonderful
]
Once you've constructed this list of patterns, you can pass it into RegexpTagger.
>>> from tag_util import patterns
>>> from nltk.tag import RegexpTagger
>>> tagger = RegexpTagger(patterns)
>>> tagger.evaluate(test_sents)
0.037470321605870924
An accuracy of under 4% is poor, which is unsurprising with only four patterns. But since RegexpTagger is a subclass of SequentialBackoffTagger, it can be a useful part of a backoff chain. For example, it could be positioned just before a DefaultTagger, to tag words that the n-gram tagger(s) missed.
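A minimal sketch of that arrangement, assuming NLTK is installed: the RegexpTagger is given a DefaultTagger as its backoff, so any word that matches no pattern falls through to the default tag. The pattern list repeats the one above, and the sample words are made up for illustration.

```python
from nltk.tag import DefaultTagger, RegexpTagger

patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
]

# Words that match none of the patterns are passed on to the DefaultTagger.
default_tagger = DefaultTagger('NN')
regexp_tagger = RegexpTagger(patterns, backoff=default_tagger)

tagged = regexp_tagger.tag(['42', 'wondering', 'wonderful', 'table'])
print(tagged)
# [('42', 'CD'), ('wondering', 'VBG'), ('wonderful', 'JJ'), ('table', 'NN')]
```

In a longer chain, an n-gram tagger trained on your corpus would in turn take regexp_tagger as its own backoff argument.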
The RegexpTagger class saves the patterns given at initialization; then, on each call to choose_tag(), it iterates over the patterns and returns the tag for the first expression that matches the current word using re.match(). This means that if two expressions could match, the tag of the first one will always be returned, and the second expression won't even be tried.
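The first-match behavior can be sketched in plain Python. This is not NLTK's actual implementation: first_matching_tag is a hypothetical helper that only mirrors the loop described above, and the second pattern is an arbitrary one chosen so that both patterns match the same word.

```python
import re

def first_matching_tag(patterns, word):
    """Hypothetical helper: return the tag of the first pattern that matches."""
    for regexp, tag in patterns:
        if re.match(regexp, word):
            return tag  # stop here; later patterns are never tried
    return None  # no pattern matched

# Both patterns match "running"; only their order differs between the lists.
patterns = [(r'.*ing$', 'VBG'), (r'.*ng$', 'NN')]
reversed_patterns = list(reversed(patterns))

first = first_matching_tag(patterns, 'running')
second = first_matching_tag(reversed_patterns, 'running')

print(first)   # VBG
print(second)  # NN
```

Because order decides the result, it usually makes sense to list the most specific patterns first.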
The RegexpTagger class can replace the DefaultTagger class if you give it a pattern such as (r'.*', 'NN'). This pattern should, of course, be last in the list of patterns; otherwise, it will match every word first and no other pattern will ever be tried.
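To see why the position of the catch-all matters, here is a small sketch, assuming NLTK is installed, that builds two taggers differing only in where the (r'.*', 'NN') pattern sits. The two-pattern lists and sample words are made up for illustration.

```python
from nltk.tag import RegexpTagger

# Catch-all last: the number pattern still gets a chance to match.
catch_all_last = RegexpTagger([(r'^\d+$', 'CD'), (r'.*', 'NN')])

# Catch-all first: it matches every word, so the number pattern is never reached.
catch_all_first = RegexpTagger([(r'.*', 'NN'), (r'^\d+$', 'CD')])

words = ['42', 'table']
tagged_last = catch_all_last.tag(words)
tagged_first = catch_all_first.tag(words)

print(tagged_last)   # [('42', 'CD'), ('table', 'NN')]
print(tagged_first)  # [('42', 'NN'), ('table', 'NN')]
```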