Tagging with regular expressions

You can use regular expression matching to tag words. For example, you can match numbers with \d+ to assign the tag CD (which refers to a cardinal number). Or you could match on known word patterns, such as the suffix "ing". There's a lot of flexibility here, but be careful of over-specifying: language is naturally inexact, and there are always exceptions to the rule.

Getting ready

For this recipe to make sense, you should be familiar with the regular expression syntax and Python's re module.

How to do it...

The RegexpTagger class expects a list of 2-tuples, where the first element of each tuple is a regular expression and the second element is the tag. The patterns shown in the following code can be found in tag_util.py:

patterns = [
  (r'^\d+$', 'CD'),   # cardinal numbers, e.g. 42
  (r'.*ing$', 'VBG'), # gerunds, e.g. wondering
  (r'.*ment$', 'NN'), # e.g. wonderment
  (r'.*ful$', 'JJ')   # e.g. wonderful
]
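Before handing the patterns to a tagger, it can help to check them against a few sample words using only Python's re module. The following is a minimal sketch, not part of tag_util.py; the helper name matching_tags and the sample words are made up for illustration.

```python
import re

# Same patterns as above, repeated so this snippet runs on its own.
patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
]

def matching_tags(word):
    """Return the tag of every pattern that matches word, in list order."""
    return [tag for regex, tag in patterns if re.match(regex, word)]

# 'thing' is not a gerund but still ends in "ing", so it picks up VBG.
# 'the' matches nothing at all.
for word in ['42', 'wondering', 'wonderment', 'wonderful', 'thing', 'the']:
    print(word, matching_tags(word))
```

The 'thing' case is the kind of exception mentioned in the introduction: a suffix pattern is a rough heuristic, not a rule.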

Once you've constructed this list of patterns, you can pass it into RegexpTagger.

>>> from tag_util import patterns
>>> from nltk.tag import RegexpTagger
>>> tagger = RegexpTagger(patterns)
>>> tagger.evaluate(test_sents)
0.037470321605870924

So, with just a few patterns, accuracy is under 4%, which is not great on its own. But since RegexpTagger is a subclass of SequentialBackoffTagger, it can be a useful part of a backoff chain. For example, it could be positioned just before a DefaultTagger class, to tag words that the ngram tagger(s) missed.
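As a rough sketch of the backoff idea, the following chains a RegexpTagger to a DefaultTagger through the backoff keyword argument. The patterns are repeated inline so the snippet does not depend on tag_util.py, and the sample words are invented. The ngram taggers that would normally sit in front of this chain are left out, since they need training data.

```python
from nltk.tag import DefaultTagger, RegexpTagger

patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
]

# Words that match no pattern fall through to the default tag NN.
default_tagger = DefaultTagger('NN')
tagger = RegexpTagger(patterns, backoff=default_tagger)

tagged = tagger.tag(['42', 'wondering', 'cats'])
print(tagged)
# [('42', 'CD'), ('wondering', 'VBG'), ('cats', 'NN')]
```

Without the backoff argument, 'cats' would be tagged None, because no pattern matches it.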

How it works...

The RegexpTagger class saves the patterns given at initialization, then on each call to choose_tag(), it iterates over the patterns and returns the tag for the first expression that matches the current word using re.match(). This means that if you have two expressions that could match, the tag of the first one will always be returned, and the second expression won't even be tried.
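The first-match behavior described above can be approximated in a few lines of plain Python. This is a simplified sketch of the idea, not NLTK's actual source; the function name first_match_tag and the second pattern (r'.*ng$', 'NN') are hypothetical and exist only to create an overlap.

```python
import re

def first_match_tag(word, patterns):
    """Return the tag of the first pattern that matches word, else None."""
    for regex, tag in patterns:
        # re.match() anchors only at the start of the string, which is
        # why the suffix patterns begin with '.*'.
        if re.match(regex, word):
            return tag
    return None

# Both patterns match 'wondering', so list order decides the tag.
overlapping = [(r'.*ing$', 'VBG'), (r'.*ng$', 'NN')]
swapped = list(reversed(overlapping))

print(first_match_tag('wondering', overlapping))  # VBG
print(first_match_tag('wondering', swapped))      # NN
print(first_match_tag('song', overlapping))       # NN
print(first_match_tag('cat', overlapping))        # None
```

Putting the more specific pattern first is what keeps 'wondering' tagged as VBG here.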

There's more...

The RegexpTagger class can replace the DefaultTagger class if you give it a catch-all pattern such as (r'.*', 'NN'). This pattern should, of course, be last in the list of patterns. Because it matches every word, any pattern listed after it will never be tried.
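To see why the position of the catch-all matters, here is a small sketch that builds two taggers from the same patterns, one with (r'.*', 'NN') at the end and one with it at the front. The variable names and sample words are illustrative only.

```python
from nltk.tag import RegexpTagger

patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
]
catch_all = (r'.*', 'NN')
words = ['42', 'wonderful', 'cats']

# Catch-all last: specific patterns get their chance first.
last_tagger = RegexpTagger(patterns + [catch_all])
print(last_tagger.tag(words))
# [('42', 'CD'), ('wonderful', 'JJ'), ('cats', 'NN')]

# Catch-all first: it matches everything, so every word becomes NN.
first_tagger = RegexpTagger([catch_all] + patterns)
print(first_tagger.tag(words))
# [('42', 'NN'), ('wonderful', 'NN'), ('cats', 'NN')]
```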

See also

In the next recipe, we'll cover the AffixTagger class, which learns how to tag based on prefixes and suffixes of words. See the Default tagging recipe for details on the DefaultTagger class.
