Tagging with regular expressions

You can use regular expression matching to tag words. For example, you can match numbers with \d to assign the tag CD (which refers to a cardinal number). Or you could match on known word patterns, such as the suffix "ing". There's a lot of flexibility here, but be careful of over-specifying: language is naturally inexact, and there are always exceptions to the rule.

Getting ready

For this recipe to make sense, you should be familiar with the regular expression syntax and Python's re module.

How to do it...

The RegexpTagger class expects a list of 2-tuples, where the first element in each tuple is a regular expression and the second element is the tag. The patterns shown in the following code can be found in tag_util.py:

patterns = [
  (r'^\d+$', 'CD'),    # cardinal numbers, e.g. 42
  (r'.*ing$', 'VBG'),  # gerunds, e.g. wondering
  (r'.*ment$', 'NN'),  # e.g. wonderment
  (r'.*ful$', 'JJ')    # e.g. wonderful
]

Once you've constructed this list of patterns, you can pass it into RegexpTagger.

>>> from tag_util import patterns
>>> from nltk.tag import RegexpTagger
>>> tagger = RegexpTagger(patterns)
>>> tagger.evaluate(test_sents)
0.037470321605870924

With only a few patterns the accuracy is poor, but because RegexpTagger is a subclass of SequentialBackoffTagger, it can still be a useful part of a backoff chain. For example, it could be positioned just before a DefaultTagger, to tag words that the n-gram tagger(s) missed.
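As a rough illustration of the backoff idea, the following sketch chains a RegexpTagger in front of a DefaultTagger. The patterns are repeated inline so the snippet stands alone, and the choice of 'NN' as the fallback tag and the sample words are assumptions made for this example, not something the recipe prescribes.

```python
from nltk.tag import DefaultTagger, RegexpTagger

# Same patterns as in the recipe, defined inline so the sketch is self-contained.
patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
]

# Assumption: 'NN' is the fallback tag for any word no pattern matches.
default_tagger = DefaultTagger('NN')
regexp_tagger = RegexpTagger(patterns, backoff=default_tagger)

# 'cat' matches none of the patterns, so it falls through to the DefaultTagger.
tagged = regexp_tagger.tag(['42', 'wondering', 'wonderful', 'cat'])
print(tagged)
```

In a longer chain, an n-gram tagger would typically take regexp_tagger as its own backoff, so the regular expressions are only consulted for words the trained tagger could not handle.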

How it works...

The RegexpTagger class saves the patterns given at initialization, then on each call to choose_tag(), it iterates over the patterns and returns the tag for the first expression that matches the current word using re.match(). This means that if you have two expressions that could match, the tag of the first one will always be returned, and the second expression won't even be tried.
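The first-match behavior can be sketched in plain Python. This is a simplified stand-in rather than NLTK's actual source: the function name first_matching_tag and the overlapping '.*ng$' pattern are hypothetical, introduced only to show that pattern order decides the result.

```python
import re

def first_matching_tag(patterns, word):
    """Return the tag of the first pattern that matches word, or None."""
    for regexp, tag in patterns:
        # re.match() anchors at the start of the word, as described above.
        if re.match(regexp, word):
            return tag
    return None

patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ng$', 'NN'),  # hypothetical pattern that overlaps with '.*ing$'
]

# 'wondering' matches both '.*ing$' and '.*ng$'; the earlier pattern wins.
print(first_matching_tag(patterns, 'wondering'))  # VBG
# 'song' only matches '.*ng$'.
print(first_matching_tag(patterns, 'song'))       # NN
# No pattern matches, so there is no tag to return.
print(first_matching_tag(patterns, 'cat'))        # None
```

A result of None is what lets a backoff tagger take over for words the patterns don't cover.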

There's more...

The RegexpTagger class can replace the DefaultTagger class if you give it a catch-all pattern such as (r'.*', 'NN'). This pattern should, of course, be last in the list of patterns; otherwise it matches every word first and none of the patterns after it will ever be tried.
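To see why the position of the catch-all matters, here is a small sketch that builds two taggers from the same patterns in different orders. The variable names and sample words are assumptions for illustration only.

```python
from nltk.tag import RegexpTagger

catch_all = (r'.*', 'NN')
specific = [(r'^\d+$', 'CD'), (r'.*ing$', 'VBG')]
words = ['42', 'wondering', 'cat']

# Catch-all last: the specific patterns get a chance first.
catch_all_last = RegexpTagger(specific + [catch_all]).tag(words)
print(catch_all_last)   # [('42', 'CD'), ('wondering', 'VBG'), ('cat', 'NN')]

# Catch-all first: it matches every word, so nothing else is ever tried.
catch_all_first = RegexpTagger([catch_all] + specific).tag(words)
print(catch_all_first)  # [('42', 'NN'), ('wondering', 'NN'), ('cat', 'NN')]
```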
