Expanding and removing chunks with regular expressions

There are three RegexpChunkRule subclasses that are not supported by RegexpChunkRule.fromstring() or RegexpParser, and therefore must be created manually if you want to use them. These rules are as follows:

  • ExpandLeftRule: Add unchunked (chink) words to the left of a chunk
  • ExpandRightRule: Add unchunked (chink) words to the right of a chunk
  • UnChunkRule: Unchunk any matching chunk

How to do it...

ExpandLeftRule and ExpandRightRule both take two patterns, plus a description, as arguments. For ExpandLeftRule, the left pattern matches the chink we want to add to the beginning of the chunk, and the right pattern matches the beginning of the chunk we want to expand. With ExpandRightRule, the left pattern matches the end of the chunk we want to expand, and the right pattern matches the chink we want to add to the end of the chunk. The idea is similar to the MergeRule class, except that here we're merging chink words into a chunk instead of merging two chunks.

UnChunkRule is the opposite of ChunkRule. Any chunk that exactly matches the UnChunkRule pattern will be unchunked and become a chink. Here's some code demonstrating the usage with the RegexpChunkParser class:

>>> from nltk.chunk.regexp import ChunkRule, ExpandLeftRule, ExpandRightRule, UnChunkRule
>>> from nltk.chunk import RegexpChunkParser
>>> ur = ChunkRule('<NN>', 'single noun')
>>> el = ExpandLeftRule('<DT>', '<NN>', 'get left determiner')
>>> er = ExpandRightRule('<NN>', '<NNS>', 'get right plural noun')
>>> un = UnChunkRule('<DT><NN.*>*', 'unchunk everything')
>>> chunker = RegexpChunkParser([ur, el, er, un])
>>> sent = [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')]
>>> chunker.parse(sent)
Tree('S', [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')])

You'll notice that the end result is a flat sentence, which is exactly what we started with. That's because the final UnChunkRule undid the chunk created by the previous rules. Read on to see what happened step by step.
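To see what the expansion rules produce when the final UnChunkRule is left out, here's a sketch using the same rules minus un (it assumes NLTK's default chunk label of NP for RegexpChunkParser):

```python
from nltk.chunk import RegexpChunkParser
from nltk.chunk.regexp import ChunkRule, ExpandLeftRule, ExpandRightRule

ur = ChunkRule('<NN>', 'single noun')
el = ExpandLeftRule('<DT>', '<NN>', 'get left determiner')
er = ExpandRightRule('<NN>', '<NNS>', 'get right plural noun')

# Same rules as before, but without the UnChunkRule,
# so the fully expanded chunk survives in the result
chunker = RegexpChunkParser([ur, el, er])
sent = [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')]
tree = chunker.parse(sent)
print(tree)
```

Instead of a flat sentence, the parse now returns a tree with a single NP subtree covering all three words, which is the state the sentence was in just before the UnChunkRule was applied.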

How it works...

The rules mentioned earlier were applied in the following order, starting with the flat sentence tree:

  1. Make single nouns into a chunk.
  2. Expand left determiners into chunks that begin with a noun.
  3. Expand right plural nouns into chunks that end with a noun, chunking the whole sentence.
  4. Unchunk every chunk that is a determiner + noun + plural noun, resulting in the original sentence tree.

Here's the code showing each step:

>>> from nltk.chunk.regexp import ChunkString
>>> from nltk.tree import Tree
>>> cs = ChunkString(Tree('S', sent))
>>> cs
<ChunkString: '<DT><NN><NNS>'>
>>> ur.apply(cs)
>>> cs
<ChunkString: '<DT>{<NN>}<NNS>'>
>>> el.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}<NNS>'>
>>> er.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><NNS>}'>
>>> un.apply(cs)
>>> cs
<ChunkString: '<DT><NN><NNS>'>
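A ChunkString can be converted back into a tree at any point with its to_chunkstruct() method. As a sketch, applying only the first three rules and then converting shows the fully expanded chunk as a subtree (CHUNK is to_chunkstruct()'s default label):

```python
from nltk.chunk.regexp import ChunkString, ChunkRule, ExpandLeftRule, ExpandRightRule
from nltk.tree import Tree

sent = [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')]
cs = ChunkString(Tree('S', sent))

# Apply the chunking and expansion rules, but stop before the UnChunkRule
for rule in (ChunkRule('<NN>', 'single noun'),
             ExpandLeftRule('<DT>', '<NN>', 'get left determiner'),
             ExpandRightRule('<NN>', '<NNS>', 'get right plural noun')):
    rule.apply(cs)

tree = cs.to_chunkstruct()
print(tree)
```

This is handy when you want to inspect the intermediate tree at a particular step, rather than reading the brace notation of the ChunkString.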

There's more...

In practice, you can probably get away with only using the previous four rules: ChunkRule, ChinkRule, MergeRule, and SplitRule. But if you do need very fine-grained control over chunk parsing and removing chunks, now you know how to do it with the expansion and unchunk rules.

See also

The previous two recipes covered the more common chunk rules that are supported by RegexpChunkRule.fromstring() and RegexpParser.
