Expanding and removing chunks with regular expressions

There are three RegexpChunkRule subclasses that are not supported by RegexpChunkRule.fromstring() or RegexpParser, and therefore must be created manually if you want to use them. These rules are as follows:

  • ExpandLeftRule: Add unchunked (chink) words to the left of a chunk
  • ExpandRightRule: Add unchunked (chink) words to the right of a chunk
  • UnChunkRule: Unchunk any matching chunk

How to do it...

ExpandLeftRule and ExpandRightRule both take two patterns, plus a description, as arguments. For ExpandLeftRule, the left pattern matches the chink we want to add to the beginning of the chunk, and the right pattern matches the beginning of the chunk we want to expand. With ExpandRightRule, the left pattern matches the end of the chunk we want to expand, and the right pattern matches the chink we want to add to the end of the chunk. The idea is similar to the MergeRule class, except that here we're merging chink words into a chunk instead of merging two chunks.

UnChunkRule is the opposite of ChunkRule. Any chunk that exactly matches the UnChunkRule pattern will be unchunked and become a chink. Here's some code demonstrating the usage with the RegexpChunkParser class:

>>> from nltk.chunk.regexp import ChunkRule, ExpandLeftRule, ExpandRightRule, UnChunkRule
>>> from nltk.chunk import RegexpChunkParser
>>> ur = ChunkRule('<NN>', 'single noun')
>>> el = ExpandLeftRule('<DT>', '<NN>', 'get left determiner')
>>> er = ExpandRightRule('<NN>', '<NNS>', 'get right plural noun')
>>> un = UnChunkRule('<DT><NN.*>*', 'unchunk everything')
>>> chunker = RegexpChunkParser([ur, el, er, un])
>>> sent = [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')]
>>> chunker.parse(sent)
Tree('S', [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')])

You'll notice that the end result is a flat sentence, which is exactly what we started with. That's because the final UnChunkRule undid the chunk created by the previous rules. Read on to see what happened step by step.
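To see what the expansion rules produce when the final UnChunkRule is left out, here's a sketch using the same rules minus un (it assumes NLTK's default chunk label of NP for RegexpChunkParser):

```python
from nltk.chunk import RegexpChunkParser
from nltk.chunk.regexp import ChunkRule, ExpandLeftRule, ExpandRightRule

ur = ChunkRule('<NN>', 'single noun')
el = ExpandLeftRule('<DT>', '<NN>', 'get left determiner')
er = ExpandRightRule('<NN>', '<NNS>', 'get right plural noun')

# Same rules as before, but without the UnChunkRule,
# so the fully expanded chunk survives in the result
chunker = RegexpChunkParser([ur, el, er])
sent = [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')]
tree = chunker.parse(sent)
print(tree)
```

Instead of a flat sentence, the parse now returns a tree with a single NP subtree covering all three words, which is the state the sentence was in just before the UnChunkRule was applied.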

How it works...

The rules mentioned earlier were applied in the following order, starting with the flat sentence tree:

  1. Make single nouns into a chunk.
  2. Expand left determiners into chunks that begin with a noun.
  3. Expand right plural nouns into chunks that end with a noun, chunking the whole sentence.
  4. Unchunk every chunk that is a determiner + noun + plural noun, resulting in the original sentence tree.

Here's the code showing each step:

>>> from nltk.chunk.regexp import ChunkString
>>> from nltk.tree import Tree
>>> cs = ChunkString(Tree('S', sent))
>>> cs
<ChunkString: '<DT><NN><NNS>'>
>>> ur.apply(cs)
>>> cs
<ChunkString: '<DT>{<NN>}<NNS>'>
>>> el.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}<NNS>'>
>>> er.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><NNS>}'>
>>> un.apply(cs)
>>> cs
<ChunkString: '<DT><NN><NNS>'>
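A ChunkString can be converted back into a tree at any point with its to_chunkstruct() method. As a sketch, applying only the first three rules and then converting shows the fully expanded chunk as a subtree (CHUNK is to_chunkstruct()'s default label):

```python
from nltk.chunk.regexp import ChunkString, ChunkRule, ExpandLeftRule, ExpandRightRule
from nltk.tree import Tree

sent = [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')]
cs = ChunkString(Tree('S', sent))

# Apply the chunking and expansion rules, but stop before the UnChunkRule
for rule in (ChunkRule('<NN>', 'single noun'),
             ExpandLeftRule('<DT>', '<NN>', 'get left determiner'),
             ExpandRightRule('<NN>', '<NNS>', 'get right plural noun')):
    rule.apply(cs)

tree = cs.to_chunkstruct()
print(tree)
```

This is handy when you want to inspect the intermediate tree at a particular step, rather than reading the brace notation of the ChunkString.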

There's more...

In practice, you can probably get away with only using the previous four rules: ChunkRule, ChinkRule, MergeRule, and SplitRule. But if you do need very fine-grained control over chunk parsing and removing chunks, now you know how to do it with the expansion and unchunk rules.

See also

The previous two recipes covered the more common chunk rules that are supported by RegexpChunkRule.fromstring() and RegexpParser.
