There are three RegexpChunkRule subclasses that are not supported by RegexpChunkRule.fromstring() or RegexpParser, and therefore must be created manually if you want to use them. These rules are as follows:
ExpandLeftRule: Add unchunked (chink) words to the left of a chunk
ExpandRightRule: Add unchunked (chink) words to the right of a chunk
UnChunkRule: Unchunk any matching chunk
ExpandLeftRule and ExpandRightRule both take two patterns along with a description as arguments. For ExpandLeftRule, the left pattern matches the chink we want to add to the beginning of the chunk, while the right pattern matches the beginning of the chunk we want to expand. With ExpandRightRule, the left pattern should match the end of the chunk we want to expand, and the right pattern matches the chink we want to add to the end of the chunk. The idea is similar to the MergeRule class, but in this case, we're merging chink words instead of other chunks.
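To make the left and right patterns concrete, here is a minimal sketch of the idea using only the standard re module. It is a simplified model, not NLTK's actual implementation: the helper names expand_left and expand_right are hypothetical, and the patterns are assumed to be plain tag sequences such as <DT> written directly as regular expressions. Chunks are marked with curly braces, so expanding a chunk amounts to moving a brace past the matching chink.

```python
import re

def expand_left(chunk_str, left, right):
    """Hypothetical helper: pull a chink matching `left` into a chunk
    whose beginning matches `right`."""
    # The opening brace moves left: '<DT>{<NN>' becomes '{<DT><NN>'.
    pattern = re.compile(r"(?P<left>%s)\{(?P<right>%s)" % (left, right))
    return pattern.sub(r"{\g<left>\g<right>", chunk_str)

def expand_right(chunk_str, left, right):
    """Hypothetical helper: pull a chink matching `right` into a chunk
    whose end matches `left`."""
    # The closing brace moves right: '<NN>}<NNS>' becomes '<NN><NNS>}'.
    pattern = re.compile(r"(?P<left>%s)\}(?P<right>%s)" % (left, right))
    return pattern.sub(r"\g<left>\g<right>}", chunk_str)

# Start from a sentence in which only the noun is chunked.
step1 = expand_left("<DT>{<NN>}<NNS>", "<DT>", "<NN>")
step2 = expand_right(step1, "<NN>", "<NNS>")
print(step1)  # {<DT><NN>}<NNS>
print(step2)  # {<DT><NN><NNS>}
```

In both helpers, the pattern on the chunk side only has to match the edge of the chunk, not the whole chunk, which is why <NN> is enough to anchor the expansion.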
UnChunkRule is the opposite of ChunkRule. Any chunk that exactly matches the UnChunkRule pattern will be unchunked and become a chink. Here's some code demonstrating the usage with the RegexpChunkParser class:
>>> from nltk.chunk.regexp import ChunkRule, ExpandLeftRule, ExpandRightRule, UnChunkRule
>>> from nltk.chunk import RegexpChunkParser
>>> ur = ChunkRule('<NN>', 'single noun')
>>> el = ExpandLeftRule('<DT>', '<NN>', 'get left determiner')
>>> er = ExpandRightRule('<NN>', '<NNS>', 'get right plural noun')
>>> un = UnChunkRule('<DT><NN.*>*', 'unchunk everything')
>>> chunker = RegexpChunkParser([ur, el, er, un])
>>> sent = [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')]
>>> chunker.parse(sent)
Tree('S', [('the', 'DT'), ('sushi', 'NN'), ('rolls', 'NNS')])
You'll notice that the end result is a flat sentence, which is exactly what we started with. That's because the final UnChunkRule undid the chunk created by the previous rules. Read on to see what happened step by step.
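The rules mentioned earlier were applied in the following order, starting with the flat sentence tree:
1. ChunkRule makes the single noun into a chunk.
2. ExpandLeftRule expands the chunk to include the determiner on its left.
3. ExpandRightRule expands the chunk to include the plural noun on its right.
4. UnChunkRule unchunks the determiner + noun + plural noun chunk, resulting in the original sentence tree.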
Here's the code showing each step:
>>> from nltk.chunk.regexp import ChunkString
>>> from nltk.tree import Tree
>>> cs = ChunkString(Tree('S', sent))
>>> cs
<ChunkString: '<DT><NN><NNS>'>
>>> ur.apply(cs)
>>> cs
<ChunkString: '<DT>{<NN>}<NNS>'>
>>> el.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}<NNS>'>
>>> er.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><NNS>}'>
>>> un.apply(cs)
>>> cs
<ChunkString: '<DT><NN><NNS>'>
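The requirement that a chunk must exactly match the UnChunkRule pattern can be illustrated with a small standard-library sketch. As an assumption for illustration, the hypothetical unchunk helper below works on a brace-marked chunk string and takes an ordinary regular expression instead of an NLTK tag pattern; it is a simplified model, not NLTK's own code.

```python
import re

def unchunk(chunk_str, pattern):
    """Hypothetical helper: remove the braces around any chunk whose
    entire contents match `pattern`."""
    # Both braces are part of the match, so a pattern that covers only
    # part of the chunk leaves the chunk untouched.
    regexp = re.compile(r"\{(?P<chunk>%s)\}" % pattern)
    return regexp.sub(r"\g<chunk>", chunk_str)

chunked = "{<DT><NN><NNS>}"

# A determiner followed by any number of noun tags covers the whole chunk.
whole = unchunk(chunked, r"<DT>(?:<NN[^>]*>)*")

# A lone determiner matches only the start of the chunk, so nothing changes.
partial = unchunk(chunked, r"<DT>")

print(whole)    # <DT><NN><NNS>
print(partial)  # {<DT><NN><NNS>}
```

This mirrors why the pattern in the example has to cover the determiner and every noun tag that follows it: a narrower pattern would leave the chunk in place.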