In this recipe, we'll cover two more rules for chunking. A MergeRule class can merge two chunks together based on the end of the first chunk and the beginning of the second chunk. A SplitRule class will split a chunk into two chunks based on the specified split pattern.
A SplitRule class is specified with two opposing curly braces surrounded by a pattern on either side. To split a chunk after a noun, you would use <NN.*>}{<.*>. A MergeRule class is specified by flipping the curly braces, and will join chunks where the end of the first chunk matches the left pattern and the beginning of the next chunk matches the right pattern. To merge two chunks where the first ends with a noun and the second begins with a noun, you'd use <NN.*>{}<NN.*>.
An example of splitting and merging, starting with the sentence tree, is shown next:
The whole sentence is chunked, as shown in the following diagram:
The chunk is split into multiple chunks after every noun, as shown in the following tree:
Each chunk with a determiner is split into separate chunks, creating four chunks where there were three:
Chunks ending with a noun are merged with the next chunk if it begins with a noun, reducing the four chunks back down to three, as shown in the following diagram:
Using the RegexpParser class, the code looks like this:
>>> from nltk.chunk import RegexpParser
>>> chunker = RegexpParser(r'''
... NP:
...     {<DT><.*>*<NN.*>}
...     <NN.*>}{<.*>
...     <.*>}{<DT>
...     <NN.*>{}<NN.*>
... ''')
>>> sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]
>>> chunker.parse(sent)
Tree('S', [Tree('NP', [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN')]), Tree('NP', [('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN')]), Tree('NP', [('the', 'DT'), ('fish', 'NN')])])
The final tree of NP chunks is shown in the following diagram:
The MergeRule and SplitRule classes take three arguments: the left pattern, the right pattern, and a description of the rule. The RegexpParser class takes care of splitting the original patterns on the curly braces to get the left and right sides, but you can also create these rules manually. Here's a step-by-step walkthrough of how the original sentence is modified by applying each rule:
>>> from nltk.tree import Tree
>>> from nltk.chunk.regexp import ChunkString, ChunkRule, MergeRule, SplitRule
>>> cs = ChunkString(Tree('S', sent))
>>> cs
<ChunkString: '<DT><NN><NN><VBD><VBN><IN><DT><NN>'>
>>> ur = ChunkRule('<DT><.*>*<NN.*>', 'chunk determiner to noun')
>>> ur.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><NN><VBD><VBN><IN><DT><NN>}'>
>>> sr1 = SplitRule('<NN.*>', '<.*>', 'split after noun')
>>> sr1.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}{<NN>}{<VBD><VBN><IN><DT><NN>}'>
>>> sr2 = SplitRule('<.*>', '<DT>', 'split before determiner')
>>> sr2.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}{<NN>}{<VBD><VBN><IN>}{<DT><NN>}'>
>>> mr = MergeRule('<NN.*>', '<NN.*>', 'merge nouns')
>>> mr.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><NN>}{<VBD><VBN><IN>}{<DT><NN>}'>
>>> cs.to_chunkstruct()
Tree('S', [Tree('CHUNK', [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN')]), Tree('CHUNK', [('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN')]), Tree('CHUNK', [('the', 'DT'), ('fish', 'NN')])])
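To make the walkthrough above more concrete, here is a minimal sketch of how a split or a merge could be expressed as a regular expression substitution on the chunk string. It uses only the standard re module and is not NLTK's actual implementation. The helper names tag_pattern_to_regex, split_chunks, and merge_chunks are hypothetical, the tag pattern conversion is deliberately simplified, and the sketch assumes every tag is already inside a chunk.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    # Simplified assumption: '.' in a tag pattern may match any character
    # except the chunk-string delimiters '{', '}', '<' and '>'.
    return tag_pattern.replace(".", "[^{}<>]")


def split_chunks(chunk_string, left, right):
    # Insert a chunk boundary '}{' wherever `left` is directly followed by `right`.
    regex = re.compile(
        "(?P<left>%s)(?=%s)"
        % (tag_pattern_to_regex(left), tag_pattern_to_regex(right))
    )
    return regex.sub(r"\g<left>}{", chunk_string)


def merge_chunks(chunk_string, left, right):
    # Remove the boundary '}{' wherever one chunk ends with `left`
    # and the next chunk begins with `right`.
    regex = re.compile(
        r"(?P<left>%s)\}\{(?=%s)"
        % (tag_pattern_to_regex(left), tag_pattern_to_regex(right))
    )
    return regex.sub(r"\g<left>", chunk_string)


# Start from the fully chunked sentence and replay the three rules.
cs = "{<DT><NN><NN><VBD><VBN><IN><DT><NN>}"
after_split_noun = split_chunks(cs, "<NN.*>", "<.*>")
after_split_det = split_chunks(after_split_noun, "<.*>", "<DT>")
after_merge = merge_chunks(after_split_det, "<NN.*>", "<NN.*>")
print(after_split_noun)
print(after_split_det)
print(after_merge)
```

Each helper returns a new string rather than modifying its input, which makes the intermediate states easy to compare with the ChunkString values in the walkthrough.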
The parsing of the rules and splitting of left and right patterns is done in the static fromstring() method of the RegexpChunkRule superclass. This is called by the RegexpParser class to get the list of rules to pass into the RegexpChunkParser class. Here are some examples of parsing the patterns we used earlier:
>>> from nltk.chunk.regexp import RegexpChunkRule
>>> RegexpChunkRule.fromstring('{<DT><.*>*<NN.*>}')
<ChunkRule: '<DT><.*>*<NN.*>'>
>>> RegexpChunkRule.fromstring('<.*>}{<DT>')
<SplitRule: '<.*>', '<DT>'>
>>> RegexpChunkRule.fromstring('<NN.*>{}<NN.*>')
<MergeRule: '<NN.*>', '<NN.*>'>
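The idea of splitting a rule string on its curly braces can be sketched in a few lines of plain Python. This is an illustration only, not the library's parser: classify_rule is a hypothetical helper, and it handles just the three rule shapes used in this recipe.

```python
def classify_rule(rule):
    """Return (kind, patterns) for a chunk, split, or merge rule string."""
    rule = rule.strip()
    # '{pattern}' is a chunk rule with a single pattern.
    if rule.startswith("{") and rule.endswith("}"):
        return ("chunk", (rule[1:-1],))
    # 'left}{right' is a split rule.
    if "}{" in rule:
        left, right = rule.split("}{", 1)
        return ("split", (left, right))
    # 'left{}right' is a merge rule.
    if "{}" in rule:
        left, right = rule.split("{}", 1)
        return ("merge", (left, right))
    raise ValueError("unsupported rule: %r" % rule)


for rule in ["{<DT><.*>*<NN.*>}", "<.*>}{<DT>", "<NN.*>{}<NN.*>"]:
    print(classify_rule(rule))
```

The chunk check comes first so that a pattern fully wrapped in braces is never mistaken for a split or merge rule.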
Descriptions for each rule can be specified with a comment string after the rule (a comment string must start with #). If no comment string is found, the rule's description will be empty. Here's an example:
>>> RegexpChunkRule.fromstring('{<DT><.*>*<NN.*>} # chunk everything').descr()
'chunk everything'
>>> RegexpChunkRule.fromstring('{<DT><.*>*<NN.*>}').descr()
''
Comment string descriptions can also be used within grammar strings that are passed to RegexpParser.
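As a sketch of what that might look like, the grammar from earlier in this recipe can be annotated with one comment per rule. The comment wording is our own, and the example assumes NLTK 3 is installed. The comments deliberately avoid colons, on the assumption that the grammar reader treats a colon as the start of a new stage.

```python
from nltk.chunk import RegexpParser

# Same four rules as before, each followed by a '#' comment description.
grammar = r'''
NP:
    {<DT><.*>*<NN.*>}   # chunk determiner to noun
    <NN.*>}{<.*>        # split after noun
    <.*>}{<DT>          # split before determiner
    <NN.*>{}<NN.*>      # merge nouns
'''
chunker = RegexpParser(grammar)

sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]
tree = chunker.parse(sent)

# Collect the words of each NP chunk to compare with the earlier result.
np_chunks = [[word for word, tag in subtree.leaves()]
             for subtree in tree.subtrees(lambda t: t.label() == 'NP')]
print(np_chunks)
```

The comments do not change how the sentence is chunked; they only attach a description to each rule.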