Merging and splitting chunks with regular expressions

In this recipe, we'll cover two more rules for chunking. The MergeRule class merges two adjacent chunks based on the end of the first chunk and the beginning of the second chunk. The SplitRule class splits a chunk into two chunks based on the specified split pattern.

How to do it...

A SplitRule class is specified with two opposing curly braces surrounded by a pattern on either side. To split a chunk after a noun, you would do <NN.*>}{<.*>. A MergeRule class is specified by flipping the curly braces, and will join chunks where the end of the first chunk matches the left pattern and the beginning of the next chunk matches the right pattern. To merge two chunks where the first ends with a noun and the second begins with a noun, you'd use <NN.*>{}<NN.*>.
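To make the meaning of the two patterns concrete, here is a minimal sketch that mimics splitting and merging on a plain chunk string using only the standard re module. It is a simplified stand-in for illustration, not NLTK's implementation; the helper names split_chunks and merge_chunks and the NOUN and ANY_TAG patterns are hypothetical.

```python
import re

# Hypothetical tag patterns, written directly as regular expressions.
NOUN = r'<NN[^<>{}]*>'     # roughly what <NN.*> means
ANY_TAG = r'<[^<>{}]*>'    # roughly what <.*> means

def split_chunks(chunk_str, left, right):
    """Insert a chunk boundary wherever `left` is directly followed by `right`."""
    return re.sub(r'(%s)(?=%s)' % (left, right), r'\1}{', chunk_str)

def merge_chunks(chunk_str, left, right):
    """Remove a chunk boundary that has `left` before it and `right` after it."""
    return re.sub(r'(%s)\}\{(?=%s)' % (left, right), r'\1', chunk_str)

chunked = '{<DT><NN><NN><VBD>}'
split = split_chunks(chunked, NOUN, ANY_TAG)   # like <NN.*>}{<.*>
merged = merge_chunks(split, NOUN, NOUN)       # like <NN.*>{}<NN.*>
print(split)
print(merged)
```

The split inserts }{ after each noun that is followed by another tag, and the merge removes the boundary again only where a noun sits on both sides of it.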

Note

Note that the order of rules is very important, and reordering can affect the results. The RegexpParser class applies the rules one at a time from top to bottom, so each rule will be applied to the ChunkString resulting from the previous rule.
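As a rough illustration of why order matters, the following sketch compares the grammar used later in this recipe with a reordered variant in which the merge rule runs before the two split rules. The names ordered, reordered, and np_chunks are illustrative only.

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]

# Chunk, split twice, then merge (the order used in this recipe).
ordered = RegexpParser(r'''
NP:
    {<DT><.*>*<NN.*>}
    <NN.*>}{<.*>
    <.*>}{<DT>
    <NN.*>{}<NN.*>
''')

# Same rules, but the merge rule now runs before the splits,
# when there is still only one chunk and nothing to merge.
reordered = RegexpParser(r'''
NP:
    {<DT><.*>*<NN.*>}
    <NN.*>{}<NN.*>
    <NN.*>}{<.*>
    <.*>}{<DT>
''')

def np_chunks(tree):
    """Return the leaves of each top-level chunk in the parsed tree."""
    return [subtree.leaves() for subtree in tree if isinstance(subtree, Tree)]

ordered_chunks = np_chunks(ordered.parse(sent))
reordered_chunks = np_chunks(reordered.parse(sent))
print(len(ordered_chunks), len(reordered_chunks))
```

With the merge rule moved up, "sushi" and "roll" stay in separate chunks, because the splits that separate them happen after the only merge pass.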

An example of splitting and merging, starting with the sentence tree, is shown next:

The whole sentence is chunked, as shown in the following diagram:

The chunk is split into multiple chunks after every noun, as shown in the following tree:

Each chunk is then split before any determiner that follows another tag, creating four chunks where there were three:

Chunks ending with a noun are merged with the next chunk if it begins with a noun, reducing the four chunks back down to three, as shown in the following diagram:

Using the RegexpParser class, the code looks like this:

>>> from nltk.chunk import RegexpParser
>>> chunker = RegexpParser(r'''
... NP:
... {<DT><.*>*<NN.*>}
... <NN.*>}{<.*>
... <.*>}{<DT>
... <NN.*>{}<NN.*>
... ''')
>>> sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]
>>> chunker.parse(sent)
Tree('S', [Tree('NP', [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN')]), Tree('NP', [('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN')]), Tree('NP', [('the', 'DT'), ('fish', 'NN')])])

And the final tree of NP chunks is shown in the following diagram:

How it works...

The MergeRule and SplitRule classes take a left pattern and a right pattern, followed by a description string. The RegexpParser class takes care of splitting the original patterns on the curly braces to get the left and right sides, but you can also create the rules manually. Here's a step-by-step walkthrough of how the original sentence is modified by applying each rule:

>>> from nltk.tree import Tree
>>> from nltk.chunk.regexp import ChunkString, ChunkRule, MergeRule, SplitRule
>>> cs = ChunkString(Tree('S', sent))
>>> cs
<ChunkString: '<DT><NN><NN><VBD><VBN><IN><DT><NN>'>
>>> ur = ChunkRule('<DT><.*>*<NN.*>', 'chunk determiner to noun')
>>> ur.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><NN><VBD><VBN><IN><DT><NN>}'>
>>> sr1 = SplitRule('<NN.*>', '<.*>', 'split after noun')
>>> sr1.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}{<NN>}{<VBD><VBN><IN><DT><NN>}'>
>>> sr2 = SplitRule('<.*>', '<DT>', 'split before determiner')
>>> sr2.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}{<NN>}{<VBD><VBN><IN>}{<DT><NN>}'>
>>> mr = MergeRule('<NN.*>', '<NN.*>', 'merge nouns')
>>> mr.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><NN>}{<VBD><VBN><IN>}{<DT><NN>}'>
>>> cs.to_chunkstruct()
Tree('S', [Tree('CHUNK', [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN')]), Tree('CHUNK', [('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN')]), Tree('CHUNK', [('the', 'DT'), ('fish', 'NN')])])
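The walkthrough above can also be wrapped in a small helper that applies a list of rules in order and records the ChunkString after each step. This is only a sketch; apply_rules and the trace list are hypothetical names, not part of NLTK.

```python
from nltk.tree import Tree
from nltk.chunk.regexp import ChunkString, ChunkRule, MergeRule, SplitRule

def apply_rules(tagged_sent, rules):
    """Apply each rule in order, recording the ChunkString after every step."""
    cs = ChunkString(Tree('S', tagged_sent))
    trace = [('initial', repr(cs))]
    for rule in rules:
        rule.apply(cs)  # modifies cs in place
        trace.append((rule.descr(), repr(cs)))
    return cs, trace

sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]

rules = [
    ChunkRule('<DT><.*>*<NN.*>', 'chunk determiner to noun'),
    SplitRule('<NN.*>', '<.*>', 'split after noun'),
    SplitRule('<.*>', '<DT>', 'split before determiner'),
    MergeRule('<NN.*>', '<NN.*>', 'merge nouns'),
]

cs, trace = apply_rules(sent, rules)
for description, state in trace:
    print('%-26s %s' % (description, state))

final_tree = cs.to_chunkstruct()
```

Printing the trace makes it easy to see which rule introduced or removed a given chunk boundary when a grammar does not behave as expected.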

There's more...

The parsing of the rules and splitting of left and right patterns is done in the static fromstring() method of the RegexpChunkRule superclass. This is called by the RegexpParser class to get the list of rules to pass into the RegexpChunkParser class. Here are some examples of parsing the patterns we used earlier:

>>> from nltk.chunk.regexp import RegexpChunkRule
>>> RegexpChunkRule.fromstring('{<DT><.*>*<NN.*>}')
<ChunkRule: '<DT><.*>*<NN.*>'>
>>> RegexpChunkRule.fromstring('<.*>}{<DT>')
<SplitRule: '<.*>', '<DT>'>
>>> RegexpChunkRule.fromstring('<NN.*>{}<NN.*>')
<MergeRule: '<NN.*>', '<NN.*>'>
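As a sketch of what RegexpParser does with these parsed rules, you can build the rule list yourself and hand it to RegexpChunkParser. The variable names patterns, rules, and manual_chunker are illustrative, and the example relies on the parser's default chunk label.

```python
from nltk.tree import Tree
from nltk.chunk.regexp import RegexpChunkParser, RegexpChunkRule

sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]

# The same four patterns as in the grammar, in the same order.
patterns = [
    '{<DT><.*>*<NN.*>}',   # chunk
    '<NN.*>}{<.*>',        # split after noun
    '<.*>}{<DT>',          # split before determiner
    '<NN.*>{}<NN.*>',      # merge nouns
]
rules = [RegexpChunkRule.fromstring(pattern) for pattern in patterns]

# Apply the rules in list order to a sentence wrapped in a Tree.
manual_chunker = RegexpChunkParser(rules)
manual_tree = manual_chunker.parse(Tree('S', sent))
print(manual_tree)
```

Building the parser this way can be handy when rules are generated programmatically rather than written out as a grammar string.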

Specifying rule descriptions

Descriptions for each rule can be specified with a comment string after the rule (a comment string must start with #). If no comment string is found, the rule's description will be empty. Here's an example:

>>> RegexpChunkRule.fromstring('{<DT><.*>*<NN.*>} # chunk everything').descr()
'chunk everything'
>>> RegexpChunkRule.fromstring('{<DT><.*>*<NN.*>}').descr()
''

Comment string descriptions can also be used within grammar strings that are passed to RegexpParser.
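For instance, the grammar from this recipe could be annotated as in the following sketch. The description wording is only a suggestion, and the comparison with an uncommented grammar is there to show that the comments do not change the parse.

```python
from nltk.chunk import RegexpParser

sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]

# Each rule carries a trailing comment that becomes its description.
commented = RegexpParser(r'''
NP:
    {<DT><.*>*<NN.*>}    # chunk determiner to noun
    <NN.*>}{<.*>         # split after noun
    <.*>}{<DT>           # split before determiner
    <NN.*>{}<NN.*>       # merge nouns
''')

# The same grammar without descriptions, for comparison.
plain = RegexpParser(r'''
NP:
    {<DT><.*>*<NN.*>}
    <NN.*>}{<.*>
    <.*>}{<DT>
    <NN.*>{}<NN.*>
''')

commented_tree = commented.parse(sent)
plain_tree = plain.parse(sent)
print(commented_tree == plain_tree)
```

Descriptions are mainly useful as documentation for longer grammars, where the intent of each pattern is not obvious at a glance.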

See also

The previous recipe goes over how to use ChunkRule, and how rules are passed into RegexpChunkParser.
