Merging and splitting chunks with regular expressions

In this recipe, we'll cover two more rules for chunking. A MergeRule class merges two adjacent chunks when the end of the first chunk and the beginning of the second chunk match the specified patterns. A SplitRule class splits a chunk into two chunks at the point where the specified left and right patterns meet.

How to do it...

A SplitRule class is specified with two opposing curly braces, }{, with a pattern on either side. To split a chunk after a noun, you would write <NN.*>}{<.*>. A MergeRule class is specified by flipping the curly braces to {}, and joins two adjacent chunks where the end of the first chunk matches the left pattern and the beginning of the next chunk matches the right pattern. To merge two chunks where the first ends with a noun and the second begins with a noun, you'd use <NN.*>{}<NN.*>.

Note

Note that the order of rules is very important, and reordering can affect the results. The RegexpParser class applies the rules one at a time from top to bottom, so each rule will be applied to the ChunkString resulting from the previous rule.

An example of splitting and merging, starting with the sentence tree, is shown next:

The whole sentence is chunked, as shown in the following diagram:

The chunk is split into multiple chunks after every noun, as shown in the following tree:

Each chunk with a determiner is split into separate chunks, creating four chunks where there were three:

Chunks ending with a noun are merged with the next chunk if it begins with a noun, reducing the four chunks back down to three, as shown in the following diagram:

Using the RegexpParser class, the code looks like this:

>>> from nltk.chunk.regexp import RegexpParser
>>> chunker = RegexpParser(r'''
... NP:
... {<DT><.*>*<NN.*>}
... <NN.*>}{<.*>
... <.*>}{<DT>
... <NN.*>{}<NN.*>
... ''')
>>> sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]
>>> chunker.parse(sent)
Tree('S', [Tree('NP', [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN')]), Tree('NP', [('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN')]), Tree('NP', [('the', 'DT'), ('fish', 'NN')])])

And the final tree of NP chunks is shown in the following diagram:
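The earlier note about rule order can be checked against this same sentence. The following is a minimal sketch, assuming NLTK 3 is installed. The `reordered` grammar and the `chunk_words` helper are illustrative names that are not part of the recipe. Moving the merge rule ahead of the two split rules leaves it nothing to merge, because the sentence is still a single chunk at that point, so the later splits are never undone.

```python
from nltk.chunk.regexp import RegexpParser

sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]

# Order used in the recipe: chunk, split after noun,
# split before determiner, then merge adjacent nouns.
original = RegexpParser(r'''
NP:
    {<DT><.*>*<NN.*>}
    <NN.*>}{<.*>
    <.*>}{<DT>
    <NN.*>{}<NN.*>
''')

# Same four rules, but the merge rule now runs before the split rules.
reordered = RegexpParser(r'''
NP:
    {<DT><.*>*<NN.*>}
    <NN.*>{}<NN.*>
    <NN.*>}{<.*>
    <.*>}{<DT>
''')


def chunk_words(tree):
    """Hypothetical helper: return the words of each NP chunk."""
    return [[word for word, tag in np.leaves()]
            for np in tree.subtrees(lambda t: t.label() == 'NP')]


original_chunks = chunk_words(original.parse(sent))
reordered_chunks = chunk_words(reordered.parse(sent))

print(original_chunks)   # three chunks
print(reordered_chunks)  # four chunks: 'the sushi' and 'roll' stay separate
```

With the merge rule first, 'the sushi' and 'roll' remain separate chunks, which is why the merge rule is placed last in the recipe.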

How it works...

The MergeRule and SplitRule classes take three arguments: the left pattern, the right pattern, and a description. The RegexpParser class takes care of splitting the original patterns on the curly braces to get the left and right sides, but you can also create the rules manually. Here's a step-by-step walkthrough of how the original sentence is modified by applying each rule:

>>> from nltk.tree import Tree
>>> from nltk.chunk.regexp import ChunkString, ChunkRule, MergeRule, SplitRule
>>> cs = ChunkString(Tree('S', sent))
>>> cs
<ChunkString: '<DT><NN><NN><VBD><VBN><IN><DT><NN>'>
>>> ur = ChunkRule('<DT><.*>*<NN.*>', 'chunk determiner to noun')
>>> ur.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><NN><VBD><VBN><IN><DT><NN>}'>
>>> sr1 = SplitRule('<NN.*>', '<.*>', 'split after noun')
>>> sr1.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}{<NN>}{<VBD><VBN><IN><DT><NN>}'>
>>> sr2 = SplitRule('<.*>', '<DT>', 'split before determiner')
>>> sr2.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}{<NN>}{<VBD><VBN><IN>}{<DT><NN>}'>
>>> mr = MergeRule('<NN.*>', '<NN.*>', 'merge nouns')
>>> mr.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><NN>}{<VBD><VBN><IN>}{<DT><NN>}'>
>>> cs.to_chunkstruct()
Tree('S', [Tree('CHUNK', [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN')]), Tree('CHUNK', [('was', 'VBD'), ('filled', 'VBN'), ('with', 'IN')]), Tree('CHUNK', [('the', 'DT'), ('fish', 'NN')])])

There's more...

The parsing of the rules and splitting of left and right patterns is done in the static fromstring() method of the RegexpChunkRule superclass. The RegexpParser class calls this method to get the list of rules to pass into the RegexpChunkParser class. Here are some examples of parsing the patterns we used earlier:

>>> from nltk.chunk.regexp import RegexpChunkRule
>>> RegexpChunkRule.fromstring('{<DT><.*>*<NN.*>}')
<ChunkRule: '<DT><.*>*<NN.*>'>
>>> RegexpChunkRule.fromstring('<.*>}{<DT>')
<SplitRule: '<.*>', '<DT>'>
>>> RegexpChunkRule.fromstring('<NN.*>{}<NN.*>')
<MergeRule: '<NN.*>', '<NN.*>'>
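
To tie these pieces together, here is a rough sketch of doing by hand what RegexpParser is described as doing: each rule string is turned into a rule object with fromstring(), and the resulting list is passed to a RegexpChunkParser. It assumes NLTK 3. The variable names are illustrative, and this is not a copy of RegexpParser's internal code.

```python
from nltk.tree import Tree
from nltk.chunk.regexp import RegexpChunkParser, RegexpChunkRule

sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]

# The four rule strings from the NP grammar, in the same order.
rule_strings = [
    '{<DT><.*>*<NN.*>}',
    '<NN.*>}{<.*>',
    '<.*>}{<DT>',
    '<NN.*>{}<NN.*>',
]

# fromstring() picks the rule class based on where the braces appear.
rules = [RegexpChunkRule.fromstring(s) for s in rule_strings]
rule_types = [type(rule).__name__ for rule in rules]
print(rule_types)

# A RegexpChunkParser applies the rules in list order and labels
# every resulting chunk with chunk_label.
parser = RegexpChunkParser(rules, chunk_label='NP')
tree = parser.parse(Tree('S', sent))
print(tree)
```

The resulting tree should have the same three NP chunks as the one produced by the RegexpParser grammar earlier in the recipe.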

Specifying rule descriptions

Descriptions for each rule can be specified with a comment string after the rule (a comment string must start with #). If no comment string is found, the rule's description will be empty. Here's an example:

>>> RegexpChunkRule.fromstring('{<DT><.*>*<NN.*>} # chunk everything').descr()
'chunk everything'
>>> RegexpChunkRule.fromstring('{<DT><.*>*<NN.*>}').descr()
''

Comment string descriptions can also be used within grammar strings that are passed to RegexpParser.
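
As a small sketch of that last point, the grammar below repeats the NP grammar from this recipe with a description after each rule, reusing the description text from the walkthrough. It assumes NLTK 3. The descriptions deliberately avoid the : character as a precaution, since RegexpParser uses : to mark the start of a stage such as NP:.

```python
from nltk.chunk.regexp import RegexpParser

sent = [('the', 'DT'), ('sushi', 'NN'), ('roll', 'NN'), ('was', 'VBD'),
        ('filled', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('fish', 'NN')]

# Each rule is followed by a '#' comment that becomes its description.
commented = RegexpParser(r'''
NP:
    {<DT><.*>*<NN.*>}    # chunk determiner to noun
    <NN.*>}{<.*>         # split after noun
    <.*>}{<DT>           # split before determiner
    <NN.*>{}<NN.*>       # merge nouns
''')

# The same grammar without descriptions, for comparison.
plain = RegexpParser(r'''
NP:
    {<DT><.*>*<NN.*>}
    <NN.*>}{<.*>
    <.*>}{<DT>
    <NN.*>{}<NN.*>
''')

commented_tree = commented.parse(sent)
same_result = commented_tree == plain.parse(sent)
print(same_result)

# Printing the parser is a convenient way to review its rules;
# the exact layout of this output may vary between NLTK versions.
print(commented)
```

The descriptions only document the rules. They do not change how the sentence is chunked, so both parsers return the same tree.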
