Chunking and chinking with regular expressions

Using modified regular expressions, we can define chunk patterns. These are patterns of part-of-speech tags that define what kinds of words make up a chunk. We can also define patterns for what kinds of words should not be in a chunk. These unchunked words are known as chinks.

A ChunkRule class specifies what to include in a chunk, while a ChinkRule class specifies what to exclude from a chunk. In other words, chunking creates chunks, while chinking breaks up those chunks.

Getting ready

We first need to know how to define chunk patterns. These are modified regular expressions designed to match sequences of part-of-speech tags. An individual tag is specified by surrounding angle brackets, such as <NN> to match a noun tag. Multiple tags can then be combined, as in <DT><NN> to match a determiner followed by a noun. Regular expression syntax can be used within the angle brackets to match individual tag patterns, so you can do <NN.*> to match all nouns including NN and NNS. You can also use regular expression syntax outside of the angle brackets to match patterns of tags. <DT>?<NN.*>+ will match an optional determiner followed by one or more nouns. The chunk patterns are internally converted to regular expressions using the tag_pattern2re_pattern() function.

>>> from nltk.chunk.regexp import tag_pattern2re_pattern
>>> tag_pattern2re_pattern('<DT>?<NN.*>+')
'(<(DT)>)?(<(NN[^\{\}<>]*)>)+'

You don't have to use this function to do chunking, but it might be useful or interesting to see how your chunk patterns convert to regular expressions. This function is used by the RegexpParser class (explained in the next section) to convert chunk patterns into regular expressions to match chunking rules.
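
To get a feel for what the converted pattern does, it can be run directly against a string of tags with Python's re module. This is only a sketch: the regular expression is copied from the output above, and the tag string is written by hand in the <TAG> form used by the chunker.

```python
import re

# Regular expression copied from the tag_pattern2re_pattern() output above.
np_regex = re.compile(r'(<(DT)>)?(<(NN[^\{\}<>]*)>)+')

# Hand-written tag string for "the book has many chapters".
tag_string = '<DT><NN><VBZ><JJ><NNS>'

# Collect every non-overlapping stretch of tags that the pattern matches.
matches = [m.group(0) for m in np_regex.finditer(tag_string)]
print(matches)
```

The optional determiner and the one-or-more nouns show up as two separate matches here: one for <DT><NN> and one for the lone <NNS>.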

How to do it...

The pattern for specifying a chunk is to use surrounding curly braces, such as {<DT><NN>}. To specify a chink, you flip the braces, such as }<VB>{. These rules can be combined into a grammar for a particular phrase type. Here's a grammar for noun phrases that combines both a chunk and a chink pattern, along with the result of parsing the sentence "the book has many chapters":

>>> from nltk.chunk import RegexpParser
>>> chunker = RegexpParser(r'''
... NP:
...  {<DT><NN.*><.*>*<NN.*>}
...  }<VB.*>{
... ''')
>>> chunker.parse([('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])

The grammar tells the RegexpParser class that there are two rules for parsing NP chunks. The first chunk pattern says that a chunk starts with a determiner followed by any kind of noun. Then, any number of other words are allowed until a final noun is found. The second pattern says that verbs should be chinked, thus separating any large chunks that contain a verb. The result is a tree with two noun-phrase chunks: "the book" and "many chapters".

Note

Tagged sentences are always parsed into a Tree (found in the nltk.tree module). The top label of the Tree is S, which stands for sentence. Any chunks found will be subtrees whose labels refer to the chunk type. In this case, the chunk type is NP for noun-phrase chunks. A Tree can be drawn by calling its draw() method, as in t.draw().
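
Since chunks are just labeled subtrees, they can be pulled back out of the parse result. The following is a minimal sketch, assuming NLTK 3, where Tree provides label(), leaves(), and subtrees() with a filter argument. The variable names are illustrative.

```python
from nltk.chunk import RegexpParser

# Same noun-phrase grammar as in the recipe: one chunk rule, one chink rule.
chunker = RegexpParser(r'''
NP:
  {<DT><NN.*><.*>*<NN.*>}
  }<VB.*>{
''')

sentence = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
            ('many', 'JJ'), ('chapters', 'NNS')]
tree = chunker.parse(sentence)

# Keep only the NP subtrees and join the words of each one.
np_phrases = [
    ' '.join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda st: st.label() == 'NP')
]
print(np_phrases)
```

The top-level S tree is skipped by the filter, so only the chunk subtrees contribute phrases.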

How it works...

Here's what happens, step-by-step:

  1. The sentence is converted into a flat Tree.
  2. The Tree is used to create a ChunkString.
  3. The RegexpParser parses the grammar to create an NP RegexpChunkParser with the given rules.
  4. A ChunkRule is created and applied to the ChunkString, which matches the entire sentence into a chunk.
  5. A ChinkRule is created and applied to the same ChunkString, which splits the big chunk into two smaller chunks with a verb between them.
  6. The ChunkString is converted back to a Tree, now with two NP chunk subtrees.

You can do this yourself using the classes in nltk.chunk.regexp. The ChunkRule and ChinkRule classes are both subclasses of RegexpChunkRule, and require two arguments: the pattern and a description of the rule. ChunkString is an object that starts with a flat tree, which is then modified by each rule when it is passed into the rule's apply() method. A ChunkString is converted back to a Tree with the to_chunkstruct() method. Here's some code to demonstrate this:

>>> from nltk.chunk.regexp import ChunkString, ChunkRule, ChinkRule
>>> from nltk.tree import Tree
>>> t = Tree('S', [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
... ('many', 'JJ'), ('chapters', 'NNS')])
>>> cs = ChunkString(t)
>>> cs
<ChunkString: '<DT><NN><VBZ><JJ><NNS>'>
>>> ur = ChunkRule('<DT><NN.*><.*>*<NN.*>', 'chunk determiners and nouns')
>>> ur.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><VBZ><JJ><NNS>}'>
>>> ir = ChinkRule('<VB.*>', 'chink verbs')
>>> ir.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}<VBZ>{<JJ><NNS>}'>
>>> cs.to_chunkstruct()
Tree('S', [Tree('CHUNK', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('CHUNK', [('many', 'JJ'), ('chapters', 'NNS')])])

To see a tree diagram of the intermediate result at any step, call cs.to_chunkstruct().draw().
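
To see why the braces move the way they do, the same transformations can be imitated with plain re.sub() calls on a tag string. This is a simplified stand-in and not NLTK's implementation: the regular expressions are hand-written approximations of the two tag patterns, and the chink step assumes the verb sits strictly inside an existing chunk.

```python
import re

words = ['the', 'book', 'has', 'many', 'chapters']
tags = ['DT', 'NN', 'VBZ', 'JJ', 'NNS']

# Start with a flat string of tags, much like a fresh ChunkString.
flat = ''.join('<%s>' % tag for tag in tags)

# Chunk step: wrap whatever the pattern matches in curly braces.
# '<NN[^{}<>]*>' stands in for <NN.*>, and '(?:<[^{}<>]*>)*' for <.*>*.
chunk_re = re.compile(r'(<DT><NN[^{}<>]*>(?:<[^{}<>]*>)*<NN[^{}<>]*>)')
chunked = chunk_re.sub(r'{\1}', flat)

# Chink step: close the chunk before each verb and reopen it afterwards.
chink_re = re.compile(r'(<VB[^{}<>]*>)')
chinked = chink_re.sub(r'}\1{', chunked)

# Convert back: each {...} group becomes a ('CHUNK', [...]) entry,
# and every tag outside a chunk stays a plain (word, tag) pair.
tokens = iter(zip(words, tags))
result = []
for piece in re.findall(r'\{[^{}]*\}|<[^<>]*>', chinked):
    group = [next(tokens) for _ in range(piece.count('<'))]
    result.append(('CHUNK', group) if piece.startswith('{') else group[0])

print(flat)
print(chunked)
print(chinked)
print(result)
```

The three printed strings mirror the three ChunkString states shown above, and the final list has the same shape as the tree returned by to_chunkstruct().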

There's more...

You'll notice that the subtrees from the ChunkString class are tagged as CHUNK and not NP. That's because the rules mentioned earlier are phrase agnostic; they create chunks without needing to know what kind of chunks they are.

Internally, the RegexpParser class creates a RegexpChunkParser for each chunk phrase type. So, if you're only chunking NP phrases, there will only be one RegexpChunkParser. The RegexpChunkParser class gets all the rules for the specific chunk type, and handles applying the rules in order and converting the CHUNK trees to the specific chunk type, such as NP.

Here's some code to illustrate the usage of RegexpChunkParser. We pass both the rules mentioned earlier into the RegexpChunkParser class, and then parse the same sentence tree we created before. The resulting tree is just like what we got from applying both rules in order, except that CHUNK has been replaced with NP in both the subtrees. This is because RegexpChunkParser defaults to chunk_label='NP'.

>>> from nltk.chunk import RegexpChunkParser
>>> chunker = RegexpChunkParser([ur, ir])
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])

Parsing different chunk types

If you wanted to parse a different chunk type, then you could pass that in as chunk_label to RegexpChunkParser. Here's the same code that we saw in the previous section, but instead of NP subtrees, we'll call them CP for custom phrase:

>>> from nltk.chunk import RegexpChunkParser
>>> chunker = RegexpChunkParser([ur, ir], chunk_label='CP')
>>> chunker.parse(t)
Tree('S', [Tree('CP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('CP', [('many', 'JJ'), ('chapters', 'NNS')])])

The RegexpParser class does this internally when you specify multiple phrase types. This will be covered in the Partial parsing with regular expressions recipe.
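
The link between grammar strings and rule objects can also be made explicit. The sketch below assumes an NLTK 3 release in which RegexpChunkRule.fromstring() is available. That method is not used elsewhere in this recipe, so treat it as an assumption about your installed version. It builds the rule objects from the same brace notation used in a grammar, then hands them to RegexpChunkParser with a custom label.

```python
from nltk.chunk.regexp import RegexpChunkParser, RegexpChunkRule
from nltk.tree import Tree

t = Tree('S', [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
               ('many', 'JJ'), ('chapters', 'NNS')])

# One chunk rule and one chink rule, written in grammar notation.
rule_strings = ['{<DT><NN.*><.*>*<NN.*>}', '}<VB.*>{']
rules = [RegexpChunkRule.fromstring(s) for s in rule_strings]

# Label the resulting chunks CP instead of the default NP.
chunker = RegexpChunkParser(rules, chunk_label='CP')
tree = chunker.parse(t)

# subtrees() yields the root first, then each chunk subtree.
labels = [subtree.label() for subtree in tree.subtrees()]
print(tree)
print(labels)
```

Building rules from strings like this keeps the code independent of the individual rule class names.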

Parsing alternative patterns

The same parsing results can be obtained using two chunk patterns in the grammar and discarding the chink pattern:

>>> chunker = RegexpParser(r'''
... NP:
... {<DT><NN.*>}
... {<JJ><NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])

In fact, you could reduce the two chunk patterns into a single pattern.

>>> chunker = RegexpParser(r'''
... NP:
... {(<DT>|<JJ>)<NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])

How you create and combine patterns is really up to you. Pattern creation is a process of trial and error, and entirely depends on what your data looks like and which patterns are easiest to express.
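
One way to make the trial and error less painful is to run several candidate grammars over the same tagged sentences and compare the chunks they produce. The helper below is a hypothetical convenience function, not part of NLTK, and the second sentence is an invented example added only to show that grammars which agree on one sentence need not agree on another.

```python
from nltk.chunk import RegexpParser

grammars = {
    'chunk and chink': r'''
NP:
  {<DT><NN.*><.*>*<NN.*>}
  }<VB.*>{
''',
    'two chunk patterns': r'''
NP:
  {<DT><NN.*>}
  {<JJ><NN.*>}
''',
    'single pattern': r'''
NP:
  {(<DT>|<JJ>)<NN.*>}
''',
}

def np_chunks(grammar, tagged):
    """Return the leaves of every NP chunk that the grammar finds (hypothetical helper)."""
    tree = RegexpParser(grammar).parse(tagged)
    return [st.leaves() for st in tree.subtrees(filter=lambda st: st.label() == 'NP')]

book = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
        ('many', 'JJ'), ('chapters', 'NNS')]
dog = [('the', 'DT'), ('big', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]  # invented example

book_results = {name: np_chunks(g, book) for name, g in grammars.items()}
dog_results = {name: np_chunks(g, dog) for name, g in grammars.items()}

for name in grammars:
    print(name, book_results[name], dog_results[name])
```

All three grammars agree on the first sentence. On the second, the first grammar finds no chunk because its pattern requires a noun immediately after the determiner, while the other two chunk the adjective and noun.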

Chunk rule with context

You can also create chunk rules with a surrounding tag context. For example, the pattern <DT>{<NN>} is parsed into a ChunkRuleWithContext object. Context here means the tags outside the curly braces, such as <DT>: they must match, but they are not included in the chunk. In the phrase "the dog", the determiner "the" would be context for the noun "dog". Any time there's a tag on either side of the curly braces, you'll get a ChunkRuleWithContext instead of a ChunkRule. This lets you be more specific about when to create particular kinds of chunks.

Here's an example of using ChunkRuleWithContext directly. It takes four arguments: the left context, the pattern to chunk, the right context, and a description:

>>> from nltk.chunk.regexp import ChunkRuleWithContext
>>> ctx = ChunkRuleWithContext('<DT>', '<NN.*>', '<.*>', 'chunk nouns only after determiners')
>>> cs = ChunkString(t)
>>> cs
<ChunkString: '<DT><NN><VBZ><JJ><NNS>'>
>>> ctx.apply(cs)
>>> cs
<ChunkString: '<DT>{<NN>}<VBZ><JJ><NNS>'>
>>> cs.to_chunkstruct()
Tree('S', [('the', 'DT'), Tree('CHUNK', [('book', 'NN')]), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])

This example only chunks nouns that follow a determiner, therefore ignoring the noun that follows an adjective. Here's how it would look using the RegexpParser class:

>>> chunker = RegexpParser(r'''
... NP:
... <DT>{<NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [('the', 'DT'), Tree('NP', [('book', 'NN')]), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
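
Note that the two examples differ in their right context: the ChunkRuleWithContext call uses <.*>, which requires some tag to follow the noun, while the grammar leaves the right context empty. The effect of that difference can be sketched with plain regular expressions. This is an approximation written for illustration, not NLTK's own code, and chunk_with_context() is a hypothetical helper.

```python
import re

ANY_TAG = r'<[^{}<>]*>'      # stands in for <.*>
NOUN = r'<NN[^{}<>]*>'       # stands in for <NN.*>

def chunk_with_context(left, chunk, right, tag_string):
    """Wrap the chunk part in braces only where left and right context also match."""
    pattern = re.compile(
        '(?P<left>%s)(?P<chunk>%s)(?P<right>%s)' % (left, chunk, right))
    return pattern.sub(r'\g<left>{\g<chunk>}\g<right>', tag_string)

# A noun after a determiner, with another tag following it.
mid_sentence = chunk_with_context('<DT>', NOUN, ANY_TAG, '<DT><NN><VBZ><JJ><NNS>')

# "the dog": nothing follows the noun, so a required right context fails...
end_required = chunk_with_context('<DT>', NOUN, ANY_TAG, '<DT><NN>')

# ...while an empty right context still lets the noun be chunked.
end_optional = chunk_with_context('<DT>', NOUN, '', '<DT><NN>')

print(mid_sentence)
print(end_required)
print(end_optional)
```

In this sketch, a required right context leaves a sentence-final noun unchunked, so it is worth deciding deliberately whether the right context should be empty.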

See also

In the next recipe, we'll cover merging and splitting chunks.
