Using modified regular expressions, we can define chunk patterns. These are patterns of part-of-speech tags that define what kinds of words make up a chunk. We can also define patterns for what kinds of words should not be in a chunk. These unchunked words are known as chinks.
A ChunkRule
class specifies what to include in a chunk, while a ChinkRule
class specifies what to exclude from a chunk. In other words, chunking creates chunks, while chinking breaks up those chunks.
We first need to know how to define chunk patterns. These are modified regular expressions designed to match sequences of part-of-speech tags. An individual tag is specified by surrounding angle brackets, such as <NN>
to match a noun tag. Multiple tags can then be combined, as in <DT><NN>
to match a determiner followed by a noun. Regular expression syntax can be used within the angle brackets to match individual tag patterns, so you can do <NN.*>
to match all nouns including NN
and NNS
. You can also use regular expression syntax outside of the angle brackets to match patterns of tags. <DT>?<NN.*>+
will match an optional determiner followed by one or more nouns. The chunk patterns are internally converted to regular expressions using the tag_pattern2re_pattern()
function.
>>> from nltk.chunk.regexp import tag_pattern2re_pattern
>>> tag_pattern2re_pattern('<DT>?<NN.*>+')
'(<(DT)>)?(<(NN[^\{\}<>]*)>)+'
You don't have to use this function to do chunking, but it might be useful or interesting to see how your chunk patterns convert to regular expressions. This function is used by the RegexpParser
class (explained in the next section) to convert chunk patterns into regular expressions to match chunking rules.
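To get a feel for what the converted expression matches, you can try it with Python's standard re module. The following is a minimal sketch, not something NLTK requires you to do: it assumes the tags of the sentence have been written out as one string of angle-bracketed tags (the tag_string name is made up for this illustration) and reuses the regular expression shown above.

```python
import re

# Regular expression reported above for the tag pattern '<DT>?<NN.*>+'.
pattern = re.compile(r'(<(DT)>)?(<(NN[^\{\}<>]*)>)+')

# Assumption for this sketch: the tags of "the book has many chapters"
# written out as a single string of angle-bracketed tags.
tag_string = '<DT><NN><VBZ><JJ><NNS>'

# Collect every stretch of tags that the pattern matches.
matches = [m.group(0) for m in pattern.finditer(tag_string)]
print(matches)
```

The optional determiner is picked up when it is present, and a noun without a determiner still matches on its own.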
The pattern for specifying a chunk is to use surrounding curly braces, such as {<DT><NN>}
. To specify a chink, you flip the braces, such as }<VB>{
. These rules can be combined into a grammar for a particular phrase type. Here's a grammar for noun phrases that combines both a chunk and a chink pattern, along with the result of parsing the sentence the book has many chapters
:
>>> from nltk.chunk import RegexpParser
>>> chunker = RegexpParser(r'''
... NP:
...     {<DT><NN.*><.*>*<NN.*>}
...     }<VB.*>{
... ''')
>>> chunker.parse([('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
The grammar tells the RegexpParser
class that there are two rules for parsing NP
chunks. The first chunk pattern says that a chunk starts with a determiner followed by any kind of noun. Then, any number of other words are allowed until a final noun is found. The second pattern says that verbs should be chinked, thus separating any large chunks that contain a verb. The result is a tree with two noun-phrase chunks: the book
and many chapters
.
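The effect of the two rules can be pictured with plain string substitution. This is only a simplified sketch under stated assumptions, not how NLTK implements its rules: the tag patterns have been translated to ordinary regular expressions by hand (using [^<>]* in place of .* so that a match cannot run past a tag boundary), the names chunk_re and chink_re are made up for the illustration, and the chink step here does not check that the verb is actually inside a chunk.

```python
import re

# Tags of "the book has many chapters" as one string (illustration only).
tags = '<DT><NN><VBZ><JJ><NNS>'

# Hand-translated version of the chunk pattern <DT><NN.*><.*>*<NN.*>.
chunk_re = re.compile(r'<DT><NN[^<>]*>(<[^<>]*>)*<NN[^<>]*>')
# Hand-translated version of the chink pattern <VB.*>.
chink_re = re.compile(r'<VB[^<>]*>')

# Chunking wraps each match in curly braces.
chunked = chunk_re.sub(lambda m: '{' + m.group(0) + '}', tags)

# Chinking wraps each match in flipped braces, which splits the chunk.
chinked = chink_re.sub(lambda m: '}' + m.group(0) + '{', chunked)

print(chunked)
print(chinked)
```

The first substitution swallows the whole sentence, and the second one carves the verb back out, leaving a chunk on either side of it.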
Tagged sentences are always parsed into a Tree
(found in the nltk.tree
module). The top label of the Tree
is S
, which stands for sentence. Any chunks found will be subtrees whose labels will refer to the chunk type. In this case, the chunk type is NP
for noun-phrase chunks. A tree can be drawn by calling its draw()
method, as in t.draw()
.
Here's what happens, step-by-step:
1. The sentence is converted into a flat Tree.
2. The Tree is used to create a ChunkString.
3. RegexpParser parses the grammar to create an NP RegexpChunkParser with the given rules.
4. A ChunkRule is created and applied to the ChunkString, which matches the entire sentence into a chunk.
5. A ChinkRule is created and applied to the same ChunkString, which splits the big chunk into two smaller chunks with a verb between them.
6. The ChunkString is converted back to a Tree, now with two NP chunk subtrees.
You can do this yourself using the classes in nltk.chunk.regexp
. The ChunkRule
and ChinkRule
classes are both subclasses of RegexpChunkRule
, and require two arguments: the pattern and a description of the rule. ChunkString
is an object that starts with a flat tree, which is then modified by each rule when it is passed into the rule's apply()
method. A ChunkString
is converted back to a Tree
with the to_chunkstruct()
method. Here's some code to demonstrate this:
>>> from nltk.chunk.regexp import ChunkString, ChunkRule, ChinkRule
>>> from nltk.tree import Tree
>>> t = Tree('S', [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
>>> cs = ChunkString(t)
>>> cs
<ChunkString: '<DT><NN><VBZ><JJ><NNS>'>
>>> ur = ChunkRule('<DT><NN.*><.*>*<NN.*>', 'chunk determiners and nouns')
>>> ur.apply(cs)
>>> cs
<ChunkString: '{<DT><NN><VBZ><JJ><NNS>}'>
>>> ir = ChinkRule('<VB.*>', 'chink verbs')
>>> ir.apply(cs)
>>> cs
<ChunkString: '{<DT><NN>}<VBZ>{<JJ><NNS>}'>
>>> cs.to_chunkstruct()
Tree('S', [Tree('CHUNK', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('CHUNK', [('many', 'JJ'), ('chapters', 'NNS')])])
The tree diagrams shown earlier can be drawn at each step by calling cs.to_chunkstruct().draw()
.
You'll notice that the subtrees from the ChunkString
class are tagged as CHUNK
and not NP
. That's because the rules mentioned earlier are phrase agnostic; they create chunks without needing to know what kind of chunks they are.
Internally, the RegexpParser
class creates a RegexpChunkParser
for each chunk phrase type. So, if you're only chunking NP
phrases, there will only be one RegexpChunkParser
. The RegexpChunkParser
class gets all the rules for the specific chunk type, and handles applying the rules in order and converting the CHUNK
trees to the specific chunk type, such as NP
.
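If it helps to picture that last conversion step, here is a rough sketch of turning a bracketed tag string back into grouped words with a chosen label. It is an illustration only and makes several assumptions: to_structure is a hypothetical helper, not an NLTK function, and it returns plain Python tuples and lists instead of Tree objects.

```python
import re

def to_structure(bracketed, tagged_words, label='CHUNK'):
    """Group tagged words according to a bracketed tag string (sketch)."""
    result = []      # top-level items: plain words and (label, words) chunks
    current = None   # words of the chunk being built, or None outside a chunk
    words = iter(tagged_words)
    for token in re.findall(r'[{}]|<[^<>]+>', bracketed):
        if token == '{':
            current = []                      # a chunk opens
        elif token == '}':
            result.append((label, current))   # a chunk closes
            current = None
        else:
            word = next(words)                # each tag consumes one word
            (result if current is None else current).append(word)
    return result

sentence = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
            ('many', 'JJ'), ('chapters', 'NNS')]

# The same bracketed string, labeled generically and then as NP.
generic = to_structure('{<DT><NN>}<VBZ>{<JJ><NNS>}', sentence)
labeled = to_structure('{<DT><NN>}<VBZ>{<JJ><NNS>}', sentence, label='NP')
print(labeled)
```

The grouping is identical in both calls; only the label attached to each group changes, which is why the rules themselves never need to know the phrase type.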
Here's some code to illustrate the usage of RegexpChunkParser
. We pass both the rules mentioned earlier into the RegexpChunkParser
class, and then parse the same sentence tree we created before. The resulting tree is just like what we got from applying both rules in order, except that CHUNK
has been replaced with NP
in both the subtrees. This is because RegexpChunkParser
defaults to chunk_label='NP'
.
>>> from nltk.chunk import RegexpChunkParser
>>> chunker = RegexpChunkParser([ur, ir])
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
If you wanted to parse a different chunk type, then you could pass that in as chunk_label
to RegexpChunkParser
. Here's the same code that we saw in the previous section, but instead of NP
subtrees, we'll call them CP
for custom phrase:
>>> from nltk.chunk import RegexpChunkParser
>>> chunker = RegexpChunkParser([ur, ir], chunk_label='CP')
>>> chunker.parse(t)
Tree('S', [Tree('CP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('CP', [('many', 'JJ'), ('chapters', 'NNS')])])
The RegexpParser
class does this internally when you specify multiple phrase types. This will be covered in the Partial parsing with regular expressions recipe.
The same parsing results can be obtained using two chunk patterns in the grammar and discarding the chink pattern:
>>> chunker = RegexpParser(r'''
... NP:
...     {<DT><NN.*>}
...     {<JJ><NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
In fact, you could reduce the two chunk patterns into a single pattern.
>>> chunker = RegexpParser(r'''
... NP:
...     {(<DT>|<JJ>)<NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
How you create and combine patterns is really up to you. Pattern creation is a process of trial and error, and entirely depends on what your data looks like and which patterns are easiest to express.
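One quick way to experiment is to try candidate patterns against a string of tags before putting them in a grammar. The sketch below is a simplified stand-in for what the chunker does, using only the standard re module: the patterns are translated by hand (with [^<>]* in place of .*), and the wrap helper is a made-up name for this illustration.

```python
import re

tags = '<DT><NN><VBZ><JJ><NNS>'

def wrap(pattern, tag_string):
    """Wrap every match of pattern in curly braces (illustrative helper)."""
    return re.sub(pattern, lambda m: '{' + m.group(0) + '}', tag_string)

# Two separate patterns, applied one after the other.
two_step = wrap(r'<JJ><NN[^<>]*>', wrap(r'<DT><NN[^<>]*>', tags))

# The same idea expressed as a single pattern with alternation.
one_step = wrap(r'(<DT>|<JJ>)<NN[^<>]*>', tags)

print(two_step)
print(one_step)
```

Both approaches produce the same bracketed string for this sentence, so the choice comes down to which form you find easier to read and maintain.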
You can also create chunk rules with a surrounding tag context. For example, if your pattern is <DT>{<NN>}
, that will be parsed into a ChunkRuleWithContext
class. Here, context refers to the parts of the rule that are not chinks or chunks, such as <DT>
. For example, in the phrase the dog
, the
would be context to the noun dog
. Any time there's a tag on either side of the curly braces, you'll get a ChunkRuleWithContext
class instead of a ChunkRule
class. This can allow you to be more specific about when to parse particular kinds of chunks.
Here's an example of using ChunkRuleWithContext
directly. It takes four arguments: the left context, the pattern to chunk, the right context, and a description:
>>> from nltk.chunk.regexp import ChunkRuleWithContext
>>> ctx = ChunkRuleWithContext('<DT>', '<NN.*>', '<.*>', 'chunk nouns only after determiners')
>>> cs = ChunkString(t)
>>> cs
<ChunkString: '<DT><NN><VBZ><JJ><NNS>'>
>>> ctx.apply(cs)
>>> cs
<ChunkString: '<DT>{<NN>}<VBZ><JJ><NNS>'>
>>> cs.to_chunkstruct()
Tree('S', [('the', 'DT'), Tree('CHUNK', [('book', 'NN')]), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
This example only chunks nouns that follow a determiner, therefore ignoring the noun that follows an adjective. Here's how it would look using the RegexpParser
class:
>>> chunker = RegexpParser(r'''
... NP:
...     <DT>{<NN.*>}
... ''')
>>> chunker.parse(t)
Tree('S', [('the', 'DT'), Tree('NP', [('book', 'NN')]), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])
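To see why only the noun after the determiner is chunked, it can help to imitate the context rule with an ordinary regular expression. This is a sketch under assumptions rather than NLTK's own code: the left context, chunk, and right context are written as three named groups (the group names are chosen for this illustration), and only the middle group is wrapped in braces.

```python
import re

tags = '<DT><NN><VBZ><JJ><NNS>'

# Left context, chunk, and right context as three named groups.
# The patterns are hand-translated, with [^<>]* standing in for .*
context_re = re.compile(
    r'(?P<left><DT>)'           # must be preceded by a determiner
    r'(?P<chunk><NN[^<>]*>)'    # the noun to chunk
    r'(?P<right><[^<>]*>)'      # must be followed by some tag
)

# Put braces around the chunk group only, leaving the context untouched.
result = context_re.sub(r'\g<left>{\g<chunk>}\g<right>', tags)
print(result)
```

The context tags take part in the match but stay outside the braces, so the noun that follows the adjective is never considered.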