So far, we've only been parsing noun phrases. But RegexpParser supports grammars with multiple phrase types, such as verb phrases and prepositional phrases. We can put the rules we've learned to use and define a grammar that can be evaluated against the conll2000 corpus, which has NP, VP, and PP phrases.
Now, we will define a grammar to parse three phrase types. For noun phrases, we have a ChunkRule that looks for an optional determiner followed by one or more nouns, plus a MergeRule for adding an adjective to the front of a noun chunk. For prepositional phrases, we simply chunk any IN word, such as in or on. For verb phrases, we chunk an optional modal word (such as should) followed by a verb.
>>> chunker = RegexpParser(r'''
... NP:
...     {<DT>?<NN.*>+}  # chunk optional determiner with nouns
...     <JJ>{}<NN.*>    # merge adjective with noun chunk
... PP:
...     {<IN>}          # chunk preposition
... VP:
...     {<MD>?<VB.*>}   # chunk optional modal with verb
... ''')
>>> from nltk.corpus import conll2000
>>> score = chunker.evaluate(conll2000.chunked_sents())
>>> score.accuracy()
0.6148573545757688
When we call evaluate() on the chunker, we give it a list of chunked sentences and get back a ChunkScore object, which can give us the accuracy of the chunker along with a number of other metrics.
The RegexpParser class parses the grammar string into sets of rules, one set for each phrase type. These rule sets are then used to create a RegexpChunkParser for each phrase type. The individual rules are parsed using RegexpChunkRule.fromstring(), which returns one of five subclasses: ChunkRule, ChinkRule, MergeRule, SplitRule, or ChunkRuleWithContext.
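You can see this dispatch for yourself by calling RegexpChunkRule.fromstring() on individual rule strings from the grammar above; a small sketch:

```python
from nltk.chunk.regexp import RegexpChunkRule

# A {pattern} rule becomes a ChunkRule; a trailing comment is
# kept as the rule's description.
rule = RegexpChunkRule.fromstring(
    '{<DT>?<NN.*>+} # chunk optional determiner with nouns')
print(type(rule).__name__)   # ChunkRule

# A left{}right rule becomes a MergeRule.
merge = RegexpChunkRule.fromstring('<JJ>{}<NN.*>')
print(type(merge).__name__)  # MergeRule
```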
Now that the grammar has been translated into sets of rules, these rules are used to parse a tagged sentence into a Tree structure. The RegexpParser class inherits from ChunkParserI, which provides a parse() method to parse the tagged words. Whenever a part of the tagged tokens matches a chunk rule, a subtree is constructed so that the tagged tokens become the leaves of a Tree whose label is the chunk tag. The ChunkParserI interface also provides the evaluate() method, which compares the given chunked sentences to the output of the parse() method to construct and return a ChunkScore object.
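To see what parse() produces directly, here is a small sketch that runs the grammar defined earlier on a hand-tagged sentence (the sentence itself is made up for illustration):

```python
from nltk.chunk import RegexpParser

chunker = RegexpParser(r'''
NP:
    {<DT>?<NN.*>+}  # chunk optional determiner with nouns
    <JJ>{}<NN.*>    # merge adjective with noun chunk
PP:
    {<IN>}          # chunk preposition
VP:
    {<MD>?<VB.*>}   # chunk optional modal with verb
''')

tagged = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
          ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
tree = chunker.parse(tagged)
print(tree)
# (S (NP the/DT cat/NN) (VP sat/VBD) (PP on/IN) (NP the/DT mat/NN))
```

Each phrase type's rules are applied in turn, so the two noun groups become NP subtrees, the verb a VP, and the preposition a PP, all under the root label S.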
You can also evaluate this chunker on the treebank_chunk corpus:
>>> from nltk.corpus import treebank_chunk
>>> treebank_score = chunker.evaluate(treebank_chunk.chunked_sents())
>>> treebank_score.accuracy()
0.49033970276008493
The treebank_chunk corpus is a special version of the treebank corpus that provides a chunked_sents() method. The regular treebank corpus cannot provide that method due to its file format.
The ChunkScore object provides a few other metrics besides accuracy. Of the chunks the chunker was able to guess, precision tells you how many were correct, while recall tells you how many of the chunks that actually existed the chunker managed to find. For more about precision and recall, see https://en.wikipedia.org/wiki/Precision_and_recall.
>>> score.precision()
0.60201948127375
>>> score.recall()
0.606072502505847
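Both metrics reduce to simple set arithmetic. Here is a library-free sketch using hypothetical chunk sets, each chunk identified by a (start, end) position pair (this is an illustration, not NLTK's actual implementation, which compares chunks per sentence):

```python
# Hypothetical chunk sets, identified by (start, end) token positions.
guessed = {(0, 2), (2, 3), (3, 4), (4, 6)}  # chunks the chunker found
correct = {(0, 2), (2, 3), (4, 7)}          # chunks in the gold standard

true_positives = guessed & correct          # chunks guessed correctly

precision = len(true_positives) / len(guessed)  # fraction of guesses that were right
recall = len(true_positives) / len(correct)     # fraction of gold chunks found

print(precision)  # 0.5
print(recall)     # 0.666...
```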
You can also get lists of chunks that were missed by the chunker, chunks that were incorrectly found, correct chunks, and the total guessed chunks. These can be useful to figure out how to improve your chunk grammar:
>>> len(score.missed())
47161
>>> len(score.incorrect())
47967
>>> len(score.correct())
119720
>>> len(score.guessed())
120526
As you can see by the number of incorrect chunks, and by comparing guessed() and correct(), our chunker guessed that there were more chunks than actually existed. And it also missed a good number of correct chunks.
If you want to apply the chunk rules in your grammar more than once, you can pass loop=2 into RegexpParser at initialization. The default is loop=1, which applies each rule once. Since a chunk can change after every rule application, it can sometimes make sense to re-apply the same rules multiple times.
To watch an internal trace of the chunking process, pass trace=1 into RegexpParser. To get even more output, pass in trace=2. This will give you a printout of what the chunker is doing as it is doing it. Rule comments/descriptions are included in the trace output, giving you a good idea of which rule is applied when.
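A minimal sketch of tracing in action, reusing the NP rule from the grammar above:

```python
from nltk.chunk import RegexpParser

chunker = RegexpParser(r'''
NP:
    {<DT>?<NN.*>+}  # chunk optional determiner with nouns
''', trace=1)

# Each rule application is printed to stdout as it happens, annotated
# with the rule's comment; the parse tree is returned as usual.
tree = chunker.parse([('the', 'DT'), ('dog', 'NN')])
```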