Partial parsing with regular expressions

So far, we've only been parsing noun phrases. But RegexpParser supports grammars with multiple phrase types, such as verb phrases and prepositional phrases. We can put the rules we've learned to use and define a grammar that can be evaluated against the conll2000 corpus, which has NP, VP, and PP phrases.

How to do it...

Now, we will define a grammar to parse three phrase types. For noun phrases, we have a ChunkRule class that looks for an optional determiner followed by one or more nouns. We then have a MergeRule class for adding an adjective to the front of a noun chunk. For prepositional phrases, we simply chunk any IN word, such as in or on. For verb phrases, we chunk an optional modal word (such as should) followed by a verb.

Note

Each grammar rule is followed by a # comment. This comment is passed into each rule as the description. Comments are optional, but they can be helpful notes for understanding what the rule does, and will be included in trace output.

>>> from nltk.chunk import RegexpParser
>>> chunker = RegexpParser(r'''
... NP:
... {<DT>?<NN.*>+}  # chunk optional determiner with nouns
... <JJ>{}<NN.*>  # merge adjective with noun chunk
... PP:
... {<IN>}      # chunk preposition
... VP:
... {<MD>?<VB.*>}  # chunk optional modal with verb
... ''')
>>> from nltk.corpus import conll2000
>>> score = chunker.evaluate(conll2000.chunked_sents())
>>> score.accuracy()
0.6148573545757688

When we call evaluate() on the chunker, we give it a list of chunked sentences and get back a ChunkScore object, which can give us the accuracy of the chunker along with a number of other metrics.

How it works...

The RegexpParser class parses the grammar string into sets of rules, one set of rules for each phrase type. These rules are used to create a RegexpChunkParser for each phrase type. The rules themselves are parsed using RegexpChunkRule.fromstring(), which returns one of five subclasses: ChunkRule, ChinkRule, MergeRule, SplitRule, or ChunkRuleWithContext.
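You can see this for yourself by parsing a single rule line with RegexpChunkRule.fromstring(). This is just a quick sketch; the exact repr of the returned rule may differ slightly between NLTK versions:

>>> from nltk.chunk.regexp import RegexpChunkRule
>>> rule = RegexpChunkRule.fromstring('{<DT>?<NN.*>+}  # chunk optional determiner with nouns')
>>> rule
<ChunkRule: '<DT>?<NN.*>+'>
>>> rule.descr()
'chunk optional determiner with nouns'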

Now that the grammar has been translated into sets of rules, these rules are used to parse a tagged sentence into a Tree structure. The RegexpParser class inherits from ChunkParserI, which provides a parse() method to parse tagged words. Whenever part of the tagged token sequence matches a chunk rule, a subtree is constructed so that the matching tagged tokens become the leaves of a Tree whose label is the chunk tag. The ChunkParserI interface also provides the evaluate() method, which compares the given chunked sentences to the output of the parse() method to construct and return a ChunkScore object.
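If you want to see what parse() produces for a single sentence, you can call it directly with a list of tagged words. The following is a minimal sketch using a hand-tagged example sentence (the sentence is made up for illustration, and the exact Tree repr may vary slightly between NLTK versions):

>>> chunker.parse([('the', 'DT'), ('book', 'NN'), ('is', 'VBZ'),
...                ('on', 'IN'), ('the', 'DT'), ('table', 'NN')])
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), Tree('VP', [('is', 'VBZ')]), Tree('PP', [('on', 'IN')]), Tree('NP', [('the', 'DT'), ('table', 'NN')])])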

There's more...

You can also evaluate this chunker on the treebank_chunk corpus:

>>> from nltk.corpus import treebank_chunk
>>> treebank_score = chunker.evaluate(treebank_chunk.chunked_sents())
>>> treebank_score.accuracy()
0.49033970276008493

The treebank_chunk corpus is a special version of the treebank corpus that provides a chunked_sents() method. The regular treebank corpus cannot provide that method due to its file format.

The ChunkScore metrics

The ChunkScore object provides a few other metrics besides accuracy. Precision tells you what proportion of the chunks the chunker guessed were correct, and recall tells you what proportion of the correct chunks the chunker actually found. For more about precision and recall, see https://en.wikipedia.org/wiki/Precision_and_recall.

>>> score.precision()
0.60201948127375
>>> score.recall()
0.606072502505847

You can also get lists of chunks that were missed by the chunker, chunks that were incorrectly found, correct chunks, and the total guessed chunks. These can be useful to figure out how to improve your chunk grammar:

>>> len(score.missed())
47161
>>> len(score.incorrect())
47967
>>> len(score.correct())
119720
>>> len(score.guessed())
120526

As you can see from the number of incorrect chunks, and by comparing guessed() with correct(), our chunker guessed more chunks than actually exist, and it also missed a good number of correct chunks.

Looping and tracing chunk rules

If you want to apply the chunk rules in your grammar more than once, you can pass loop=2 into RegexpParser at initialization. The default is loop=1, which will apply each rule once. Since a chunk can change after every rule application, it may sometimes make sense to re-apply the same rules multiple times.

To watch an internal trace of the chunking process, pass trace=1 into RegexpParser. To get even more output, pass in trace=2. This will give you a printout of what the chunker is doing as it is doing it. Rule comments/descriptions will be included in the trace output, giving you a good idea of which rule is applied when.
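Both options are keyword arguments to the RegexpParser constructor. As a minimal sketch, assuming grammar holds the same grammar string used earlier in this recipe:

>>> looping_chunker = RegexpParser(grammar, loop=2)   # apply every rule twice
>>> tracing_chunker = RegexpParser(grammar, trace=2)  # print a verbose trace during parse()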

See also

If coming up with regular expression chunk patterns seems like too much work, then read the next recipes, where we'll cover how to train a chunker based on a corpus of chunked sentences.
