As you've seen in previous recipes, parse trees often have a variety of Tree label types that are not present in chunk trees. If you want to use parse trees to train a chunker, then you'll probably want to reduce this variety by converting some of these tree labels to more common label types.
First, we have to decide which Tree labels need to be converted. Let's take a look at that first Tree again:
Immediately, you can see that there are two alternative NP subtrees: NP-SBJ and NP-TMP. Let's convert both of those to NP. The mapping will be as follows:
Original Label | New Label |
---|---|
NP-SBJ | NP |
NP-TMP | NP |
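If a corpus contains more NP variants than the two listed in the table, writing the mapping by hand gets tedious. The following is a minimal sketch of one way to generate it, assuming you already have a collection of label strings. The helper name build_label_mapping() and the sample label list are hypothetical and are not part of transforms.py:

```python
def build_label_mapping(labels, base='NP'):
    """Map every label of the form '<base>-<TAG>' to the bare base label.

    Hypothetical helper for illustration; it is not part of transforms.py.
    """
    prefix = base + '-'
    return {label: base for label in labels if label.startswith(prefix)}


# Sample labels chosen for illustration only.
labels = ['S', 'NP', 'NP-SBJ', 'NP-TMP', 'PP-CLR', 'VP']
mapping = build_label_mapping(labels)
print(mapping)  # {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
```

Labels that do not start with the NP- prefix, such as PP-CLR, are left out of the mapping and therefore stay unchanged when the mapping is applied.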
In transforms.py is the function convert_tree_labels(). It takes two arguments: the Tree to convert and a label conversion mapping. It returns a new Tree with all matching labels replaced based on the values in the mapping:
from nltk.tree import Tree

def convert_tree_labels(tree, mapping):
    children = []

    for t in tree:
        if isinstance(t, Tree):
            # Subtrees are converted recursively.
            children.append(convert_tree_labels(t, mapping))
        else:
            # Leaves (tokens) are kept as they are.
            children.append(t)

    # Labels missing from the mapping are left unchanged.
    label = mapping.get(tree.label(), tree.label())
    return Tree(label, children)
Using the mapping table we saw earlier, we can pass it in as a dict to convert_tree_labels() and convert the first parsed sentence from treebank:
>>> from nltk.corpus import treebank
>>> from transforms import convert_tree_labels
>>> mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
>>> convert_tree_labels(treebank.parsed_sents()[0], mapping)
Tree('S', [Tree('NP', [Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]), Tree(',', [',']), Tree('ADJP', [Tree('NP', [Tree('CD', ['61']), Tree('NNS', ['years'])]), Tree('JJ', ['old'])]), Tree(',', [','])]), Tree('VP', [Tree('MD', ['will']), Tree('VP', [Tree('VB', ['join']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['board'])]), Tree('PP-CLR', [Tree('IN', ['as']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])])]), Tree('NP', [Tree('NNP', ['Nov.']), Tree('CD', ['29'])])])]), Tree('.', ['.'])])
As you can see in the following diagram, the NP-* subtrees have been replaced with NP subtrees:
The convert_tree_labels() function recursively converts every child subtree using the mapping. The Tree is then rebuilt with the converted labels and children until the entire Tree has been converted.
The result is a brand new Tree instance with new subtrees whose labels have been converted.
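To make the "brand new Tree" point concrete, here is a small sketch that compares the labels of a tree before and after conversion. It assumes NLTK 3 (for Tree.fromstring() and Tree.label()), repeats the function body so that it runs on its own, and uses a made-up toy tree instead of a treebank sentence:

```python
from nltk.tree import Tree


def convert_tree_labels(tree, mapping):
    # Same recursion as the recipe's function, repeated so this sketch is standalone.
    children = []
    for t in tree:
        if isinstance(t, Tree):
            children.append(convert_tree_labels(t, mapping))
        else:
            children.append(t)
    label = mapping.get(tree.label(), tree.label())
    return Tree(label, children)


# Hypothetical toy tree for illustration; it is not taken from the treebank corpus.
original = Tree.fromstring(
    '(S (NP-SBJ (NNP Pierre)) (VP (VB join) (NP-TMP (NNP Nov.))))')
mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
converted = convert_tree_labels(original, mapping)

# subtrees() walks the tree in pre-order, so the two lists line up position by position.
original_labels = [st.label() for st in original.subtrees()]
converted_labels = [st.label() for st in converted.subtrees()]

print(original_labels)   # the input tree still carries NP-SBJ and NP-TMP
print(converted_labels)  # the returned tree carries plain NP in those positions
print(converted is original)  # False: a separate Tree object is returned
```

Because the input Tree is never modified, you can keep the original parse tree around and derive several differently mapped trees from it.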
The previous two recipes cover different methods of flattening a parse Tree, both of which can produce subtrees that may require mapping before using them to train a chunker. Chunker training is covered in the Training a tagger-based chunker recipe in Chapter 5, Extracting Chunks.