Some of the included corpora contain parsed sentences, which are often deep trees of nested phrases. Unfortunately, these trees are too deep to use for training a chunker, since the IOB tag format cannot represent nested chunks. To make these trees usable for chunker training, we must flatten them.
We're going to use the first parsed sentence of the treebank corpus as our example. Here's a diagram showing how deeply nested this tree is:
You may notice that the part-of-speech tags are part of the tree structure instead of being included with the word. This will be handled later using the Tree.pos() method, which was designed specifically for combining words with preterminal Tree labels such as part-of-speech tags.
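To see this pairing in isolation before using it for flattening, here's a minimal sketch on a small hand-built tree (the sentence fragment is made up for illustration):

```python
from nltk.tree import Tree

# In a parse tree, the POS tag 'NNP' is a preterminal Tree label,
# not attached directly to the word.
sent = Tree('S', [Tree('NP', [Tree('NNP', ['Pierre']),
                              Tree('NNP', ['Vinken'])])])

# pos() walks the tree and pairs each word with its preterminal label
print(sent.pos())  # [('Pierre', 'NNP'), ('Vinken', 'NNP')]
```

Note that pos() discards everything about the tree structure except the word/tag pairs, which is exactly what we want for the lowest-level trees.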
In transforms.py is a function named flatten_deeptree(). It takes a single Tree and returns a new Tree that keeps only the lowest-level trees. It uses a helper function, flatten_childtrees(), to do most of the work:
from nltk.tree import Tree

def flatten_childtrees(trees):
    children = []

    for t in trees:
        if t.height() < 3:
            children.extend(t.pos())
        elif t.height() == 3:
            children.append(Tree(t.label(), t.pos()))
        else:
            children.extend(flatten_childtrees([c for c in t]))

    return children

def flatten_deeptree(tree):
    return Tree(tree.label(), flatten_childtrees([c for c in tree]))
We can use it on the first parsed sentence of the treebank corpus to get a flatter tree:
>>> from nltk.corpus import treebank
>>> from transforms import flatten_deeptree
>>> flatten_deeptree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), Tree('NP', [('61', 'CD'), ('years', 'NNS')]), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), Tree('NP', [('the', 'DT'), ('board', 'NN')]), ('as', 'IN'), Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]), Tree('NP-TMP', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])
The result is a much flatter Tree that only includes NP phrases. Words that are not part of an NP phrase are separated. This flatter tree is shown in the following diagram:
This Tree is quite similar to the first chunk Tree from the treebank_chunk corpus. The main difference is that the rightmost NP Tree is separated into two subtrees above, one of them named NP-TMP.
The first tree from treebank_chunk is shown in the following diagram for comparison. The main difference is the right side of the tree, which has only one NP subtree instead of two:
The solution is composed of two functions: flatten_deeptree() returns a new Tree from the given tree by calling flatten_childtrees() on each of the given tree's children.
The flatten_childtrees() function is a recursive function that drills down into the Tree until it finds child trees whose height() is equal to or less than 3. A Tree whose height() is less than 3 looks like this:
>>> from nltk.tree import Tree
>>> Tree('NNP', ['Pierre']).height()
2
These short trees are converted into lists of tuples using the pos() function:
>>> Tree('NNP', ['Pierre']).pos()
[('Pierre', 'NNP')]
Trees whose height() is equal to 3 are the lowest-level trees that we're interested in keeping. These trees look like this:
>>> Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]).height()
3
And when we call pos() on that tree, we get:
>>> Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]).pos()
[('Pierre', 'NNP'), ('Vinken', 'NNP')]
The recursive nature of flatten_childtrees() eliminates all trees whose height is greater than 3.
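To make the recursion concrete, here's a small self-contained sketch that applies the recipe's functions (repeated from transforms.py) to a hand-built tree of height 5; the toy one-word sentence is made up for illustration:

```python
from nltk.tree import Tree

# The recipe's functions, repeated from transforms.py so this
# sketch is self-contained.
def flatten_childtrees(trees):
    children = []
    for t in trees:
        if t.height() < 3:
            children.extend(t.pos())
        elif t.height() == 3:
            children.append(Tree(t.label(), t.pos()))
        else:
            children.extend(flatten_childtrees([c for c in t]))
    return children

def flatten_deeptree(tree):
    return Tree(tree.label(), flatten_childtrees([c for c in tree]))

# A hand-built tree of height 5: S -> VP -> NP -> NNP -> 'Pierre'
deep = Tree('S', [Tree('VP', [Tree('NP', [Tree('NNP', ['Pierre'])])])])
print(deep.height())  # 5

# The VP (height 4) is recursed into and eliminated; the NP
# (height 3) is kept, with its words converted to (word, tag) tuples.
flat = flatten_deeptree(deep)
print(flat.height())  # 3
```

The resulting tree is Tree('S', [Tree('NP', [('Pierre', 'NNP')])]): the intermediate VP node is gone, and only the lowest-level NP phrase survives.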
Flattening a deep Tree allows us to call nltk.chunk.util.tree2conlltags() on the flattened Tree, a necessary step to train a chunker. If you try to call this function before flattening the Tree, you get a ValueError exception:
>>> from nltk.chunk.util import tree2conlltags
>>> tree2conlltags(treebank.parsed_sents()[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/nltk/chunk/util.py", line 417, in tree2conlltags
    raise ValueError, "Tree is too deeply nested to be printed in CoNLL format"
ValueError: Tree is too deeply nested to be printed in CoNLL format
But after flattening, there's no problem:
>>> tree2conlltags(flatten_deeptree(treebank.parsed_sents()[0]))
[('Pierre', 'NNP', 'B-NP'), ('Vinken', 'NNP', 'I-NP'), (',', ',', 'O'), ('61', 'CD', 'B-NP'), ('years', 'NNS', 'I-NP'), ('old', 'JJ', 'O'), (',', ',', 'O'), ('will', 'MD', 'O'), ('join', 'VB', 'O'), ('the', 'DT', 'B-NP'), ('board', 'NN', 'I-NP'), ('as', 'IN', 'O'), ('a', 'DT', 'B-NP'), ('nonexecutive', 'JJ', 'I-NP'), ('director', 'NN', 'I-NP'), ('Nov.', 'NNP', 'B-NP-TMP'), ('29', 'CD', 'I-NP-TMP'), ('.', '.', 'O')]
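It's worth noting that NLTK also provides conlltags2tree() in nltk.chunk.util, which reverses the transformation. Here's a small sketch of the round trip on a hand-built flat tree (the sentence fragment is made up for illustration):

```python
from nltk.tree import Tree
from nltk.chunk.util import tree2conlltags, conlltags2tree

# A small flat chunk tree, shaped like the output of flatten_deeptree()
flat = Tree('S', [Tree('NP', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
                  (',', ','), ('will', 'MD')])

# Words inside an NP get B-NP/I-NP tags; everything else gets O
tags = tree2conlltags(flat)
print(tags)
# [('Pierre', 'NNP', 'B-NP'), ('Vinken', 'NNP', 'I-NP'),
#  (',', ',', 'O'), ('will', 'MD', 'O')]

# conlltags2tree() rebuilds the flat chunk tree from the IOB triples
print(conlltags2tree(tags) == flat)  # True
```

This round trip only works because the tree is flat: the IOB triples record one chunk label per word, so there's nowhere to store nested phrase structure.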
Being able to flatten trees opens up the possibility of training a chunker on corpora consisting of deep parse trees.
The cess_esp and cess_cat corpora are Spanish and Catalan corpora that have parsed sentences but no chunked sentences. In other words, they have deep trees that must be flattened in order to train a chunker. In fact, the trees are so deep that a diagram would be overwhelming, but the flattening can be demonstrated by showing the height() of the tree before and after flattening:
>>> from nltk.corpus import cess_esp
>>> cess_esp.parsed_sents()[0].height()
22
>>> flatten_deeptree(cess_esp.parsed_sents()[0]).height()
3
The Training a tagger-based chunker recipe in Chapter 5, Extracting Chunks, covers training a chunker using IOB tags.