In the previous recipe, we flattened a deep Tree by keeping only the lowest-level subtrees. In this recipe, we'll keep only the highest-level subtrees instead.
We'll be using the first parsed sentence from the treebank corpus as our example. Recall from the previous recipe that the sentence Tree looks like this:
The shallow_tree() function defined in transforms.py eliminates all the nested subtrees, keeping only the top subtree labels:
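from nltk.tree import Tree

def shallow_tree(tree):
    children = []
    for t in tree:
        if t.height() < 3:
            # A part-of-speech node: keep its (word, tag) pairs directly
            children.extend(t.pos())
        else:
            # A phrase: keep its label, flatten everything beneath it
            children.append(Tree(t.label(), t.pos()))
    return Tree(tree.label(), children)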
Using it on the first parsed sentence in treebank results in a Tree with only two subtrees:
>>> from nltk.corpus import treebank
>>> from transforms import shallow_tree
>>> shallow_tree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP-SBJ', [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ',')]), Tree('VP', [('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])
We can see the difference both visually, in the following diagram, and programmatically, by comparing the heights of the two trees:
>>> treebank.parsed_sents()[0].height()
7
>>> shallow_tree(treebank.parsed_sents()[0]).height()
3
As in the previous recipe, the height of the new tree is 3, so it can be used for training a chunker.
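To illustrate why a height of 3 matters, here is a minimal sketch that converts a shallow tree into IOB-tagged tokens, the format chunker training typically consumes. The toy sentence is a hypothetical example, not taken from the treebank corpus, and the sketch assumes that tree2conlltags() from nltk.chunk.util is the conversion you want.

```python
from nltk.tree import Tree
from nltk.chunk.util import tree2conlltags

# A hand-built shallow tree (hypothetical example): chunk subtrees hold
# (word, tag) tuples, and the final period sits outside any chunk.
shallow = Tree('S', [
    Tree('NP', [('the', 'DT'), ('dog', 'NN')]),
    Tree('VP', [('chased', 'VBD'), ('a', 'DT'), ('cat', 'NN')]),
    ('.', '.'),
])

# Each token becomes (word, pos_tag, iob_tag); tokens outside a chunk get 'O'.
iob_tags = tree2conlltags(shallow)
for word, pos, iob in iob_tags:
    print(word, pos, iob)
```

A tree with nested chunks would make tree2conlltags() raise an error, which is why the tree has to be flattened first.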
The shallow_tree() function iterates over each of the top-level subtrees in order to create new child trees. If the height() of a subtree is less than 3, then that subtree is replaced by a list of its part-of-speech tagged children. All other subtrees are replaced by a new Tree whose children are the part-of-speech tagged leaves. This eliminates all nested subtrees while retaining the top-level subtrees.
This function is an alternative to flatten_deeptree() from the previous recipe, for when you want to keep the higher-level tree labels and ignore the lower-level labels.