Flattening a deep tree

Some of the included corpora contain parsed sentences, which are often deep trees of nested phrases. Unfortunately, these trees are too deep to use for training a chunker, since IOB tag parsing is not designed for nested chunks. To make these trees usable for chunker training, we must flatten them.

Getting ready

We're going to use the first parsed sentence of the treebank corpus as our example. Here's a diagram showing how deeply nested this tree is:


You may notice that the part-of-speech tags are part of the tree structure instead of being included with the word. This will be handled later using the Tree.pos() method, which was designed specifically for combining words with preterminal Tree labels such as part-of-speech tags.
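To make the idea of pairing words with preterminal labels concrete, here is a minimal sketch that runs without NLTK. It assumes a hypothetical stand-in representation, a (label, children) tuple with plain strings as leaves, and a helper named pos() that only imitates what Tree.pos() returns; it is not NLTK's implementation.

```python
# Hypothetical stand-in for nltk.tree.Tree (an assumption for illustration):
# a tree is a (label, children) tuple and a leaf is a plain string.
def pos(tree):
    """Pair each leaf word with the label of the node directly above it."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, str):
            # Leaf: tag the word with its parent's (preterminal) label.
            pairs.append((child, label))
        else:
            # Subtree: collect its (word, tag) pairs recursively.
            pairs.extend(pos(child))
    return pairs


noun_phrase = ("NP", [("NNP", ["Pierre"]), ("NNP", ["Vinken"])])
print(pos(noun_phrase))
```

The phrase label NP does not appear in the result; only the labels sitting directly above the words are kept.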

How to do it...

In transforms.py is a function named flatten_deeptree(). It takes a single Tree and will return a new Tree that keeps only the lowest-level trees. It uses a helper function, flatten_childtrees(), to do most of the work:

from nltk.tree import Tree

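def flatten_childtrees(trees):
  children = []

  for t in trees:
    if t.height() < 3:
      # Preterminal (a tag over a word): emit its (word, tag) pairs directly.
      children.extend(t.pos())
    elif t.height() == 3:
      # Lowest-level phrase: keep it as a chunk of (word, tag) pairs.
      children.append(Tree(t.label(), t.pos()))
    else:
      # Deeper tree: drop this node and recurse into its children.
      children.extend(flatten_childtrees(list(t)))

  return children

def flatten_deeptree(tree):
  # Keep the root label; replace the children with their flattened form.
  return Tree(tree.label(), flatten_childtrees(list(tree)))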

We can use it on the first parsed sentence of the treebank corpus to get a flatter tree:

>>> from nltk.corpus import treebank
>>> from transforms import flatten_deeptree
>>> flatten_deeptree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), Tree('NP', [('61', 'CD'), ('years', 'NNS')]), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), Tree('NP', [('the', 'DT'), ('board', 'NN')]), ('as', 'IN'), Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]), Tree('NP-TMP', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])

The result is a much flatter Tree that only includes NP phrases. Words that are not part of an NP phrase are separated. This flatter tree is shown in the following diagram:


This Tree is quite similar to the first chunk Tree from the treebank_chunk corpus. The main difference is on the right side: the flattened tree above has two separate subtrees there, NP and NP-TMP, where the treebank_chunk tree has a single NP subtree.

The first tree from treebank_chunk is shown in the following diagram for comparison:


How it works...

The solution is composed of two functions. flatten_deeptree() returns a new Tree from the given tree by passing the list of the given tree's children to flatten_childtrees().

The flatten_childtrees() function is a recursive function that drills down into the Tree until it finds child trees whose height() is equal to or less than 3. A Tree whose height() is less than 3 looks like this:

>>> from nltk.tree import Tree
>>> Tree('NNP', ['Pierre']).height()
2

These short trees are converted into lists of tuples using the pos() method:

>>> Tree('NNP', ['Pierre']).pos()
[('Pierre', 'NNP')]

Trees whose height() is equal to 3 are the lowest level trees that we're interested in keeping. These trees look like this:

>>> Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]).height()
3

And when we call pos() on that tree, we get:

>>> Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]).pos()
[('Pierre', 'NNP'), ('Vinken', 'NNP')]

The recursive nature of flatten_childtrees() eliminates all trees whose height is greater than 3.
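The three-way branch on height can be traced without NLTK or any corpus data. The following is a sketch under stated assumptions: trees are hypothetical (label, children) tuples with strings as leaves, the helpers height(), pos(), flatten_children(), and flatten() are illustrative stand-ins for the Tree methods and the recipe's functions, and the sentence is a shortened, hand-built example rather than the corpus parse.

```python
# Hypothetical stand-in for nltk.tree.Tree (an assumption for illustration):
# a tree is a (label, children) tuple and a leaf is a plain string.
def height(node):
    """A leaf has height 1; a tree is one taller than its tallest child."""
    if isinstance(node, str):
        return 1
    _, children = node
    return 1 + max((height(c) for c in children), default=0)


def pos(tree):
    """Pair each leaf word with the label of the node directly above it."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))
        else:
            pairs.extend(pos(child))
    return pairs


def flatten_children(trees):
    flat = []
    for t in trees:
        h = height(t)
        if h < 3:
            # Tag over a word: emit the (word, tag) pair itself.
            flat.extend(pos(t))
        elif h == 3:
            # Lowest-level phrase: keep it as a chunk of (word, tag) pairs.
            flat.append((t[0], pos(t)))
        else:
            # Deeper phrase: drop the node, keep whatever its children yield.
            flat.extend(flatten_children(t[1]))
    return flat


def flatten(tree):
    return (tree[0], flatten_children(tree[1]))


# Hand-built, shortened sentence; not the actual treebank parse.
sentence = ("S", [
    ("NP-SBJ", [
        ("NP", [("NNP", ["Pierre"]), ("NNP", ["Vinken"])]),
        (",", [","]),
    ]),
    ("VP", [
        ("MD", ["will"]),
        ("VP", [
            ("VB", ["join"]),
            ("NP", [("DT", ["the"]), ("NN", ["board"])]),
        ]),
    ]),
    (".", ["."]),
])

flat = flatten(sentence)
for child in flat[1]:
    print(child)
```

In this sketch, the NP-SBJ node and both VP nodes are taller than 3, so they disappear from the result, while the two NP phrases of height 3 survive as chunks and every other word becomes a bare (word, tag) pair under S.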

There's more...

Flattening a deep Tree allows us to call nltk.chunk.util.tree2conlltags() on the flattened Tree, a necessary step to train a chunker. If you try to call this function before flattening the Tree, you get a ValueError exception:

>>> from nltk.chunk.util import tree2conlltags
>>> tree2conlltags(treebank.parsed_sents()[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/nltk/chunk/util.py", line 417, in tree2conlltags
    raise ValueError, "Tree is too deeply nested to be printed in CoNLL format"
ValueError: Tree is too deeply nested to be printed in CoNLL format

But after flattening, there's no problem:

>>> tree2conlltags(flatten_deeptree(treebank.parsed_sents()[0]))
[('Pierre', 'NNP', 'B-NP'), ('Vinken', 'NNP', 'I-NP'), (',', ',', 'O'), ('61', 'CD', 'B-NP'), ('years', 'NNS', 'I-NP'), ('old', 'JJ', 'O'), (',', ',', 'O'), ('will', 'MD', 'O'), ('join', 'VB', 'O'), ('the', 'DT', 'B-NP'), ('board', 'NN', 'I-NP'), ('as', 'IN', 'O'), ('a', 'DT', 'B-NP'), ('nonexecutive', 'JJ', 'I-NP'), ('director', 'NN', 'I-NP'), ('Nov.', 'NNP', 'B-NP-TMP'), ('29', 'CD', 'I-NP-TMP'), ('.', '.', 'O')]

Being able to flatten trees opens up the possibility of training a chunker on corpora consisting of deep parse trees.
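To see why depth matters for IOB tags, here is a rough sketch of the conversion idea in plain Python. It is not NLTK's tree2conlltags() implementation; the function name to_iob() and the tuple layout (a root label plus a list of either (word, tag) pairs or (chunk_label, pairs) chunks) are assumptions made for illustration.

```python
def to_iob(flat_tree):
    """Turn a flat (label, items) structure into (word, tag, iob) triples.

    Each item is either a (word, tag) pair or a (chunk_label, pairs) chunk.
    This layout is a hypothetical stand-in, not an NLTK data structure.
    """
    _, items = flat_tree
    triples = []
    for first, second in items:
        if isinstance(second, str):
            # Bare (word, tag) pair outside any chunk.
            triples.append((first, second, "O"))
            continue
        # Otherwise this is a chunk: first is its label, second its members.
        for i, member in enumerate(second):
            if not isinstance(member[1], str):
                # A chunk inside a chunk has no IOB representation.
                raise ValueError("nested chunk found; flatten the tree first")
            word, tag = member
            prefix = "B-" if i == 0 else "I-"
            triples.append((word, tag, prefix + first))
    return triples


flat = ("S", [
    ("NP", [("the", "DT"), ("board", "NN")]),
    ("as", "IN"),
])
print(to_iob(flat))

nested = ("S", [
    ("VP", [("join", "VB"), ("NP", [("the", "DT"), ("board", "NN")])]),
])
try:
    to_iob(nested)
except ValueError as err:
    print("ValueError:", err)
```

Each word gets exactly one tag, so a single level of chunks is all the scheme can express; the nested input fails for the same underlying reason that the unflattened parse tree does.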

The cess_esp and cess_cat treebanks

The cess_esp and cess_cat corpora are Spanish and Catalan corpora that have parsed sentences but no chunked sentences. In other words, they have deep trees that must be flattened in order to train a chunker. In fact, the trees are so deep that a diagram would be overwhelming, but the flattening can be demonstrated by showing the height() of the tree before and after flattening:

>>> from nltk.corpus import cess_esp
>>> cess_esp.parsed_sents()[0].height()
22
>>> flatten_deeptree(cess_esp.parsed_sents()[0]).height()
3

See also

The Training a tagger-based chunker recipe in Chapter 5, Extracting Chunks, covers training a chunker using IOB tags.
