Some of the included corpora contain parsed sentences, which are often deep trees of nested phrases. Unfortunately, these trees are too deep to use for training a chunker, since the IOB tag format cannot represent nested chunks. To make these trees usable for chunker training, we must flatten them.
We're going to use the first parsed sentence of the treebank corpus as our example. Here's a diagram showing how deeply nested this tree is:
You may notice that the part-of-speech tags are part of the tree structure instead of being included with the word. This will be handled later using the Tree.pos() method, which was designed specifically for combining words with preterminal Tree labels such as part-of-speech tags.
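To see this pairing in isolation before using it for flattening, here's a minimal sketch on a small hand-built tree (the sentence fragment is made up for illustration):

```python
from nltk.tree import Tree

# In a parse tree, the POS tag 'NNP' is a preterminal Tree label,
# not attached directly to the word.
sent = Tree('S', [Tree('NP', [Tree('NNP', ['Pierre']),
                              Tree('NNP', ['Vinken'])])])

# pos() walks the tree and pairs each word with its preterminal label
print(sent.pos())  # [('Pierre', 'NNP'), ('Vinken', 'NNP')]
```

Note that pos() discards everything about the tree structure except the word/tag pairs, which is exactly what we want for the lowest-level trees.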
In transforms.py is a function named flatten_deeptree(). It takes a single Tree and returns a new Tree that keeps only the lowest-level trees. It uses a helper function, flatten_childtrees(), to do most of the work:
from nltk.tree import Tree

def flatten_childtrees(trees):
    children = []

    for t in trees:
        if t.height() < 3:
            children.extend(t.pos())
        elif t.height() == 3:
            children.append(Tree(t.label(), t.pos()))
        else:
            children.extend(flatten_childtrees([c for c in t]))

    return children

def flatten_deeptree(tree):
    return Tree(tree.label(), flatten_childtrees([c for c in tree]))
We can use it on the first parsed sentence of the treebank corpus to get a flatter tree:
>>> from nltk.corpus import treebank
>>> from transforms import flatten_deeptree
>>> flatten_deeptree(treebank.parsed_sents()[0])
Tree('S', [Tree('NP', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), Tree('NP', [('61', 'CD'), ('years', 'NNS')]), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), Tree('NP', [('the', 'DT'), ('board', 'NN')]), ('as', 'IN'), Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]), Tree('NP-TMP', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])
The result is a much flatter Tree that only includes NP phrases. Words that are not part of an NP phrase are separated. This flatter tree is shown in the following diagram:
This Tree is quite similar to the first chunk Tree from the treebank_chunk corpus. The main difference is that the rightmost NP Tree is separated into two subtrees above, one of them named NP-TMP.
The first tree from treebank_chunk is shown in the following diagram for comparison. The main difference is the right side of the tree, which has only one NP subtree instead of two:
The solution is composed of two functions: flatten_deeptree() returns a new Tree from the given tree by calling flatten_childtrees() on each of the given tree's children.
The flatten_childtrees() function is a recursive function that drills down into the Tree until it finds child trees whose height() is equal to or less than 3. A Tree whose height() is less than 3 looks like this:
>>> from nltk.tree import Tree
>>> Tree('NNP', ['Pierre']).height()
2
These short trees are converted into lists of tuples using the pos() function:
>>> Tree('NNP', ['Pierre']).pos()
[('Pierre', 'NNP')]
Trees whose height() is equal to 3 are the lowest-level trees that we're interested in keeping. These trees look like this:
>>> Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]).height()
3
And when we call pos() on that tree, we get:
>>> Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]).pos()
[('Pierre', 'NNP'), ('Vinken', 'NNP')]
The recursive nature of flatten_childtrees() eliminates all trees whose height is greater than 3.
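To make the recursion concrete, here's a small self-contained sketch that applies the recipe's functions (repeated from transforms.py) to a hand-built tree of height 5; the toy one-word sentence is made up for illustration:

```python
from nltk.tree import Tree

# The recipe's functions, repeated from transforms.py so this
# sketch is self-contained.
def flatten_childtrees(trees):
    children = []
    for t in trees:
        if t.height() < 3:
            children.extend(t.pos())
        elif t.height() == 3:
            children.append(Tree(t.label(), t.pos()))
        else:
            children.extend(flatten_childtrees([c for c in t]))
    return children

def flatten_deeptree(tree):
    return Tree(tree.label(), flatten_childtrees([c for c in tree]))

# A hand-built tree of height 5: S -> VP -> NP -> NNP -> 'Pierre'
deep = Tree('S', [Tree('VP', [Tree('NP', [Tree('NNP', ['Pierre'])])])])
print(deep.height())  # 5

# The VP (height 4) is recursed into and eliminated; the NP
# (height 3) is kept, with its words converted to (word, tag) tuples.
flat = flatten_deeptree(deep)
print(flat.height())  # 3
```

The resulting tree is Tree('S', [Tree('NP', [('Pierre', 'NNP')])]): the intermediate VP node is gone, and only the lowest-level NP phrase survives.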
Flattening a deep Tree allows us to call nltk.chunk.util.tree2conlltags() on the flattened Tree, a necessary step to train a chunker. If you try to call this function before flattening the Tree, you get a ValueError exception:
>>> from nltk.chunk.util import tree2conlltags
>>> tree2conlltags(treebank.parsed_sents()[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/nltk/chunk/util.py", line 417, in tree2conlltags
    raise ValueError, "Tree is too deeply nested to be printed in CoNLL format"
ValueError: Tree is too deeply nested to be printed in CoNLL format
But after flattening, there's no problem:
>>> tree2conlltags(flatten_deeptree(treebank.parsed_sents()[0]))
[('Pierre', 'NNP', 'B-NP'), ('Vinken', 'NNP', 'I-NP'), (',', ',', 'O'), ('61', 'CD', 'B-NP'), ('years', 'NNS', 'I-NP'), ('old', 'JJ', 'O'), (',', ',', 'O'), ('will', 'MD', 'O'), ('join', 'VB', 'O'), ('the', 'DT', 'B-NP'), ('board', 'NN', 'I-NP'), ('as', 'IN', 'O'), ('a', 'DT', 'B-NP'), ('nonexecutive', 'JJ', 'I-NP'), ('director', 'NN', 'I-NP'), ('Nov.', 'NNP', 'B-NP-TMP'), ('29', 'CD', 'I-NP-TMP'), ('.', '.', 'O')]
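It's worth noting that NLTK also provides conlltags2tree() in nltk.chunk.util, which reverses the transformation. Here's a small sketch of the round trip on a hand-built flat tree (the sentence fragment is made up for illustration):

```python
from nltk.tree import Tree
from nltk.chunk.util import tree2conlltags, conlltags2tree

# A small flat chunk tree, shaped like the output of flatten_deeptree()
flat = Tree('S', [Tree('NP', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
                  (',', ','), ('will', 'MD')])

# Words inside an NP get B-NP/I-NP tags; everything else gets O
tags = tree2conlltags(flat)
print(tags)
# [('Pierre', 'NNP', 'B-NP'), ('Vinken', 'NNP', 'I-NP'),
#  (',', ',', 'O'), ('will', 'MD', 'O')]

# conlltags2tree() rebuilds the flat chunk tree from the IOB triples
print(conlltags2tree(tags) == flat)  # True
```

This round trip only works because the tree is flat: the IOB triples record one chunk label per word, so there's nowhere to store nested phrase structure.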
Being able to flatten trees opens up the possibility of training a chunker on corpora consisting of deep parse trees.
The cess_esp and cess_cat corpora are Spanish and Catalan corpora that have parsed sentences but no chunked sentences. In other words, they have deep trees that must be flattened in order to train a chunker. In fact, the trees are so deep that a diagram would be overwhelming, but the flattening can be demonstrated by showing the height() of the tree before and after flattening:
>>> from nltk.corpus import cess_esp
>>> cess_esp.parsed_sents()[0].height()
22
>>> flatten_deeptree(cess_esp.parsed_sents()[0]).height()
3
The Training a tagger-based chunker recipe in Chapter 5, Extracting Chunks, covers training a chunker using IOB tags.