Converting a chunk tree to text

At some point, you may want to convert a Tree or subtree back to a sentence or chunk string. This is mostly straightforward, except when it comes to properly outputting punctuation.

How to do it...

We'll use the first tree of the treebank_chunk corpus as our example. The obvious first step is to join all the words in the tree with a space:

>>> from nltk.corpus import treebank_chunk
>>> tree = treebank_chunk.chunked_sents()[0]
>>> ' '.join([w for w, t in tree.leaves()])
'Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .'

As you can see, the punctuation isn't quite right. The commas and period are treated as individual words, so each one gets a space in front of it like any other word. We can fix this using regular expression substitution. This is implemented in the chunk_tree_to_sent() function found in transforms.py:

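import re
# A whitespace character followed by one captured punctuation mark
punct_re = re.compile(r'\s([,\.;\?])')

def chunk_tree_to_sent(tree, concat=' '):
  s = concat.join([w for w, t in tree.leaves()])
  # Keep only the captured punctuation, dropping the whitespace before it
  return re.sub(punct_re, r'\g<1>', s)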

Using chunk_tree_to_sent() results in a cleaner sentence, with no space before each punctuation mark:

>>> from transforms import chunk_tree_to_sent
>>> chunk_tree_to_sent(tree)
'Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.'

How it works...

To correct the extra spaces in front of the punctuation, we create a regular expression, punct_re, that matches a whitespace character followed by any of the known punctuation characters. We escape both '.' and '?' with a '\' since they are special characters in regular expressions (inside a character class the escapes are optional, but harmless). The punctuation is surrounded by parentheses so we can use the matched group for substitution.

Once we have our regular expression, we define chunk_tree_to_sent(), whose first step is to join the words by a concatenation character that defaults to a space. Then, we can call re.sub() to replace all the punctuation matches with just the punctuation group. This eliminates the space in front of the punctuation characters, resulting in a more correct string.
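To see the two steps in isolation, here is a minimal sketch that runs without NLTK. The leaves list is a hand-written stand-in for the output of tree.leaves(), and the names joined and cleaned are illustrative only:

```python
import re

# Same pattern as punct_re above: one whitespace character
# followed by a captured punctuation mark.
punct_re = re.compile(r'\s([,\.;\?])')

# Hand-written (word, tag) pairs standing in for tree.leaves().
leaves = [('Hello', 'UH'), (',', ','), ('world', 'NN'), ('.', '.')]

# Step 1: join the words with the concatenation character.
joined = ' '.join(w for w, t in leaves)

# Step 2: replace each "space + punctuation" match with just the punctuation.
cleaned = punct_re.sub(r'\g<1>', joined)

print(joined)   # Hello , world .
print(cleaned)  # Hello, world.
```

Note that the pattern only looks for whitespace before the punctuation, so it relies on concat being a whitespace character such as the default space.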

There's more...

We can simplify this function a little using nltk.tag.untag() to get words from the tree's leaves, instead of using our own list comprehension:

import nltk.tag, re
punct_re = re.compile(r'\s([,\.;\?])')

def chunk_tree_to_sent(tree, concat=' '):
  # untag() strips the tags, leaving just the words
  s = concat.join(nltk.tag.untag(tree.leaves()))
  return re.sub(punct_re, r'\g<1>', s)
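As a quick check that untag() does the same job as the list comprehension, the sketch below applies it to a hand-written list of (word, tag) pairs. The leaves_to_sent() helper is hypothetical (it is not part of transforms.py) and takes the leaves directly so that no corpus is needed; the fallback definition of untag() is an assumption added only so that the example also runs where NLTK is not installed:

```python
import re

try:
    from nltk.tag import untag
except ImportError:
    # Stand-in with the same behavior, used only if NLTK is unavailable.
    def untag(tagged_sentence):
        return [w for (w, t) in tagged_sentence]

punct_re = re.compile(r'\s([,\.;\?])')

def leaves_to_sent(leaves, concat=' '):
    # Hypothetical variant of chunk_tree_to_sent() that accepts
    # a list of (word, tag) pairs instead of a Tree.
    s = concat.join(untag(leaves))
    return punct_re.sub(r'\g<1>', s)

leaves = [('Will', 'MD'), ('it', 'PRP'), ('work', 'VB'), ('?', '.')]

print(untag(leaves))           # ['Will', 'it', 'work', '?']
print(leaves_to_sent(leaves))  # Will it work?
```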
