A chunk is a short phrase within a sentence. If you remember sentence diagrams from grade school, they were a tree-like representation of phrases within a sentence. This is exactly what chunks are: subtrees within a sentence tree, and they will be covered in much more detail in Chapter 5, Extracting Chunks. The following is a sample sentence tree with three Noun Phrase (NP) chunks shown as subtrees:
This recipe will cover how to create a corpus with sentences that contain chunks.
The following is an excerpt from the tagged treebank corpus. It has part-of-speech tags, as in the previous recipe, but it also uses square brackets to denote chunks. The text within the brackets has been highlighted to make the chunks more apparent. This is the same sentence as in the previous tree diagram, but in text form:
[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.
In this format, every chunk is a noun phrase. Words that are not within brackets are part of the sentence tree, but are not part of any noun phrase subtree.
Put the previous excerpt into a file called treebank.chunk, and then do the following:
>>> from nltk.corpus.reader import ChunkedCorpusReader
>>> reader = ChunkedCorpusReader('.', r'.*\.chunk')
>>> reader.chunked_words()
[Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ...]
>>> reader.chunked_sents()
[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','), Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]
>>> reader.chunked_paras()
[[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','), Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]]
The ChunkedCorpusReader class provides the same methods as the TaggedCorpusReader for getting tagged tokens, along with three new methods for getting chunks. Each chunk is represented as an instance of nltk.tree.Tree. Sentence-level trees look like Tree('S', [...]), while noun phrase trees look like Tree('NP', [...]). In chunked_sents(), you get a list of sentence trees, with each noun phrase as a subtree of the sentence. In chunked_words(), you get a list of noun phrase trees alongside tagged tokens of words that were not in a chunk. In chunked_paras(), you get a list of paragraphs, each of which is a list of sentence trees. The following is an inheritance diagram listing the major methods:
The ChunkedCorpusReader class is similar to the TaggedCorpusReader class from the previous recipe. It has the same default sent_tokenizer and para_block_reader functions, but instead of a word_tokenizer function, it uses a str2chunktree() function. The default is nltk.chunk.util.tagstr2tree(), which parses a sentence string containing bracketed chunks into a sentence tree, with each chunk as a noun phrase subtree. Words are split by whitespace, and the default word/tag separator is '/'. If you want to customize chunk parsing, then you can pass in your own function for str2chunktree().
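To make the parsing step concrete, the following is a minimal pure-Python sketch of what a str2chunktree()-style function has to do with the bracketed format. It is not NLTK's implementation: the name parse_chunked_sentence is hypothetical, and it returns plain (label, children) tuples rather than nltk.tree.Tree instances so that it runs without NLTK installed.

```python
import re

def parse_chunked_sentence(s, sep="/"):
    """Parse a bracketed chunk string into a nested structure.

    Returns ('S', children), where each child is either a (word, tag)
    tuple or an ('NP', [(word, tag), ...]) chunk. Hypothetical helper
    for illustration only; NLTK returns Tree objects instead.
    """
    children = []   # top-level children of the sentence
    chunk = None    # tokens of the currently open chunk, if any
    # Tokens are an opening bracket, a closing bracket, or a word/tag pair.
    for token in re.findall(r"\[|\]|[^\[\]\s]+", s):
        if token == "[":
            if chunk is not None:
                raise ValueError("nested brackets are not supported")
            chunk = []
        elif token == "]":
            if chunk is None:
                raise ValueError("unexpected closing bracket")
            children.append(("NP", chunk))
            chunk = None
        else:
            # Split on the last separator so that ',/,' becomes (',', ',').
            word, found, tag = token.rpartition(sep)
            if not found:
                word, tag = token, None  # token carries no tag
            target = chunk if chunk is not None else children
            target.append((word, tag))
    if chunk is not None:
        raise ValueError("unclosed bracket")
    return ("S", children)

sentence = ("[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN "
            "about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.")
label, children = parse_chunked_sentence(sentence)
print(children[0])
```

A real replacement for str2chunktree() would need to return an nltk.tree.Tree so that the reader's chunked_*() methods keep behaving as described above.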
An alternative format for denoting chunks is called IOB tags. IOB tags are similar to part-of-speech tags, but provide a way to denote the inside, outside, and beginning of a chunk. They also have the benefit of allowing multiple different chunk phrase types, not just noun phrases. The following is an excerpt from the conll2000 corpus. Each word is on its own line with a part-of-speech tag followed by an IOB tag:
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
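B-NP denotes the beginning of a noun phrase, while I-NP denotes that the word is inside the current noun phrase. B-VP and I-VP denote the beginning and inside of a verb phrase, and B-PP marks the beginning of a prepositional phrase. O marks a word that is outside of any chunk, such as the final period in this sentence.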
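The way IOB tags map onto chunks can be sketched in a few lines of plain Python. This is an illustration of the tagging scheme under stated assumptions, not the code that NLTK uses: the function name iob_to_chunks is hypothetical, chunks are plain (type, tokens) tuples, and an I- tag that does not continue the current chunk is leniently treated as starting a new one.

```python
def iob_to_chunks(iob_sent):
    """Group (word, pos, iob) triples into chunks.

    Returns a list whose items are either (word, pos) tuples for words
    outside any chunk, or (chunk_type, [(word, pos), ...]) for chunks.
    """
    children = []
    current_type = None  # type of the chunk currently being built, if any
    for word, pos, iob in iob_sent:
        if iob == "O":
            # Outside any chunk: emit the tagged token and close the chunk.
            children.append((word, pos))
            current_type = None
            continue
        prefix, _, chunk_type = iob.partition("-")
        # B- always starts a new chunk; an I- tag of a different type
        # (or one following an O) is treated as a new chunk as well.
        if prefix == "B" or chunk_type != current_type:
            children.append((chunk_type, []))
            current_type = chunk_type
        children[-1][1].append((word, pos))
    return children

iob_sent = [
    ("Mr.", "NNP", "B-NP"), ("Meador", "NNP", "I-NP"),
    ("had", "VBD", "B-VP"), ("been", "VBN", "I-VP"),
    ("executive", "JJ", "B-NP"), ("vice", "NN", "I-NP"),
    ("president", "NN", "I-NP"), ("of", "IN", "B-PP"),
    ("Balcor", "NNP", "B-NP"), (".", ".", "O"),
]
chunks = iob_to_chunks(iob_sent)
for item in chunks:
    print(item)
```

Note that "of" and "Balcor" end up in separate chunks because each carries its own B- tag, and the final period stays a bare tagged token because it is tagged O.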
To read a corpus using the IOB format, you must use the ConllChunkCorpusReader class. Each sentence is separated by a blank line, but there is no separation for paragraphs. This means that the para_* methods are not available. If you put the previous IOB example text into a file named conll.iob, you can create and use a ConllChunkCorpusReader instance with the following code. The third argument to ConllChunkCorpusReader should be a tuple or list specifying the types of chunks in the file, which in this case is ('NP', 'VP', 'PP'):
>>> from nltk.corpus.reader import ConllChunkCorpusReader
>>> conllreader = ConllChunkCorpusReader('.', r'.*\.iob', ('NP', 'VP', 'PP'))
>>> conllreader.chunked_words()
[Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), ...]
>>> conllreader.chunked_sents()
[Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]), Tree('PP', [('of', 'IN')]), Tree('NP', [('Balcor', 'NNP')]), ('.', '.')])]
>>> conllreader.iob_words()
[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ...]
>>> conllreader.iob_sents()
[[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ('had', 'VBD', 'B-VP'), ('been', 'VBN', 'I-VP'), ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), ('president', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), ('.', '.', 'O')]]
The previous code also shows the iob_words() and iob_sents() methods, which return lists of 3-tuples of the form (word, pos, iob). The inheritance diagram for ConllChunkCorpusReader looks like the following diagram, with most of the methods implemented by its superclass, ConllCorpusReader:
When it comes to chunk trees, the leaves of a tree are the tagged tokens. So if you want to get a list of all the tagged tokens in a tree, call the leaves() method using the following code:
>>> reader.chunked_words()[0].leaves()
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]
>>> reader.chunked_sents()[0].leaves()
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
>>> reader.chunked_paras()[0][0].leaves()
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
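Conceptually, getting the leaves is an in-order flattening of the tree. The following sketch shows the idea on plain nested (label, children) tuples; it is an assumption-laden stand-in for illustration, not how nltk.tree.Tree is implemented, and the function name flatten_leaves is hypothetical.

```python
def flatten_leaves(node):
    """Return the tagged tokens under a (label, children) node, in order."""
    label, children = node
    result = []
    for child in children:
        if isinstance(child[1], list):
            # (label, [children]) is a subtree: recurse into it.
            result.extend(flatten_leaves(child))
        else:
            # (word, tag) is a leaf: keep it as is.
            result.append(child)
    return result

sentence_tree = ("S", [
    ("NP", [("the", "DT"), ("spokesman", "NN")]),
    ("said", "VBD"),
    (".", "."),
])
tokens = flatten_leaves(sentence_tree)
print(tokens)
```

The chunk boundaries disappear in the result, which is why leaves() gives the same tagged tokens you would get from a plain tagged corpus reader.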
The nltk.corpus.treebank_chunk corpus uses ChunkedCorpusReader to provide part-of-speech tagged words and noun phrase chunks of Wall Street Journal headlines. NLTK comes with a 5 percent sample from the Penn Treebank Project. You can find out more at http://www.cis.upenn.edu/~treebank/home.html.
CoNLL stands for the Conference on Computational Natural Language Learning. For the year 2000 conference, a shared task was undertaken to produce a corpus of chunks based on the Wall Street Journal corpus. In addition to Noun Phrases (NP), it also contains Verb Phrases (VP) and Prepositional Phrases (PP). This chunked corpus is available as nltk.corpus.conll2000, which is an instance of ConllChunkCorpusReader. You can read more at http://www.cnts.ua.ac.be/conll2000/chunking/.
Chapter 5, Extracting Chunks, will cover chunk extraction in detail. Also see the previous recipe for details on getting tagged tokens from a corpus reader.