Classification-based chunking

Unlike most part-of-speech taggers, the ClassifierBasedTagger class learns from features. That means we can create a ClassifierChunker class that can learn from both the words and part-of-speech tags, instead of only the part-of-speech tags as the TagChunker class does.

How to do it...

For the ClassifierChunker class, we don't want to discard the words from the training sentences as we did in the previous recipe. Instead, to remain compatible with the 2-tuple (word, pos) format required for training a ClassifierBasedTagger class, we convert the (word, pos, iob) 3-tuples from tree2conlltags() into ((word, pos), iob) 2-tuples using the chunk_trees2train_chunks() function. This code can be found in

from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger

def chunk_trees2train_chunks(chunk_sents):
  tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
  return [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
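
To make the conversion concrete, here is a minimal sketch that applies the same list comprehension to hand-written triples. The sample sentence and the names conll_sent and train_sent are assumptions for illustration only; in the recipe, the triples come from tree2conlltags().

```python
# Hand-written (word, pos, iob) triples standing in for the output of
# tree2conlltags(); this data is illustrative, not from a corpus.
conll_sent = [
    ('the', 'DT', 'B-NP'),
    ('cat', 'NN', 'I-NP'),
    ('sat', 'VBD', 'O'),
]

# Same reshaping as in chunk_trees2train_chunks(): the (word, pos) pair
# becomes the "token" and the IOB tag becomes the "tag" to be learned.
train_sent = [((w, t), c) for (w, t, c) in conll_sent]

for token, iob in train_sent:
    print(token, iob)
```

Each training item now looks like an ordinary (token, tag) pair to the tagger, except that the token happens to be a (word, pos) tuple.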

Next, we need a feature detector function to pass into ClassifierBasedTagger. Our default feature detector function, prev_next_pos_iob(), knows that the list of tokens is really a list of (word, pos) tuples, and can use that to return a feature set suitable for a classifier. In fact, any feature detector function used with the ClassifierChunker class (defined next) should recognize that tokens are a list of (word, pos) tuples, and have the same function signature as prev_next_pos_iob(). To give the classifier as much information as we can, this feature set contains the current, previous, and next word and part-of-speech tag, along with the previous IOB tag:

def prev_next_pos_iob(tokens, index, history):
  word, pos = tokens[index]

  # No previous token at the start of the sentence.
  if index == 0:
    prevword, prevpos, previob = ('<START>',)*3
  else:
    prevword, prevpos = tokens[index-1]
    previob = history[index-1]

  # No next token at the end of the sentence.
  if index == len(tokens) - 1:
    nextword, nextpos = ('<END>',)*2
  else:
    nextword, nextpos = tokens[index+1]

  feats = {
    'word': word,
    'pos': pos,
    'nextword': nextword,
    'nextpos': nextpos,
    'prevword': prevword,
    'prevpos': prevpos,
    'previob': previob
  }

  return feats

Now, we can define the ClassifierChunker class, which uses an internal ClassifierBasedTagger with features extracted using prev_next_pos_iob() and training sentences from chunk_trees2train_chunks(). As a subclass of ChunkParserI, it implements the parse() method, which converts the ((w, t), c) tuples produced by the internal tagger into Trees using conlltags2tree():

class ClassifierChunker(ChunkParserI):
  def __init__(self, train_sents, feature_detector=prev_next_pos_iob, **kwargs):
    if not feature_detector:
      # Fall back to the default detector if None is passed explicitly.
      feature_detector = prev_next_pos_iob

    train_chunks = chunk_trees2train_chunks(train_sents)
    self.tagger = ClassifierBasedTagger(train=train_chunks,
      feature_detector=feature_detector, **kwargs)

  def parse(self, tagged_sent):
    if not tagged_sent: return None
    chunks = self.tagger.tag(tagged_sent)
    return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])

Using the same train_chunks and test_chunks from the treebank_chunk corpus in the previous recipe, we can evaluate this code from

>>> from chunkers import ClassifierChunker
>>> chunker = ClassifierChunker(train_chunks)
>>> score = chunker.evaluate(test_chunks)
>>> score.accuracy()
>>> score.precision()
>>> score.recall()

Compared to the TagChunker class, all the scores have gone up a bit. Let's see how it does on conll2000:

>>> chunker = ClassifierChunker(conll_train)
>>> score = chunker.evaluate(conll_test)
>>> score.accuracy()
>>> score.precision()
>>> score.recall()

This is much improved over the TagChunker class.

How it works...

Like the TagChunker class in the previous recipe, we are training a part-of-speech tagger for IOB tagging. But in this case, we want to include the word as a feature to power a classifier. By creating nested 2-tuples of the form ((word, pos), iob), we can pass the word through the tagger into our feature detector function. The chunk_trees2train_chunks() function produces these nested 2-tuples, and prev_next_pos_iob() is aware of them and uses each element as a feature. The following features are extracted:

  • The current word and part-of-speech tag
  • The previous word, part-of-speech tag, and IOB tag
  • The next word and part-of-speech tag

The arguments to prev_next_pos_iob() are the same as those of the feature_detector() method of the ClassifierBasedTagger class: tokens, index, and history. But this time, tokens will be a list of (word, pos) 2-tuples, and history will be a list of IOB tags. The special feature values <START> and <END> are used if there are no previous or next tokens.
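
The role of history is easier to see in a small simulation. The following is a simplified sketch of left-to-right tagging, not NLTK's actual implementation: simple_detector(), toy_classify(), and tag_left_to_right() are hypothetical names, and the hard-coded rules stand in for a trained classifier.

```python
def simple_detector(tokens, index, history):
    # Hypothetical, reduced detector: current POS tag plus previous IOB tag.
    word, pos = tokens[index]
    previob = history[index - 1] if index > 0 else '<START>'
    return {'pos': pos, 'previob': previob}

def toy_classify(feats):
    # Hypothetical stand-in for a trained classifier: a few fixed rules.
    if feats['pos'] in ('DT', 'NN'):
        return 'I-NP' if feats['previob'] in ('B-NP', 'I-NP') else 'B-NP'
    return 'O'

def tag_left_to_right(tokens, detector, classify):
    # history grows by one IOB tag per token, so the detector for token i
    # can only look at tags already assigned to tokens 0..i-1.
    history = []
    for index in range(len(tokens)):
        feats = detector(tokens, index, history)
        history.append(classify(feats))
    return list(zip(tokens, history))

tokens = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]
tagged = tag_left_to_right(tokens, simple_detector, toy_classify)
print(tagged)
```

Because each decision feeds into the next token's features, an early mistake in the IOB tags can influence later predictions.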

The ClassifierChunker class uses an internal ClassifierBasedTagger and prev_next_pos_iob() as its default feature_detector. The results from the tagger, which are in the same nested 2-tuple form, are then reformatted into 3-tuples to return a final Tree using conlltags2tree().
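
As a rough illustration of that last step, the sketch below flattens hand-written tagger output into 3-tuples and then groups them by IOB tag. The group_chunks() helper is a hypothetical, pure-Python stand-in for conlltags2tree(): it returns plain (label, tokens) pairs rather than a Tree, and its handling of a stray I- tag (starting a new chunk) is an assumption of this sketch.

```python
# Illustrative tagger output in the nested ((word, pos), iob) form.
chunks = [
    (('the', 'DT'), 'B-NP'), (('cat', 'NN'), 'I-NP'),
    (('sat', 'VBD'), 'O'), (('on', 'IN'), 'O'),
    (('the', 'DT'), 'B-NP'), (('mat', 'NN'), 'I-NP'),
]

# Same flattening as in ClassifierChunker.parse().
conll = [(w, t, c) for ((w, t), c) in chunks]

def group_chunks(conll_tags):
    """Group IOB triples into (label, [(word, pos), ...]) pairs.

    Tokens outside any chunk get the label None.
    """
    groups = []
    for word, pos, iob in conll_tags:
        label = iob[2:] if iob[:2] in ('B-', 'I-') else None
        if iob.startswith('I-') and groups and groups[-1][0] == label:
            groups[-1][1].append((word, pos))   # continue the open chunk
        else:
            groups.append((label, [(word, pos)]))  # new chunk or outside token
    return groups

for label, toks in group_chunks(conll):
    print(label, toks)
```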

There's more...

You can use your own feature detector function by passing it into the ClassifierChunker class as feature_detector. The tokens argument will contain a list of (word, tag) tuples, and history will be a list of the previous IOB tags found.
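
For instance, a custom detector might look like the following sketch. The function name pos_context_iob() and its particular feature choices (lowercased word, three-character suffix, neighboring part-of-speech tags) are assumptions made for illustration; whether they help on a given corpus would need to be measured.

```python
def pos_context_iob(tokens, index, history):
    """Hypothetical detector with the same signature as prev_next_pos_iob()."""
    word, pos = tokens[index]

    if index == 0:
        prevpos, previob = '<START>', '<START>'
    else:
        prevpos = tokens[index - 1][1]
        previob = history[index - 1]

    if index == len(tokens) - 1:
        nextpos = '<END>'
    else:
        nextpos = tokens[index + 1][1]

    return {
        'word.lower': word.lower(),      # case-insensitive word identity
        'suffix3': word[-3:].lower(),    # crude morphological hint
        'pos': pos,
        'prevpos': prevpos,
        'nextpos': nextpos,
        'previob': previob,
    }

# Feature set for the middle token, given one IOB tag already in history.
tokens = [('The', 'DT'), ('running', 'VBG'), ('dog', 'NN')]
feats = pos_context_iob(tokens, 1, ['B-NP'])
print(feats)
```

Such a function would be passed in as ClassifierChunker(train_chunks, feature_detector=pos_context_iob).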

Using a different classifier builder

The ClassifierBasedTagger class defaults to using NaiveBayesClassifier.train as its classifier_builder. But you can use any classifier you want by overriding the classifier_builder keyword argument. Here's an example using MaxentClassifier.train:

>>> from nltk.classify import MaxentClassifier
>>> builder = lambda toks: MaxentClassifier.train(toks, trace=0, max_iter=10, min_lldelta=0.01)
>>> me_chunker = ClassifierChunker(train_chunks, classifier_builder=builder)
>>> score = me_chunker.evaluate(test_chunks)
>>> score.accuracy()
>>> score.precision()
>>> score.recall()

Instead of using MaxentClassifier.train directly, I wrapped it in a lambda so that its output is quiet (trace=0) and it finishes in a reasonable amount of time (max_iter=10, min_lldelta=0.01). The scores are slightly different from those obtained using the NaiveBayesClassifier class.


The MaxentClassifier score values mentioned earlier were computed with the environment variable PYTHONHASHSEED=0. If you use a different value, or do not set this environment variable, your score values may differ.
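
PYTHONHASHSEED is read when the interpreter starts, so it needs to be set in the environment before Python is launched rather than from inside a running script. The following sketch, with the hypothetical helper child_hash(), launches fresh interpreters with the variable set and checks that string hashing is repeatable across runs; it illustrates the mechanism only and does not train a classifier.

```python
import os
import subprocess
import sys

def child_hash(seed):
    """Run a fresh interpreter with PYTHONHASHSEED=seed; return hash('chunk')."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, '-c', "print(hash('chunk'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

# Two separate interpreter runs with the same seed give the same hash value.
first = child_hash(0)
second = child_hash(0)
print(first == second)
```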

See also

The previous recipe, Training a tagger-based chunker, introduced the idea of using a part-of-speech tagger for training a chunker. The Classifier-based tagging recipe in Chapter 4, Part-of-speech Tagging, describes ClassifierBasedPOSTagger, which is a subclass of ClassifierBasedTagger. And in Chapter 7, Text Classification, we'll cover classification in detail.

