Classification-based chunking

Unlike most part-of-speech taggers, the ClassifierBasedTagger class learns from features. That means we can create a ClassifierChunker class that can learn from both the words and part-of-speech tags, instead of only the part-of-speech tags as the TagChunker class does.

How to do it...

For the ClassifierChunker class, we don't want to discard the words from the training sentences as we did in the previous recipe. Instead, to remain compatible with the 2-tuple (word, pos) format required for training a ClassifierBasedTagger class, we convert the (word, pos, iob) 3-tuples from tree2conlltags() into ((word, pos), iob) 2-tuples using the chunk_trees2train_chunks() function. This code can be found in chunkers.py:

from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger

def chunk_trees2train_chunks(chunk_sents):
  tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
  return [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
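
As a quick, hand-built illustration of the conversion (assuming chunkers.py is importable), we can construct a tiny chunk tree with conlltags2tree() and pass it through:

>>> from nltk.chunk.util import conlltags2tree
>>> from chunkers import chunk_trees2train_chunks
>>> tree = conlltags2tree([('the', 'DT', 'B-NP'), ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')])
>>> chunk_trees2train_chunks([tree])
[[(('the', 'DT'), 'B-NP'), (('dog', 'NN'), 'I-NP'), (('barked', 'VBD'), 'O')]]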

Next, we need a feature detector function to pass into ClassifierBasedTagger. Our default feature detector function, prev_next_pos_iob(), knows that the list of tokens is really a list of (word, pos) tuples, and can use that to return a feature set suitable for a classifier. In fact, any feature detector function used with the ClassifierChunker class (defined next) should recognize that tokens are a list of (word, pos) tuples, and have the same function signature as prev_next_pos_iob(). To give the classifier as much information as we can, this feature set contains the current, previous, and next word and part-of-speech tag, along with the previous IOB tag:

def prev_next_pos_iob(tokens, index, history):
  word, pos = tokens[index]

  if index == 0:
    prevword, prevpos, previob = ('<START>',)*3
  else:
    prevword, prevpos = tokens[index-1]
    previob = history[index-1]

  if index == len(tokens) - 1:
    nextword, nextpos = ('<END>',)*2
  else:
    nextword, nextpos = tokens[index+1]

  feats = {
    'word': word,
    'pos': pos,
    'nextword': nextword,
    'nextpos': nextpos,
    'prevword': prevword,
    'prevpos': prevpos,
    'previob': previob
  }
  return feats
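
For instance, calling prev_next_pos_iob() on a small hand-made token list produces a feature dictionary like this (the key order of the printed dictionary may vary):

>>> from chunkers import prev_next_pos_iob
>>> tokens = [('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD')]
>>> prev_next_pos_iob(tokens, 1, ['B-NP'])
{'word': 'dog', 'pos': 'NN', 'nextword': 'barked', 'nextpos': 'VBD', 'prevword': 'the', 'prevpos': 'DT', 'previob': 'B-NP'}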

Now, we can define the ClassifierChunker class, which uses an internal ClassifierBasedTagger with features extracted using prev_next_pos_iob() and training sentences from chunk_trees2train_chunks(). As a subclass of ChunkParserI, it implements the parse() method, which converts the ((w, t), c) tuples produced by the internal tagger into Trees using conlltags2tree():

class ClassifierChunker(ChunkParserI):
  def __init__(self, train_sents, feature_detector=prev_next_pos_iob, **kwargs):
    if not feature_detector:
      feature_detector = self.feature_detector

    train_chunks = chunk_trees2train_chunks(train_sents)
    self.tagger = ClassifierBasedTagger(train=train_chunks,
      feature_detector=feature_detector, **kwargs)

  def parse(self, tagged_sent):
    if not tagged_sent: return None
    chunks = self.tagger.tag(tagged_sent)
    return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])
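
If train_chunks and test_chunks from the previous recipe are no longer defined, they can be recreated along these lines (the first 3,000 chunked sentences for training, mirroring the previous recipe's split):

>>> from nltk.corpus import treebank_chunk
>>> train_chunks = treebank_chunk.chunked_sents()[:3000]
>>> test_chunks = treebank_chunk.chunked_sents()[3000:]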

Using the same train_chunks and test_chunks from the treebank_chunk corpus in the previous recipe, we can evaluate this code from chunkers.py:

>>> from chunkers import ClassifierChunker
>>> chunker = ClassifierChunker(train_chunks)
>>> score = chunker.evaluate(test_chunks)
>>> score.accuracy()
0.9721733155838022
>>> score.precision()
0.9258838793383068
>>> score.recall()
0.9359016393442623

Compared to the TagChunker class, all the scores have gone up a bit. Let's see how it does on conll2000, using the same conll_train and conll_test splits as in the previous recipe.
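
If those splits are not already loaded, they come straight from the conll2000 corpus reader, as in the previous recipe:

>>> from nltk.corpus import conll2000
>>> conll_train = conll2000.chunked_sents('train.txt')
>>> conll_test = conll2000.chunked_sents('test.txt')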

>>> chunker = ClassifierChunker(conll_train)
>>> score = chunker.evaluate(conll_test)
>>> score.accuracy()
0.9264622074002153
>>> score.precision()
0.8737924310910219
>>> score.recall()
0.9007354620620346

This is much improved over the TagChunker class.

How it works...

Like the TagChunker class in the previous recipe, we are training a part-of-speech tagger for IOB tagging. But in this case, we want to include the word as a feature to power a classifier. By creating nested 2-tuples of the form ((word, pos), iob), we can pass the word through the tagger into our feature detector function. The chunk_trees2train_chunks() function produces these nested 2-tuples, and prev_next_pos_iob() is aware of them and uses each element as a feature. The following features are extracted:

  • The current word and part-of-speech tag
  • The previous word, part-of-speech tag, and IOB tag
  • The next word and part-of-speech tag

The arguments to prev_next_pos_iob() look the same as those of the feature_detector() method of the ClassifierBasedTagger class: tokens, index, and history. But this time, tokens will be a list of (word, pos) 2-tuples, and history will be a list of IOB tags. The special feature values <START> and <END> are used if there are no previous or next tokens.

The ClassifierChunker class uses an internal ClassifierBasedTagger with prev_next_pos_iob() as its default feature_detector. The results from the tagger, which are in the same nested 2-tuple form, are then reformatted into (word, pos, iob) 3-tuples and converted into a final Tree using conlltags2tree().

There's more...

You can use your own feature detector function by passing it into the ClassifierChunker class as feature_detector. The tokens argument will contain a list of (word, tag) tuples, and history will be a list of the previous IOB tags found.
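
For example, here is a hypothetical feature detector that keeps the default features but adds the suffix of the current word as an extra feature; any function with this signature can be passed in the same way:

def suffix_pos_iob(tokens, index, history):
  # Start from the default features, then add the last three
  # characters of the current word as an additional feature.
  feats = prev_next_pos_iob(tokens, index, history)
  word, _ = tokens[index]
  feats['suffix3'] = word[-3:].lower()
  return feats

You would then train with ClassifierChunker(train_chunks, feature_detector=suffix_pos_iob).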

Using a different classifier builder

The ClassifierBasedTagger class defaults to using NaiveBayesClassifier.train as its classifier_builder. But you can use any classifier you want by overriding the classifier_builder keyword argument. Here's an example using MaxentClassifier.train:

>>> from nltk.classify import MaxentClassifier
>>> builder = lambda toks: MaxentClassifier.train(toks, trace=0, max_iter=10, min_lldelta=0.01)
>>> me_chunker = ClassifierChunker(train_chunks, classifier_builder=builder)
>>> score = me_chunker.evaluate(test_chunks)
>>> score.accuracy()
0.9743204362949285
>>> score.precision()
0.9334423548650859
>>> score.recall()
0.9357377049180328

Instead of using MaxentClassifier.train directly, I wrapped it in a lambda so that its output is quiet (trace=0) and it finishes in a reasonable amount of time (max_iter=10, min_lldelta=0.01). As you can see, the scores are slightly different compared to using the NaiveBayesClassifier class.

Note

The MaxentClassifier score values mentioned earlier were computed with the environment variable PYTHONHASHSEED=0. If you use a different value, or do not set this environment variable, your score values may differ.

See also

The previous recipe, Training a tagger-based chunker, introduced the idea of using a part-of-speech tagger for training a chunker. The Classifier-based tagging recipe in Chapter 4, Part-of-speech Tagging, describes ClassifierBasedPOSTagger, which is a subclass of ClassifierBasedTagger. And in Chapter 7, Text Classification, we'll cover classification in detail.
