Unlike most part-of-speech taggers, the ClassifierBasedTagger class learns from features. That means we can create a ClassifierChunker class that can learn from both the words and part-of-speech tags, instead of only the part-of-speech tags as the TagChunker class does.
For the ClassifierChunker class, we don't want to discard the words from the training sentences as we did in the previous recipe. Instead, to remain compatible with the 2-tuple (word, pos) format required for training a ClassifierBasedTagger class, we convert the (word, pos, iob) 3-tuples from tree2conlltags() into ((word, pos), iob) 2-tuples using the chunk_trees2train_chunks() function. This code can be found in chunkers.py:
from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger

def chunk_trees2train_chunks(chunk_sents):
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
    return [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
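To see what this conversion produces, here's a quick standalone sketch that applies the same comprehension to a hand-written CoNLL-style sentence (a made-up example, standing in for tree2conlltags() output, so no corpus is needed):

```python
# Hand-written (word, pos, iob) triples, standing in for tree2conlltags() output
tag_sent = [('the', 'DT', 'B-NP'), ('cat', 'NN', 'I-NP'), ('sat', 'VBD', 'O')]

# The same comprehension used inside chunk_trees2train_chunks()
train_chunk = [((w, t), c) for (w, t, c) in tag_sent]
print(train_chunk)
# [(('the', 'DT'), 'B-NP'), (('cat', 'NN'), 'I-NP'), (('sat', 'VBD'), 'O')]
```

Each (word, pos) pair becomes the "word" seen by the tagger, while the IOB tag remains the label to predict.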
Next, we need a feature detector function to pass into ClassifierBasedTagger. Our default feature detector function, prev_next_pos_iob(), knows that the list of tokens is really a list of (word, pos) tuples, and can use that to return a feature set suitable for a classifier. In fact, any feature detector function used with the ClassifierChunker class (defined next) should recognize that tokens are a list of (word, pos) tuples, and have the same function signature as prev_next_pos_iob(). To give the classifier as much information as we can, this feature set contains the current, previous, and next word and part-of-speech tag, along with the previous IOB tag:
def prev_next_pos_iob(tokens, index, history):
    word, pos = tokens[index]

    if index == 0:
        prevword, prevpos, previob = ('<START>',)*3
    else:
        prevword, prevpos = tokens[index-1]
        previob = history[index-1]

    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>',)*2
    else:
        nextword, nextpos = tokens[index+1]

    feats = {
        'word': word,
        'pos': pos,
        'nextword': nextword,
        'nextpos': nextpos,
        'prevword': prevword,
        'prevpos': prevpos,
        'previob': previob
    }
    return feats
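To make the boundary behavior concrete, here's a standalone check of the feature sets produced at the start and end of a toy two-token sentence (the definition is repeated in the snippet so it runs on its own; the sentence is made up):

```python
def prev_next_pos_iob(tokens, index, history):
    # Same definition as above, repeated so this snippet is self-contained
    word, pos = tokens[index]
    if index == 0:
        prevword, prevpos, previob = ('<START>',)*3
    else:
        prevword, prevpos = tokens[index-1]
        previob = history[index-1]
    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>',)*2
    else:
        nextword, nextpos = tokens[index+1]
    return {'word': word, 'pos': pos, 'nextword': nextword,
            'nextpos': nextpos, 'prevword': prevword,
            'prevpos': prevpos, 'previob': previob}

tokens = [('the', 'DT'), ('cat', 'NN')]

print(prev_next_pos_iob(tokens, 0, []))
# At index 0 there is no previous token, so prevword, prevpos, and
# previob are all '<START>', while nextword/nextpos come from tokens[1]

print(prev_next_pos_iob(tokens, 1, ['B-NP']))
# At the last index, nextword/nextpos are '<END>'; prevword/prevpos
# come from tokens[0] and previob from the history list
```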
Now, we can define the ClassifierChunker class, which uses an internal ClassifierBasedTagger with features extracted using prev_next_pos_iob() and training sentences from chunk_trees2train_chunks(). As a subclass of ChunkParserI, it implements the parse() method, which converts the ((w, t), c) tuples produced by the internal tagger into Trees using conlltags2tree():
class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, feature_detector=prev_next_pos_iob, **kwargs):
        if not feature_detector:
            feature_detector = self.feature_detector

        train_chunks = chunk_trees2train_chunks(train_sents)
        self.tagger = ClassifierBasedTagger(train=train_chunks,
            feature_detector=feature_detector, **kwargs)

    def parse(self, tagged_sent):
        if not tagged_sent:
            return None

        chunks = self.tagger.tag(tagged_sent)
        return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])
Using the same train_chunks and test_chunks from the treebank_chunk corpus in the previous recipe, we can evaluate this code from chunkers.py:
>>> from chunkers import ClassifierChunker
>>> chunker = ClassifierChunker(train_chunks)
>>> score = chunker.evaluate(test_chunks)
>>> score.accuracy()
0.9721733155838022
>>> score.precision()
0.9258838793383068
>>> score.recall()
0.9359016393442623
Compared to the TagChunker class, all the scores have gone up a bit. Let's see how it does on conll2000:
>>> chunker = ClassifierChunker(conll_train)
>>> score = chunker.evaluate(conll_test)
>>> score.accuracy()
0.9264622074002153
>>> score.precision()
0.8737924310910219
>>> score.recall()
0.9007354620620346
This is much improved over the TagChunker class.
Like the TagChunker class in the previous recipe, we are training a part-of-speech tagger for IOB tagging. But in this case, we want to include the word as a feature to power the classifier. By creating nested 2-tuples of the form ((word, pos), iob), we can pass the word through the tagger into our feature detector function. The chunk_trees2train_chunks() function produces these nested 2-tuples, and prev_next_pos_iob() is aware of them and uses each element as a feature. The extracted features are the current word and part-of-speech tag, the previous word, part-of-speech tag, and IOB tag, and the next word and part-of-speech tag.
The arguments to prev_next_pos_iob() look the same as those of the feature_detector() method of the ClassifierBasedTagger class: tokens, index, and history. But this time, tokens will be a list of (word, pos) 2-tuples, and history will be a list of IOB tags. The special feature values <START> and <END> are used if there are no previous or next tokens.
The ClassifierChunker class uses an internal ClassifierBasedTagger with prev_next_pos_iob() as its default feature_detector. The results from the tagger, which are in the same nested 2-tuple form, are then reformatted into 3-tuples to return a final Tree using conlltags2tree().
You can use your own feature detector function by passing it into the ClassifierChunker class as feature_detector. The tokens argument will contain a list of (word, tag) tuples, and history will be a list of the previous IOB tags found.
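As a sketch of what such a custom feature detector might look like (the name pos_only_iob is purely illustrative and not part of chunkers.py), here's one that ignores the word entirely and uses only the part-of-speech tag plus the previous IOB tag, while keeping the same signature as prev_next_pos_iob():

```python
def pos_only_iob(tokens, index, history):
    # Hypothetical detector: same (tokens, index, history) signature,
    # but drops all word-based features
    word, pos = tokens[index]
    previob = history[index - 1] if index > 0 else '<START>'
    return {'pos': pos, 'previob': previob}

# It would then be passed in when constructing the chunker, e.g.:
# chunker = ClassifierChunker(train_chunks, feature_detector=pos_only_iob)
```

A detector this sparse would likely score worse than prev_next_pos_iob(), but it shows the contract: take the nested tokens and history, return a dict of feature name to value.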
The ClassifierBasedTagger class defaults to using NaiveBayesClassifier.train as its classifier_builder. But you can use any classifier you want by overriding the classifier_builder keyword argument. Here's an example using MaxentClassifier.train:
>>> from nltk.classify import MaxentClassifier
>>> builder = lambda toks: MaxentClassifier.train(toks, trace=0,
...     max_iter=10, min_lldelta=0.01)
>>> me_chunker = ClassifierChunker(train_chunks, classifier_builder=builder)
>>> score = me_chunker.evaluate(test_chunks)
>>> score.accuracy()
0.9743204362949285
>>> score.precision()
0.9334423548650859
>>> score.recall()
0.9357377049180328
Instead of using MaxentClassifier.train directly, I wrapped it in a lambda so that its output is quiet (trace=0) and it finishes in a reasonable amount of time. As you can see, the scores are slightly different compared to using the NaiveBayesClassifier class.
The previous recipe, Training a tagger-based chunker, introduced the idea of using a part-of-speech tagger for training a chunker. The Classifier-based tagging recipe in Chapter 4, Part-of-speech Tagging, describes ClassifierBasedPOSTagger, which is a subclass of ClassifierBasedTagger. And in Chapter 7, Text Classification, we'll cover classification in detail.