Classifier-based tagging

The ClassifierBasedPOSTagger class uses classification to do part-of-speech tagging. Features are extracted from words, and then passed to an internal classifier. The classifier classifies the features and returns a label, in this case, a part-of-speech tag. Classification will be covered in detail in Chapter 7, Text Classification.

The ClassifierBasedPOSTagger class is a subclass of ClassifierBasedTagger that implements a feature detector combining many of the techniques of the previous taggers into a single feature set. The feature detector finds suffixes of several lengths, does some regular expression matching, and looks at the unigram, bigram, and trigram history to produce a fairly complete set of features for each word. The feature sets it produces are used both to train the internal classifier and to classify words into part-of-speech tags.
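To make the idea of a combined feature set concrete, here is a minimal sketch of a feature detector in the same spirit. It is not NLTK's actual implementation: the function names, the feature names, and the regular expressions in SHAPE_PATTERNS are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns, for illustration only; not NLTK's actual list.
SHAPE_PATTERNS = [
    ('number', re.compile(r'^\d+(?:[.,]\d+)*$')),
    ('punct', re.compile(r'^\W+$')),
    ('upcase', re.compile(r'^[A-Z]')),
    ('downcase', re.compile(r'^[a-z]')),
]

def word_shape(word):
    """Return the name of the first pattern that matches the word."""
    for name, pattern in SHAPE_PATTERNS:
        if pattern.match(word):
            return name
    return 'other'

def sketch_feature_detector(tokens, index, history):
    """Combine suffix, regular expression, and history features for one word.

    history holds the tags already assigned to tokens[0:index].
    """
    word = tokens[index]
    return {
        'word': word,
        # Suffixes of several lengths.
        'suffix1': word[-1:],
        'suffix2': word[-2:],
        'suffix3': word[-3:],
        # Regular expression matching.
        'shape': word_shape(word),
        # Previous word and previous tags (bigram and trigram history).
        'prevword': tokens[index - 1] if index > 0 else None,
        'prevtag': history[index - 1] if index > 0 else None,
        'prevprevtag': history[index - 2] if index > 1 else None,
    }

features = sketch_feature_detector(['The', 'dogs', 'barked'], 2, ['DT', 'NNS'])
```

Each word ends up described by a small dict, and it is this dict, rather than the raw word, that the classifier sees.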

How to do it...

The basic usage of the ClassifierBasedPOSTagger class is much like any other SequentialBackoffTagger. You pass in training sentences, it trains an internal classifier, and you get a very accurate tagger.

>>> from nltk.tag.sequential import ClassifierBasedPOSTagger
>>> tagger = ClassifierBasedPOSTagger(train=train_sents)
>>> tagger.evaluate(test_sents)


Notice a slight modification to initialization: train_sents must be passed in as the train keyword argument.

How it works...

The ClassifierBasedPOSTagger class inherits from ClassifierBasedTagger and only implements a feature_detector() method. All the training and tagging is done in ClassifierBasedTagger. It defaults to training a NaiveBayesClassifier class with the given training data. Once this classifier is trained, it is used to classify word features produced by the feature_detector() method.
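The training step can be pictured roughly as follows. This is a simplified sketch rather than NLTK's code: build_training_pairs and prev_tag_detector are hypothetical names, and the sketch assumes that the history passed to the feature detector is the list of tags already seen in the current sentence.

```python
def build_training_pairs(tagged_sents, feature_detector):
    """Turn tagged sentences into (featureset, tag) pairs for a classifier."""
    pairs = []
    for sent in tagged_sents:
        tokens = [word for word, _ in sent]
        history = []  # tags assigned so far in this sentence
        for index, (_, tag) in enumerate(sent):
            featureset = feature_detector(tokens, index, history)
            pairs.append((featureset, tag))
            history.append(tag)
    return pairs

def prev_tag_detector(tokens, index, history):
    """Tiny detector (hypothetical): the word plus the previous tag."""
    return {'word': tokens[index], 'prevtag': history[-1] if history else None}

toy_sents = [
    [('the', 'DT'), ('dog', 'NN')],
    [('dogs', 'NNS'), ('bark', 'VBP')],
]
pairs = build_training_pairs(toy_sents, prev_tag_detector)
```

The resulting list of pairs is the kind of input a classifier's training function expects. At tagging time, the same detector is called for each word, and the classifier's predicted label becomes the tag.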


The ClassifierBasedTagger class is often the most accurate tagger, but it's also one of the slowest taggers. If speed is an issue, you should stick with a BrillTagger class based on a backoff chain of NgramTagger subclasses and other simple taggers.

The ClassifierBasedTagger class also inherits from FeaturesetTaggerI (which is just an empty class), so the inheritance tree runs from ClassifierBasedPOSTagger up through ClassifierBasedTagger to both SequentialBackoffTagger and FeaturesetTaggerI.

There's more...

You can use a different classifier instead of NaiveBayesClassifier by passing in your own classifier_builder function. For example, to use a MaxentClassifier, you'd do the following:

>>> from nltk.classify import MaxentClassifier
>>> me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
>>> me_tagger.evaluate(test_sents)


The MaxentClassifier class takes even longer to train than NaiveBayesClassifier. If you have SciPy and NumPy installed, training will be faster than normal, but still slower than NaiveBayesClassifier.
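Since classifier_builder is just a function that receives the training feature sets and returns a trained classifier, you can wrap a training function to fix some of its options in advance. The sketch below shows the pattern with functools.partial; toy_train is a hypothetical stand-in so the example runs without NLTK, and its max_iter and trace parameters are assumptions for illustration.

```python
from functools import partial

def toy_train(train_feats, max_iter=100, trace=3):
    """Stand-in trainer (hypothetical) that only records what it received."""
    return {
        'num_examples': len(train_feats),
        'max_iter': max_iter,
        'trace': trace,
    }

# Pre-bind the options; the result still takes the training data
# as its single positional argument, which is what a builder needs.
quiet_builder = partial(toy_train, max_iter=10, trace=0)

model = quiet_builder([({'word': 'dog'}, 'NN'), ({'word': 'the'}, 'DT')])
```

With NLTK installed, something like partial(MaxentClassifier.train, max_iter=10, trace=0) could be passed as classifier_builder in the same way, assuming your version of MaxentClassifier.train accepts those keyword arguments. Limiting the number of iterations is one way to trade some accuracy for a shorter training time.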

Detecting features with a custom feature detector

If you want to do your own feature detection, there are two ways to do it:

  1. Subclass ClassifierBasedTagger and implement a feature_detector() method.
  2. Pass a function as the feature_detector keyword argument into ClassifierBasedTagger at initialization.

Either way, you need a feature detection method that can take the same arguments as choose_tag(): tokens, index, history. But instead of returning a tag, you return a dict of key-value features, where the key is the feature name and the value is the feature value. A very simple example would be a unigram feature detector:

def unigram_feature_detector(tokens, index, history):
  return {'word': tokens[index]}

Then, using the second method, you'd pass this into ClassifierBasedTagger as feature_detector.

>>> from nltk.tag.sequential import ClassifierBasedTagger
>>> from tag_util import unigram_feature_detector
>>> tagger = ClassifierBasedTagger(train=train_sents, feature_detector=unigram_feature_detector)
>>> tagger.evaluate(test_sents)

Setting a cutoff probability

Because a classifier will always return the best result it can, passing in a backoff tagger is useless unless you also pass in a cutoff_prob argument to specify the probability threshold for classification. Then, if the probability of the chosen tag is less than cutoff_prob, the backoff tagger will be used. Here's an example using the DefaultTagger class as the backoff, and setting cutoff_prob to 0.3:

>>> from nltk.tag import DefaultTagger
>>> default = DefaultTagger('NN')
>>> tagger = ClassifierBasedPOSTagger(train=train_sents, backoff=default, cutoff_prob=0.3)
>>> tagger.evaluate(test_sents)

So, we get a slight increase in accuracy if the ClassifierBasedPOSTagger class uses the DefaultTagger class whenever its tag probability is less than 30%.
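The cutoff logic itself is simple, and can be sketched in plain Python. This is an illustration of the idea, not NLTK's code: choose_with_cutoff and tag_with_backoff are hypothetical names, and the tag probabilities are made-up numbers standing in for a classifier's probability distribution.

```python
def choose_with_cutoff(tag_probs, cutoff_prob=None):
    """Return the most probable tag, or None if it falls below the cutoff."""
    best_tag = max(tag_probs, key=tag_probs.get)
    if cutoff_prob is None or tag_probs[best_tag] >= cutoff_prob:
        return best_tag
    return None  # signals that the backoff tagger should be consulted

def tag_with_backoff(tag_probs, cutoff_prob, default_tag='NN'):
    """Fall back to a default tag when the classifier is not confident."""
    tag = choose_with_cutoff(tag_probs, cutoff_prob)
    return default_tag if tag is None else tag

# Made-up distributions for illustration.
confident = {'VB': 0.7, 'NN': 0.2, 'JJ': 0.1}
uncertain = {'VB': 0.28, 'JJ': 0.26, 'NN': 0.25, 'RB': 0.21}

confident_tag = tag_with_backoff(confident, 0.3)
uncertain_tag = tag_with_backoff(uncertain, 0.3)
no_cutoff_tag = choose_with_cutoff(uncertain)
```

With a cutoff of 0.3, the confident distribution keeps its best tag, while the uncertain one falls through to the default. Without a cutoff, the best guess is always returned, which is why a backoff tagger alone has no effect.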

Using a pre-trained classifier

If you want to use a classifier that's already been trained, then you can pass it into ClassifierBasedTagger or ClassifierBasedPOSTagger as the classifier keyword argument. In this case, the classifier_builder argument is ignored and no training takes place. However, you must ensure that the classifier was trained on, and can classify, feature sets of the kind produced by whatever feature_detector() method you use.

