The ClassifierBasedPOSTagger class uses classification to do part-of-speech tagging. Features are extracted from words and then passed to an internal classifier. The classifier classifies the features and returns a label; in this case, the label is a part-of-speech tag. Classification will be covered in detail in Chapter 7, Text Classification.
The ClassifierBasedPOSTagger class is a subclass of ClassifierBasedTagger that implements a feature detector combining many of the techniques of the previous taggers into a single feature set. The feature detector finds multiple-length suffixes, does some regular expression matching, and looks at the unigram, bigram, and trigram history to produce a fairly complete set of features for each word. The feature sets it produces are used to train the internal classifier, and then to classify words into part-of-speech tags.
The basic usage of the ClassifierBasedPOSTagger class is much like that of any other SequentialBackoffTagger. You pass in training sentences, it trains an internal classifier, and you get a very accurate tagger.
>>> from nltk.tag.sequential import ClassifierBasedPOSTagger
>>> tagger = ClassifierBasedPOSTagger(train=train_sents)
>>> tagger.evaluate(test_sents)
0.9309734513274336
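To get a feel for the feature sets described earlier, you can call the trained tagger's feature_detector() method directly. The sketch below assumes the (tokens, index, history) signature shared with choose_tag(); the exact feature names in the returned dict may vary between NLTK versions, but they include the word itself, its suffixes, its shape, and the previous tags.

>>> sent = ['The', 'dog', 'barked', '.']
>>> feats = tagger.feature_detector(sent, 1, ['AT'])  # features for 'dog', given previous tag 'AT'
>>> sorted(feats.keys())  # word, lowercase form, suffixes, shape, and previous-word/tag features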
The ClassifierBasedPOSTagger class inherits from ClassifierBasedTagger and only implements a feature_detector() method. All the training and tagging is done in ClassifierBasedTagger. It defaults to training a NaiveBayesClassifier with the given training data. Once this classifier is trained, it is used to classify the word features produced by the feature_detector() method.
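You can check which classifier was trained by calling the tagger's classifier() accessor on the tagger trained earlier (the object address in the repr will vary):

>>> tagger.classifier()
<nltk.classify.naivebayes.NaiveBayesClassifier object at 0x...>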
The ClassifierBasedTagger class also inherits from FeaturesetTaggerI (which is just an empty class), giving an inheritance tree in which ClassifierBasedPOSTagger inherits from ClassifierBasedTagger, which in turn inherits from both FeaturesetTaggerI and SequentialBackoffTagger.
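A quick way to verify this hierarchy yourself is with issubclass(); this assumes FeaturesetTaggerI is importable from nltk.tag.api, as in recent NLTK versions:

>>> from nltk.tag.sequential import (ClassifierBasedPOSTagger,
...     ClassifierBasedTagger, SequentialBackoffTagger)
>>> from nltk.tag.api import FeaturesetTaggerI
>>> issubclass(ClassifierBasedPOSTagger, ClassifierBasedTagger)
True
>>> issubclass(ClassifierBasedTagger, FeaturesetTaggerI)
True
>>> issubclass(ClassifierBasedTagger, SequentialBackoffTagger)
True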
You can use a different classifier instead of NaiveBayesClassifier by passing in your own classifier_builder function. For example, to use a MaxentClassifier, you'd do the following:
>>> from nltk.classify import MaxentClassifier
>>> me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
>>> me_tagger.evaluate(test_sents)
0.9258363911072739
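The classifier_builder argument only needs to be a callable that takes a list of training feature sets and returns a trained classifier, so you can wrap a train() method to control its parameters. Here's a sketch (the quiet_maxent_builder name is made up for illustration) that silences MaxentClassifier's iteration log and caps the number of training iterations:

>>> from nltk.classify import MaxentClassifier
>>> def quiet_maxent_builder(train_feats):
...     # trace=0 suppresses per-iteration output; max_iter limits training time
...     return MaxentClassifier.train(train_feats, trace=0, max_iter=10)
>>> me_tagger = ClassifierBasedPOSTagger(train=train_sents,
...     classifier_builder=quiet_maxent_builder)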
If you want to do your own feature detection, there are two ways to do it:

1. Subclass ClassifierBasedTagger and implement a feature_detector() method (a sketch of this approach appears after the example below).
2. Pass a function as the feature_detector keyword argument into ClassifierBasedTagger at initialization.

Either way, you need a feature detection method that can take the same arguments as choose_tag(): tokens, index, and history. But instead of returning a tag, you return a dict of key-value features, where the key is the feature name and the value is the feature value. A very simple example is a unigram feature detector (found in tag_util.py):
def unigram_feature_detector(tokens, index, history):
    return {'word': tokens[index]}
Then, using the second method, you'd pass this into ClassifierBasedTagger as feature_detector.
>>> from nltk.tag.sequential import ClassifierBasedTagger
>>> from tag_util import unigram_feature_detector
>>> tagger = ClassifierBasedTagger(train=train_sents, feature_detector=unigram_feature_detector)
>>> tagger.evaluate(test_sents)
0.8733865745737104
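The first method, subclassing, produces an equivalent tagger. Here's a minimal sketch (the class name UnigramFeatureTagger is made up for illustration) that implements the same unigram feature in a feature_detector() method, just as ClassifierBasedPOSTagger does with its richer feature set:

from nltk.tag.sequential import ClassifierBasedTagger

class UnigramFeatureTagger(ClassifierBasedTagger):
    # The overridden method replaces the feature_detector keyword
    # argument used in the previous example.
    def feature_detector(self, tokens, index, history):
        return {'word': tokens[index]}

>>> tagger = UnigramFeatureTagger(train=train_sents)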
Because a classifier will always return the best result it can, passing in a backoff tagger is useless unless you also pass in a cutoff_prob argument to specify the probability threshold for classification. Then, if the probability of the chosen tag is less than cutoff_prob, the backoff tagger will be used. Here's an example using the DefaultTagger class as the backoff, and setting cutoff_prob to 0.3:
>>> default = DefaultTagger('NN')
>>> tagger = ClassifierBasedPOSTagger(train=train_sents, backoff=default, cutoff_prob=0.3)
>>> tagger.evaluate(test_sents)
0.9311029570472696
So, we get a slight increase in accuracy if the ClassifierBasedPOSTagger class uses the DefaultTagger class whenever its tag probability is less than 30%.
If you want to use a classifier that's already been trained, you can pass it into ClassifierBasedTagger or ClassifierBasedPOSTagger as the classifier keyword argument. In this case, the classifier_builder argument is ignored and no training takes place. However, you must ensure that the classifier has been trained on, and can classify, the feature sets produced by whatever feature_detector() method you use.
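Here's a sketch of what that might look like, assuming the unigram_feature_detector from tag_util.py shown earlier. The training feature sets are built as (featureset, tag) pairs, mirroring what the tagger would do internally during training:

from nltk.classify import NaiveBayesClassifier
from nltk.tag.sequential import ClassifierBasedTagger
from tag_util import unigram_feature_detector

# Build (featureset, tag) training pairs from the tagged sentences.
train_feats = []
for sent in train_sents:
    untagged = [word for word, tag in sent]
    history = []
    for index, (word, tag) in enumerate(sent):
        train_feats.append((unigram_feature_detector(untagged, index, history), tag))
        history.append(tag)

# Train the classifier separately, then hand it to the tagger.
nb_classifier = NaiveBayesClassifier.train(train_feats)
tagger = ClassifierBasedTagger(classifier=nb_classifier,
                               feature_detector=unigram_feature_detector)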
Chapter 7, Text Classification, will cover classification in depth.