One way to improve classification performance is to combine classifiers. The simplest way to combine multiple classifiers is voting: each classifier predicts a label, and whichever label gets the most votes wins. For this style of voting, it's best to have an odd number of classifiers so that, with two labels, there can be no ties. This means combining at least three classifiers together. The individual classifiers should also use different algorithms; the idea is that multiple algorithms are better than one, and the combination of many can compensate for individual bias. However, combining a poorly performing classifier with better performing classifiers is generally not a good idea, because the poor performance of one classifier can bring the total accuracy down.
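To make the tie problem concrete, here is a small illustration with made-up votes (the labels and vote counts are hypothetical, not from any trained classifier):

from collections import Counter

# Four voters split evenly between two labels -- majority voting cannot
# pick a principled winner.
votes = Counter(['pos', 'neg', 'pos', 'neg'])
print(votes.most_common(1))  # [('pos', 2)], but only because 'pos' was counted first

# With three voters and two labels, one label always gets at least two
# votes, so a tie is impossible.
votes = Counter(['pos', 'neg', 'pos'])
print(votes.most_common(1))  # [('pos', 2)]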
As we need to have at least three trained classifiers to combine, we are going to use a NaiveBayesClassifier class, a DecisionTreeClassifier class, and a MaxentClassifier class, all trained on the highest information words of the movie_reviews corpus. These were all trained in the previous recipe; the example below also includes sk_classifier, the best performing sklearn classifier trained earlier, for a total of four votes (with an even number of voters, any tie is broken arbitrarily by FreqDist.max()). We will combine these classifiers with voting.
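For reference, here is a rough sketch of the setup this recipe assumes. The exact training options come from the previous recipes and are omitted here; the bare train() calls and the choice of LinearSVC for sk_classifier are illustrative assumptions, and train_feats stands for the high information training feature sets:

from nltk.classify import NaiveBayesClassifier, DecisionTreeClassifier, MaxentClassifier
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

# All four classifiers are trained on the same high information feature sets.
nb_classifier = NaiveBayesClassifier.train(train_feats)
dt_classifier = DecisionTreeClassifier.train(train_feats)
me_classifier = MaxentClassifier.train(train_feats, trace=0)
sk_classifier = SklearnClassifier(LinearSVC()).train(train_feats)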
In the classification.py module, there is a MaxVoteClassifier class:
import itertools
from nltk.classify import ClassifierI
from nltk.probability import FreqDist

class MaxVoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers
        # The label set is the union of every classifier's labels.
        self._labels = sorted(set(itertools.chain(*[c.labels() for c in classifiers])))

    def labels(self):
        return self._labels

    def classify(self, feats):
        # Tally one vote per classifier, then return the majority label.
        counts = FreqDist()
        for classifier in self._classifiers:
            counts[classifier.classify(feats)] += 1
        return counts.max()
To create it, you pass in the classifiers you want to combine as individual arguments. Once created, it works just like any other classifier. Classification takes roughly as long as running every included classifier in sequence, but the result should generally be at least as accurate as any individual classifier.
>>> from classification import MaxVoteClassifier
>>> mv_classifier = MaxVoteClassifier(nb_classifier, dt_classifier, me_classifier, sk_classifier)
>>> mv_classifier.labels()
['neg', 'pos']
>>> accuracy(mv_classifier, test_feats)
0.894
>>> mv_precisions, mv_recalls = precision_recall(mv_classifier, test_feats)
>>> mv_precisions['pos']
0.9156118143459916
>>> mv_precisions['neg']
0.8745247148288974
>>> mv_recalls['pos']
0.868
>>> mv_recalls['neg']
0.92
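Here, accuracy() is nltk.classify.util.accuracy, and precision_recall() is the helper defined in the earlier precision and recall recipe. As a reminder of what such a helper might look like, here is a minimal sketch built on the set-based scoring functions in nltk.metrics (the exact helper defined earlier may differ slightly):

import collections
from nltk.metrics import precision, recall

def precision_recall(classifier, testfeats):
    # Build reference (gold) and test (predicted) index sets per label.
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        testsets[classifier.classify(feats)].add(i)
    # Compare each label's reference set against its predicted set.
    precisions = {label: precision(refsets[label], testsets[label]) for label in refsets}
    recalls = {label: recall(refsets[label], testsets[label]) for label in refsets}
    return precisions, recalls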
These metrics are about on par with the best sklearn classifiers, as well as the MaxentClassifier and NaiveBayesClassifier classes with high information features. Some numbers are slightly better, some worse. It's likely that a significant improvement to the DecisionTreeClassifier class could produce better numbers.
The MaxVoteClassifier class extends the nltk.classify.ClassifierI interface, which requires the implementation of at least two methods:

- The labels() method must return a list of possible labels. This will be the union of the labels() returned by each classifier passed in at initialization.
- The classify() method takes a single feature set and returns a label. The MaxVoteClassifier class iterates over its classifiers, calls classify() on each of them, and records each result as a vote in a FreqDist variable. The label with the most votes is returned using FreqDist.max() (see the sketch after this list).

(Inheritance diagram: MaxVoteClassifier extends nltk.classify.ClassifierI.)
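To see the vote tally in isolation, here is the same FreqDist pattern used by classify(), run on hypothetical labels standing in for classifier outputs:

from nltk.probability import FreqDist

votes = FreqDist()
# Hypothetical outputs from three classifiers: two vote 'pos', one votes 'neg'.
for predicted_label in ['pos', 'neg', 'pos']:
    votes[predicted_label] += 1
print(votes.max())  # 'pos' -- the majority label wins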
While it doesn't check for this, the MaxVoteClassifier class assumes that all the classifiers passed in at initialization use the same labels. Breaking this assumption may lead to odd behavior.
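If you want to guard against mismatched labels, one option is to verify them at initialization. The following subclass is a hypothetical variant, not part of the book's classification.py:

class CheckedMaxVoteClassifier(MaxVoteClassifier):
    def __init__(self, *classifiers):
        # Hypothetical safeguard: refuse to combine classifiers whose
        # label sets differ, rather than silently skewing the vote.
        # Assumes at least one classifier is passed in.
        label_sets = [set(c.labels()) for c in classifiers]
        if any(labels != label_sets[0] for labels in label_sets):
            raise ValueError('all classifiers must use the same labels')
        super().__init__(*classifiers)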
In the previous recipe, we trained a NaiveBayesClassifier class, a MaxentClassifier class, and a DecisionTreeClassifier class using only the highest information words. In the next recipe, we will use the reuters corpus and combine many binary classifiers in order to create a multi-label classifier.