Combining classifiers with voting

One way to improve classification performance is to combine classifiers. The simplest way to combine multiple classifiers is to use voting, and choose whichever label gets the most votes. For this style of voting with two labels, it's best to have an odd number of classifiers so that there can be no ties. This means combining at least three classifiers. The individual classifiers should also use different algorithms; the idea is that classifiers built on different algorithms tend to make different mistakes, so the combination can compensate for individual bias. However, combining a poorly performing classifier with better performing classifiers is generally not a good idea, because its wrong votes can bring the total accuracy down.
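To make the idea concrete, here is a minimal sketch of majority voting over made-up predictions. The label lists and the helper names majority_vote and accuracy_of are hypothetical and exist only for illustration; they are not part of NLTK or of this recipe's module.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical gold labels and predictions for five documents.
gold = ['pos', 'neg', 'pos', 'neg', 'pos']
clf_a = ['neg', 'neg', 'pos', 'neg', 'pos']  # wrong on document 0
clf_b = ['pos', 'pos', 'pos', 'neg', 'pos']  # wrong on document 1
clf_c = ['pos', 'neg', 'pos', 'pos', 'pos']  # wrong on document 3

# One vote per classifier for each document.
combined = [majority_vote(votes) for votes in zip(clf_a, clf_b, clf_c)]

def accuracy_of(predicted):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(accuracy_of(clf_a), accuracy_of(combined))
```

Each classifier here is wrong on a different document, so the other two always outvote it. If the classifiers made the same mistakes, voting would not help, which is why using different algorithms matters.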

Getting ready

As we need to have at least three trained classifiers to combine, we are going to use a NaiveBayesClassifier class, a DecisionTreeClassifier class, and a MaxentClassifier class, all trained on the highest information words of the movie_reviews corpus. These were all trained in the previous recipe, so we will combine these three classifiers with voting.

How to do it...

In the classification.py module, there is a MaxVoteClassifier class:

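import itertools
from nltk.classify import ClassifierI
from nltk.probability import FreqDist

class MaxVoteClassifier(ClassifierI):
  def __init__(self, *classifiers):
    # Classifiers are passed as separate positional arguments.
    self._classifiers = classifiers
    # The combined label list is the sorted union of every classifier's labels.
    self._labels = sorted(set(itertools.chain(*[c.labels() for c in classifiers])))

  def labels(self):
    return self._labels

  def classify(self, feats):
    counts = FreqDist()

    # Each classifier casts one vote for the label it predicts.
    for classifier in self._classifiers:
      counts[classifier.classify(feats)] += 1

    # Return the label with the most votes.
    return counts.max()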

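To create it, you pass in the classifiers that you want to combine as separate arguments. Once created, it works just like any other classifier. Classification takes roughly as long as running every member classifier in turn, so about three times longer with three classifiers. In return, the combination is often about as accurate as its best member, though this is not guaranteed. Note that the following session combines four classifiers, so ties are possible.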

>>> from classification import MaxVoteClassifier
>>> mv_classifier = MaxVoteClassifier(nb_classifier, dt_classifier, me_classifier, sk_classifier)
>>> mv_classifier.labels()
['neg', 'pos']
>>> accuracy(mv_classifier, test_feats)
>>> mv_precisions, mv_recalls = precision_recall(mv_classifier, test_feats)
>>> mv_precisions['pos']
>>> mv_precisions['neg']
>>> mv_recalls['pos']
>>> mv_recalls['neg']

These metrics are roughly on par with the best sklearn classifiers, as well as the MaxentClassifier and NaiveBayesClassifier classes with high information features. Some numbers are slightly better, some worse. A significant improvement to the DecisionTreeClassifier class would likely produce better numbers.

How it works...

The MaxVoteClassifier class extends the nltk.classify.ClassifierI interface, which requires the implementation of at least two methods:

  • The labels() method must return a list of possible labels. This will be the union of the labels returned by the labels() method of each classifier passed in at initialization.
  • The classify() method takes a single feature set and returns a label. The MaxVoteClassifier class iterates over its classifiers and calls classify() on each of them, recording their label as a vote in a FreqDist variable. The label with the most votes is returned using FreqDist.max().
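The two steps above can be sketched without NLTK installed. This is only an illustration under stated assumptions: FixedLabelClassifier is a hypothetical stand-in for a trained classifier, and collections.Counter is used in place of FreqDist (which is a Counter subclass in NLTK 3), so most_common(1) plays the role of FreqDist.max().

```python
import itertools
from collections import Counter

class FixedLabelClassifier:
    """Hypothetical stand-in that knows two labels and always predicts one."""
    def __init__(self, prediction):
        self._prediction = prediction

    def labels(self):
        return ['pos', 'neg']

    def classify(self, feats):
        return self._prediction

classifiers = [
    FixedLabelClassifier('pos'),
    FixedLabelClassifier('neg'),
    FixedLabelClassifier('pos'),
]

# Step 1: the combined labels are the sorted union of all label lists.
labels = sorted(set(itertools.chain(*[c.labels() for c in classifiers])))

# Step 2: every classifier casts one vote for the feature set.
feats = {'great': True}  # made-up feature set
votes = Counter()
for classifier in classifiers:
    votes[classifier.classify(feats)] += 1

# The label with the highest count wins.
winner = votes.most_common(1)[0][0]
print(labels, dict(votes), winner)
```

With two votes for pos and one for neg, the winner is pos. With an even number of classifiers the counts can be equal, and which label comes back then depends on how the counter breaks ties.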

[Figure: inheritance diagram, MaxVoteClassifier extends ClassifierI]

While it doesn't check for this, the MaxVoteClassifier class assumes that all the classifiers passed in at initialization use the same labels. Breaking this assumption may lead to odd behavior.
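If you want the assumption enforced, one option is to compare label sets before combining. The following is a minimal sketch, not part of the recipe's module: check_same_labels and StubClassifier are hypothetical names, and the only assumption about a classifier is that it has a labels() method.

```python
def check_same_labels(classifiers):
    """Return the shared sorted label list, or raise if label sets differ."""
    if not classifiers:
        raise ValueError('at least one classifier is required')

    label_sets = [frozenset(c.labels()) for c in classifiers]

    # More than one distinct set means the classifiers disagree on labels.
    if len(set(label_sets)) > 1:
        raise ValueError('classifiers use different labels: %s'
                         % sorted(sorted(s) for s in set(label_sets)))

    return sorted(label_sets[0])

class StubClassifier:
    """Hypothetical classifier that only reports its labels."""
    def __init__(self, labels):
        self._labels = labels

    def labels(self):
        return self._labels

same = [StubClassifier(['pos', 'neg']), StubClassifier(['neg', 'pos'])]
print(check_same_labels(same))

mixed = [StubClassifier(['pos', 'neg']), StubClassifier(['spam', 'ham'])]
try:
    check_same_labels(mixed)
except ValueError as err:
    print('rejected:', err)
```

A check like this could be called at the start of __init__() so that a mismatch fails early instead of producing votes for labels that only some classifiers can ever choose.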

See also

In the previous recipe, we trained a NaiveBayesClassifier class, a MaxentClassifier class, and a DecisionTreeClassifier class using only the highest information words. In the next recipe, we will use the reuters corpus and combine many binary classifiers in order to create a multi-label classifier.

