Measuring precision and recall of a classifier

In addition to accuracy, there are a number of other metrics used to evaluate classifiers. Two of the most common are precision and recall. To understand these two metrics, we must first understand false positives and false negatives. A false positive happens when a classifier assigns a label to a feature set that shouldn't have gotten it. A false negative happens when a classifier fails to assign a label to a feature set that should have it. In a binary classifier, these errors happen at the same time: every misclassification is a false positive for one label and a false negative for the other.

Here's an example: the classifier classifies a movie review as pos when it should have been neg. This counts as a false positive for the pos label, and a false negative for the neg label. If the classifier had correctly guessed neg, then it would count as a true positive for the neg label, and a true negative for the pos label.
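The bookkeeping in this example can be sketched in a few lines of plain Python. The confusion_counts() helper and the four labeled reviews below are hypothetical, made up only for illustration; they are not part of NLTK or of the recipe's code.

```python
from collections import Counter

def confusion_counts(gold, guessed, label):
    """Count tp, fp, fn, and tn for one label (hypothetical helper)."""
    counts = Counter()
    for correct, observed in zip(gold, guessed):
        if observed == label and correct == label:
            counts['tp'] += 1  # guessed the label, and it was right
        elif observed == label:
            counts['fp'] += 1  # guessed the label, but it was wrong
        elif correct == label:
            counts['fn'] += 1  # should have guessed the label, but didn't
        else:
            counts['tn'] += 1  # correctly did not guess the label
    return counts

# Made-up gold labels and guesses for four reviews.
# The first review is neg but was guessed as pos.
gold = ['neg', 'neg', 'pos', 'pos']
guessed = ['pos', 'neg', 'pos', 'pos']

pos_counts = confusion_counts(gold, guessed, 'pos')
neg_counts = confusion_counts(gold, guessed, 'neg')
```

The single mistake on the first review shows up twice: once as a false positive in pos_counts and once as a false negative in neg_counts.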

How does this apply to precision and recall? Precision is the lack of false positives, and recall is the lack of false negatives. As you will see, these two metrics are often in competition: the more precise a classifier is, the lower its recall tends to be, and vice versa.

How to do it...

Let's calculate the precision and recall of the NaiveBayesClassifier class we trained in the Training a Naive Bayes classifier recipe. The precision_recall() function in the classification module looks like this:

import collections
from nltk import metrics

def precision_recall(classifier, testfeats):
  refsets = collections.defaultdict(set)
  testsets = collections.defaultdict(set)

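  for i, (feats, label) in enumerate(testfeats):
    # Store the index in the reference set for the known label.
    refsets[label].add(i)
    observed = classifier.classify(feats)
    # Store the same index in the test set for the guessed label.
    testsets[observed].add(i)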

  precisions = {}
  recalls = {}

  for label in classifier.labels():
    precisions[label] = metrics.precision(refsets[label], testsets[label])
    recalls[label] = metrics.recall(refsets[label], testsets[label])

  return precisions, recalls

This function takes two arguments:

  • The trained classifier
  • Labeled test features, also known as a gold standard

These are the same arguments you pass to accuracy(). The precision_recall() function returns two dictionaries; the first holds the precision for each label, and the second holds the recall for each label. Here's an example usage with nb_classifier and test_feats we created in the Training a Naive Bayes classifier recipe earlier:

>>> from classification import precision_recall
>>> nb_precisions, nb_recalls = precision_recall(nb_classifier, test_feats)
>>> nb_precisions['pos']
>>> nb_precisions['neg']
>>> nb_recalls['pos']
>>> nb_recalls['neg']

This tells us that while the NaiveBayesClassifier class can correctly identify most of the pos feature sets (high recall), it also classifies many of the neg feature sets as pos (low precision). This behavior contributes to high precision but low recall for the neg label: because the neg label isn't given often (low recall), when it is given, it's very likely to be correct (high precision). One possible conclusion is that certain common words are biased towards the pos label, but occur frequently enough in the neg feature sets to cause misclassifications. To correct this behavior, we will use only the most informative words in the next recipe, Calculating high information words.

How it works...

To calculate precision and recall, we must build two sets for each label. The first set is known as the reference set, and contains all the correct values. The second set is called the test set, and contains the values guessed by the classifier. These two sets are compared to calculate the precision or recall for each label.

Precision is defined as the size of the intersection of both sets divided by the size of the test set. In other words, it is the fraction of the test set that was guessed correctly. In Python, the code is float(len(reference.intersection(test))) / len(test).

Recall is the size of the intersection of both sets divided by the size of the reference set, or the fraction of the reference set that was guessed correctly. The Python code is float(len(reference.intersection(test))) / len(reference).
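To make the two definitions concrete, here is a small worked example using plain Python sets. The precision() and recall() functions below are our own minimal sketches written for Python 3 (so no float() call is needed), not the nltk.metrics implementations, and the index sets are made up. Returning None for an empty set is an assumption made here to avoid dividing by zero.

```python
def precision(reference, test):
    """Fraction of the test (guessed) set that is also in the reference set."""
    if not test:
        return None  # assumption: undefined when nothing was guessed
    return len(reference & test) / len(test)

def recall(reference, test):
    """Fraction of the reference (correct) set that is also in the test set."""
    if not reference:
        return None  # assumption: undefined when nothing is correct
    return len(reference & test) / len(reference)

# Made-up indexes for one label.
reference = {0, 1, 2, 3}       # feature sets that really have the label
test = {1, 2, 3, 4, 5, 6}      # feature sets the classifier gave the label

p = precision(reference, test)  # 3 shared indexes out of 6 guessed
r = recall(reference, test)     # 3 shared indexes out of 4 correct
```

Here the classifier finds three of the four correct items (recall of 0.75), but only half of its six guesses are right (precision of 0.5).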

The precision_recall() function iterates over the labeled test features and classifies each one. We store the numeric index of the feature set (starting with 0) in the reference set for the known label, and also store the index in the test set for the guessed label. If the classifier guesses pos but the known label is neg, then the index is stored in the reference set for neg and the test set for pos.


We use the numeric index because the feature sets aren't hashable, and we need a unique value for each feature set.
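The following sketch shows the set-building step on its own. StubClassifier and the three feature sets are hypothetical stand-ins, invented so the example runs without a trained model; only the classify() method name matches what precision_recall() expects.

```python
import collections

class StubClassifier:
    """Hypothetical stand-in: guesses pos whenever 'great' is a feature."""
    def classify(self, feats):
        return 'pos' if feats.get('great') else 'neg'

# Made-up labeled test features; the third one will be misclassified.
test_feats = [
    ({'great': True, 'plot': True}, 'pos'),
    ({'boring': True, 'plot': True}, 'neg'),
    ({'great': True, 'boring': True}, 'neg'),
]

# A feature set is a dict, and dicts can't be hashed or stored in a set.
try:
    hash(test_feats[0][0])
    hashable = True
except TypeError:
    hashable = False

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
classifier = StubClassifier()

for i, (feats, label) in enumerate(test_feats):
    refsets[label].add(i)                          # known label
    testsets[classifier.classify(feats)].add(i)    # guessed label
```

Index 2 ends up in the reference set for neg and the test set for pos, which is exactly the mismatch that lowers precision for pos and recall for neg.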

The nltk.metrics package contains functions for calculating both precision and recall, so all we really have to do is build the sets and then call the appropriate function.

There's more...

Let's try it with the MaxentClassifier class trained with the GIS algorithm in the Training a maximum entropy classifier recipe:

>>> me_precisions, me_recalls = precision_recall(me_classifier, test_feats)
>>> me_precisions['pos']
>>> me_precisions['neg']
>>> me_recalls['pos']
>>> me_recalls['neg'] 

This classifier is just as biased as the NaiveBayesClassifier class. Chances are it would be less biased if allowed to train for more iterations and/or approach a smaller log likelihood change. Now, let's try the SklearnClassifier class of NuSVC from the previous recipe, Training scikit-learn classifiers:

>>> sk_precisions, sk_recalls = precision_recall(sk_classifier, test_feats)
>>> sk_precisions['pos']
>>> sk_precisions['neg']
>>> sk_recalls['pos']
>>> sk_recalls['neg']

In this case, the label bias is much less significant, because the SklearnClassifier class of NuSVC weights its features according to its own internal model. This is also true for logistic regression and many of the other scikit-learn algorithms. Words that occur primarily in a single label are more significant and will get higher weights in the model. Words that are common to both labels will get lower weights, as they are less significant.


The F-measure is defined as the weighted harmonic mean of precision and recall. If p is the precision, and r is the recall, the formula is:

1/(alpha/p + (1-alpha)/r)

Here, alpha is a weighting constant that defaults to 0.5. You can use nltk.metrics.f_measure() to get the F-measure. It takes the same arguments as the precision() and recall() functions: a reference set and a test set. It's often used instead of accuracy to measure a classifier, because if either precision or recall is very low, it will be reflected in the F-measure, but not necessarily in the accuracy. However, I find precision and recall to be much more useful metrics by themselves, as the F-measure can obscure the kinds of imbalances we saw with the NaiveBayesClassifier class.
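As a rough illustration of the formula, here is a sketch that computes the F-measure directly from a precision and a recall value. This f_measure() is our own helper that takes two numbers; it is not nltk.metrics.f_measure(), which takes a reference set and a test set. Returning 0.0 when either input is zero is an assumption made here to avoid dividing by zero, and the input values are made up.

```python
def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean of precision p and recall r (illustrative helper)."""
    if p == 0 or r == 0:
        return 0.0  # assumption: treat the score as zero rather than divide by zero
    return 1.0 / (alpha / p + (1 - alpha) / r)

# An imbalanced classifier: low precision, perfect recall.
imbalanced = f_measure(0.5, 1.0)

# A balanced classifier with the same precision and recall.
balanced = f_measure(2 / 3, 2 / 3)
```

Both calls give a score of about 0.667, even though the first classifier is far more lopsided than the second. This is the kind of imbalance that a single F-measure value can hide.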

See also

In the Training a Naive Bayes classifier recipe, we collected training and testing feature sets and trained the NaiveBayesClassifier class. The MaxentClassifier class was trained in the Training a maximum entropy classifier recipe, and the SklearnClassifier class was trained in the Training scikit-learn classifiers recipe. In the next recipe, we will explore eliminating the less significant words, and use only the high information words to create our feature sets.

