In addition to accuracy, there are a number of other metrics used to evaluate classifiers. Two of the most common are precision and recall. To understand these two metrics, we must first understand false positives and false negatives. A false positive happens when a classifier assigns a feature set a label it shouldn't have. A false negative happens when a classifier fails to assign a label to a feature set that should have it. In a binary classifier, these two errors occur together: a false positive for one label is simultaneously a false negative for the other.
Here's an example: the classifier classifies a movie review as pos when it should have been neg. This counts as a false positive for the pos label, and a false negative for the neg label. If the classifier had correctly guessed neg, then it would count as a true positive for the neg label, and a true negative for the pos label.
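The bookkeeping above can be made concrete with a small sketch. The function name error_counts and the label-pair input format are hypothetical, just for illustration: it tallies true positives, false positives, and false negatives per label from (correct, guessed) pairs, matching the movie review example.

```python
# Hypothetical sketch: tallying true positives, false positives, and false
# negatives per label, given (correct, guessed) label pairs.
from collections import Counter

def error_counts(pairs):
    counts = Counter()
    for correct, guessed in pairs:
        if correct == guessed:
            counts['tp_%s' % correct] += 1  # true positive for that label
        else:
            counts['fp_%s' % guessed] += 1  # false positive for the guessed label
            counts['fn_%s' % correct] += 1  # false negative for the correct label
    return counts

# A neg review guessed as pos counts once as fp_pos and once as fn_neg.
pairs = [('pos', 'pos'), ('neg', 'pos'), ('neg', 'neg'), ('pos', 'pos')]
print(error_counts(pairs))
```

Note that a correct neg guess is a true positive for neg, and is at the same time a true negative for pos; the sketch only tallies the per-label true positives.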
How does this apply to precision and recall? Precision is the lack of false positives, and recall is the lack of false negatives. As you will see, these two metrics are often in competition: the more precise a classifier is, the lower its recall tends to be, and vice versa.
Let's calculate the precision and recall of the NaiveBayesClassifier class we trained in the Training a Naive Bayes classifier recipe. The precision_recall() function in classification.py looks like this:
import collections
from nltk import metrics

def precision_recall(classifier, testfeats):
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    precisions = {}
    recalls = {}

    for label in classifier.labels():
        precisions[label] = metrics.precision(refsets[label], testsets[label])
        recalls[label] = metrics.recall(refsets[label], testsets[label])

    return precisions, recalls
This function takes two arguments: the trained classifier and the labeled test features. These are the same arguments you pass to accuracy(). The precision_recall() function returns two dictionaries; the first holds the precision for each label, and the second holds the recall for each label. Here's an example usage with the nb_classifier and test_feats we created in the Training a Naive Bayes classifier recipe earlier:
>>> from classification import precision_recall
>>> nb_precisions, nb_recalls = precision_recall(nb_classifier, test_feats)
>>> nb_precisions['pos']
0.6413612565445026
>>> nb_precisions['neg']
0.9576271186440678
>>> nb_recalls['pos']
0.98
>>> nb_recalls['neg']
0.452
This tells us that while the NaiveBayesClassifier class can correctly identify most of the pos feature sets (high recall), it also classifies many of the neg feature sets as pos (low precision for pos). This behavior leads to high precision but low recall for the neg label: the neg label isn't assigned often (low recall), but when it is, it's very likely to be correct (high precision). The conclusion could be that there are certain common words that are biased towards the pos label, but occur frequently enough in the neg feature sets to cause misclassifications. To correct this behavior, we will use only the most informative words in the next recipe, Calculating high information words.
To calculate precision and recall, we must build two sets for each label. The first set is known as the reference set, and contains all the correct values. The second set is called the test set, and contains the values guessed by the classifier. These two sets are compared to calculate the precision or recall for each label.
Precision is defined as the size of the intersection of both sets divided by the size of the test set; in other words, the percentage of the test set that was guessed correctly. In Python, the code is float(len(reference.intersection(test))) / len(test).
Recall is the size of the intersection of both sets divided by the size of the reference set, or the percentage of the reference set that was guessed correctly. The Python code is float(len(reference.intersection(test))) / len(reference).
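Here is a minimal sketch of those two formulas using plain Python sets; the index values are hypothetical feature set positions, invented for illustration:

```python
# Reference set: indices of items whose correct label is 'pos'.
# Test set: indices of items the classifier guessed as 'pos'.
reference = {0, 1, 2, 3, 4}
test = {0, 1, 2, 7, 8, 9}

correct = reference.intersection(test)  # items guessed correctly: {0, 1, 2}

precision = float(len(correct)) / len(test)       # 3 / 6 = 0.5
recall = float(len(correct)) / len(reference)     # 3 / 5 = 0.6
print(precision, recall)
```

A large test set with few correct guesses drags precision down, while a large reference set with few correct guesses drags recall down.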
The precision_recall() function in classification.py iterates over the labeled test features and classifies each one. We store the numeric index of the feature set (starting with 0) in the reference set for the known correct label, and also store the index in the test set for the guessed label. If the classifier guesses pos but the correct label is neg, then the index is stored in the reference set for neg and in the test set for pos.
The nltk.metrics package contains functions for calculating both precision and recall, so all we really have to do is build the sets and then call the appropriate function.
Let's try it with the MaxentClassifier class trained with the GIS algorithm in the Training a maximum entropy classifier recipe:
>>> me_precisions, me_recalls = precision_recall(me_classifier, test_feats)
>>> me_precisions['pos']
0.6456692913385826
>>> me_precisions['neg']
0.9663865546218487
>>> me_recalls['pos']
0.984
>>> me_recalls['neg']
0.46
This classifier is just as biased as the NaiveBayesClassifier class. Chances are it would be less biased if allowed to train for more iterations and/or until it reached a smaller log likelihood change. Now, let's try the SklearnClassifier class wrapping NuSVC from the previous recipe, Training scikit-learn classifiers:
>>> sk_precisions, sk_recalls = precision_recall(sk_classifier, test_feats)
>>> sk_precisions['pos']
0.9063829787234042
>>> sk_precisions['neg']
0.8603773584905661
>>> sk_recalls['pos']
0.852
>>> sk_recalls['neg']
0.912
In this case, the label bias is much less significant, and the reason is that the SklearnClassifier class wrapping NuSVC weighs its features according to its own internal model. This is also true for logistic regression and many of the other scikit-learn algorithms. Words that are more significant are those that occur primarily with a single label, and they will get higher weights in the model. Words that are common to both labels will get lower weights, as they are less significant.
The F-measure is defined as the weighted harmonic mean of precision and recall. If p is the precision, and r is the recall, the formula is:
1/(alpha/p + (1-alpha)/r)
Here, alpha is a weighting constant that defaults to 0.5. You can use nltk.metrics.f_measure() to get the F-measure. It takes the same arguments as the precision() and recall() functions: a reference set and a test set. It's often used instead of accuracy to measure a classifier, because if either precision or recall is very low, it will be reflected in the F-measure, but not necessarily in the accuracy. However, I find precision and recall to be much more useful metrics by themselves, as the F-measure can obscure the kinds of imbalances we saw with the NaiveBayesClassifier class.
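The formula can be sketched directly in Python. This is a hypothetical helper, not the nltk implementation; note that with the default alpha of 0.5, it reduces to the familiar harmonic mean 2*p*r / (p + r):

```python
# F-measure as the weighted harmonic mean of precision p and recall r.
def f_measure(p, r, alpha=0.5):
    return 1.0 / (alpha / p + (1 - alpha) / r)

# The pos precision and recall values seen earlier for NaiveBayesClassifier.
p, r = 0.6413612565445026, 0.98
print(f_measure(p, r))
print(2 * p * r / (p + r))  # equal to the above when alpha is 0.5
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier with very low recall cannot hide behind high precision, which is exactly why the F-measure is preferred over raw accuracy for imbalanced labels.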
In the Training a Naive Bayes classifier recipe, we collected training and testing feature sets and trained the NaiveBayesClassifier class. The MaxentClassifier class was trained in the Training a maximum entropy classifier recipe, and the SklearnClassifier class was trained in the Training scikit-learn classifiers recipe. In the next recipe, we will explore eliminating the less significant words, and use only the high information words to create our feature sets.