Looking behind accuracy – precision and recall

Let's step back and think again about what we are trying to achieve. Actually, we do not need a classifier that perfectly predicts both good and bad answers, which is what we have measured up to now using accuracy. If we can tune the classifier to be particularly good at predicting one class, we could adapt the feedback to the user accordingly. If, for example, we had a classifier that was always right when it predicted an answer to be bad, we would give no feedback until the classifier detected an answer to be bad. Conversely, if the classifier excelled at predicting answers to be good, we could show helpful comments to the user at the beginning and remove them once the classifier said that the answer is a good one.

To find out which situation we are in, we have to understand how to measure precision and recall. And to understand that, we have to look into the four distinct classification results as they are described in the following table:
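
                        Classified as positive    Classified as negative
In reality positive     True positive (TP)        False negative (FN)
In reality negative     False positive (FP)       True negative (TN)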

For instance, if the classifier predicts an instance to be positive and the instance is indeed positive, this is a true positive instance. If, on the other hand, the classifier misclassified that instance, saying that it is negative while in reality it was positive, that instance is said to be a false negative.

What we want is a high success rate when we predict a post to be either good or bad, but not necessarily for both classes at once. That is, of the instances that we predict to be positive, as many as possible should be true positives. This is what precision captures:
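
Precision = TP / (TP + FP)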

If, instead, our goal had been to detect as many good or bad answers as possible, that is, to miss as few positive instances as we can, we would be more interested in recall:
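
Recall = TP / (TP + FN)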

So, how can we now optimize for precision? So far, we have always used 0.5 as the threshold to decide whether an answer is good or not. What we can do now is count the number of TP, FP, and FN while we vary that threshold between 0 and 1. With those counts, we can then plot precision over recall.

The handy precision_recall_curve() function from the metrics module does all the calculations for us:

>>> from sklearn.metrics import precision_recall_curve
>>> # X_test and y_test would come from KFold's train/test split
>>> # precision_recall_curve() expects scores, not hard 0/1 class
>>> # predictions, so we pass the predicted probability of the good class
>>> precision, recall, thresholds = precision_recall_curve(
...     y_test, clf.predict_proba(X_test)[:, 1])
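
From these three arrays, curves like the ones that follow can be drawn; a minimal sketch using matplotlib (the styling of the original figures is omitted):

>>> import matplotlib.pyplot as plt
>>> # one point per threshold, traced from high recall to high precision
>>> plt.plot(recall, precision)
>>> plt.xlabel("Recall")
>>> plt.ylabel("Precision")
>>> plt.show()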

Predicting one class with acceptable performance does not always mean that the classifier is also acceptable at predicting the other class. This can be seen in the following two plots, where we plot the precision/recall curves for classifying bad (the left graph) and good (the right graph) answers:

In the graphs, we have also included a much better measure of a classifier's performance: the area under the curve (AUC). It can be understood as the average precision of the classifier and is a great way of comparing different classifiers.
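
scikit-learn can compute this number for us; a minimal sketch, reusing the precision and recall arrays from above (average_precision_score() offers a closely related summary computed directly from the scores):

>>> from sklearn.metrics import auc
>>> # area under the precision/recall curve we just computed
>>> auc(recall, precision)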

For predicting good answers, we can get 80% precision at a recall of 20%, while we achieve less than 10% recall if we demand 80% precision when detecting poor answers.

Let's find out what threshold we need for that. As we trained many classifiers on different folds (remember, we iterated over KFold() a couple of pages back), we need to retrieve the classifier that was neither too bad nor too good in order to get a realistic view. Let's call it the medium clone:

>>> import numpy as np
>>> # precisions, recalls, thresholds, and scores were collected per fold
>>> # while iterating over KFold(); pick the fold with the median score
>>> medium = np.argsort(scores)[len(scores) // 2]
>>> precisions = precisions[medium]
>>> recalls = recalls[medium]
>>> # prepend a 0 so that thresholds aligns with precisions and recalls
>>> thresholds = np.hstack(([0], thresholds[medium]))
>>> for precision in np.arange(0.77, 0.8, 0.01):
...     thresh_idx = precisions >= precision
...     print("P=%.2f R=%.2f thresh=%.2f" % (precision,
...           recalls[thresh_idx][0], thresholds[thresh_idx][0]))
P=0.77 R=0.25 thresh=0.62
P=0.78 R=0.23 thresh=0.65
P=0.79 R=0.21 thresh=0.66
P=0.80 R=0.13 thresh=0.74

Setting the threshold at 0.66, we see that we can still achieve a precision of 79% at detecting good answers when we accept a low recall of 21%. That means that we would detect only about one in five good answers. But of that fifth of good answers that we do manage to detect, we would be reasonably sure that they are indeed good. For the rest, we could then politely display additional hints on how to improve answers in general.
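
Putting that threshold to work at prediction time is then only a matter of comparing it against the predicted probabilities; a minimal sketch, where X_new stands in for the feature vectors of newly written answers (a name introduced here purely for illustration):

>>> thresh = 0.66
>>> # flag an answer as good only when the classifier is confident enough;
>>> # X_new is a hypothetical feature array of new answers
>>> is_good = clf.predict_proba(X_new)[:, 1] >= thresh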
