We want to be a little more formal when we talk about a good classifier. What does that mean? The performance of a classifier is a measure of its effectiveness. The simplest performance measure is accuracy: given a classifier and an evaluation dataset, it measures the proportion of instances correctly classified by the classifier. First, let's test the accuracy on the training set:
>>> from sklearn import metrics
>>> y_train_pred = clf.predict(X_train)
>>> print(metrics.accuracy_score(y_train, y_train_pred))
0.821428571429
This figure tells us that 82 percent of the training set instances are correctly classified by our classifier.
Probably the most important thing you should learn from this chapter is that measuring accuracy on the training set is a really bad idea. You have built your model using this data, so it is possible that the model fits that data well but performs poorly on future, previously unseen data, and handling unseen data is its whole purpose. This phenomenon is called overfitting, and you will see it now and again while you read this book. If you measure based on your training data, you will never detect overfitting. So, never measure based on your training data.
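To make the warning concrete, here is a minimal sketch in plain Python; it is not part of the chapter's scikit-learn code. The hypothetical `MemorizingClassifier` simply stores every training pair and falls back to the most common training label for anything it has not seen, and the toy data are invented for illustration.

```python
from collections import Counter


class MemorizingClassifier:
    """Hypothetical classifier that memorizes its training set (illustration only)."""

    def fit(self, X, y):
        # Store every (features -> label) pair seen during training.
        self.lookup = {tuple(x): label for x, label in zip(X, y)}
        # Fallback for unseen instances: the most frequent training label.
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.lookup.get(tuple(x), self.default) for x in X]


def accuracy(y_true, y_pred):
    """Proportion of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy data: one feature; the label is meant to be 1 above 5.
X_train_toy = [[1], [2], [3], [7], [8], [9]]
y_train_toy = [0, 0, 0, 1, 1, 0]  # the last label is deliberate noise
X_test_toy = [[0], [4], [6], [10]]
y_test_toy = [0, 0, 1, 1]

memorizer = MemorizingClassifier().fit(X_train_toy, y_train_toy)
train_acc = accuracy(y_train_toy, memorizer.predict(X_train_toy))
test_acc = accuracy(y_test_toy, memorizer.predict(X_test_toy))
print(train_acc, test_acc)  # 1.0 0.5
```

The perfect training score says nothing about how the memorizer behaves on new data; only the held-out score reveals the problem.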
This is why we have reserved part of the original dataset (the testing partition)—we want to evaluate performance on previously unseen data. Let's check the accuracy again, now on the evaluation set (recall that it was already scaled):
>>> y_pred = clf.predict(X_test)
>>> print(metrics.accuracy_score(y_test, y_pred))
0.684210526316
We obtained an accuracy of 68 percent on our testing set. Usually, accuracy on the testing set is lower than accuracy on the training set, since the model was fit to the training set, not the testing set. Our goal will always be to produce models that avoid overfitting when trained on a training set, so that they have enough generalization power to also correctly model unseen data.
Accuracy on the test set is a good performance measure when the number of instances of each class is similar, that is, we have a uniform distribution of classes. But if you have a skewed distribution (say, 99 percent of the instances belong to one class), a classifier that always predicts the majority class could have an excellent performance in terms of accuracy despite the fact that it is an extremely naive method.
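The effect of a skewed class distribution is easy to reproduce. The sketch below uses an invented label set (99 instances of one class, 1 of the other) and a naive rule that always predicts the majority class; all names are illustrative.

```python
from collections import Counter

# Invented skewed label set: 99 instances of class 0 and 1 instance of class 1.
y_true_skewed = [0] * 99 + [1]

# A naive "classifier" that always predicts the most frequent class.
majority_class = Counter(y_true_skewed).most_common(1)[0][0]
y_pred_majority = [majority_class] * len(y_true_skewed)

majority_accuracy = sum(
    t == p for t, p in zip(y_true_skewed, y_pred_majority)
) / len(y_true_skewed)

# Fraction of the minority class (label 1) that the rule actually found.
minority_found = sum(
    1 for t, p in zip(y_true_skewed, y_pred_majority) if t == 1 and p == 1
)
minority_recall = minority_found / y_true_skewed.count(1)

print(majority_accuracy)  # 0.99
print(minority_recall)    # 0.0
```

An accuracy of 0.99 hides the fact that the rule never identifies a single instance of the rare class.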
Within scikit-learn, there are several evaluation functions; we will show three popular ones: precision, recall, and F1-score (or f-measure). They assume a binary classification problem and two classes—a positive one and a negative one. In our example, the positive class could be Iris setosa, while the other two will be combined into one negative class.
The F1-score is the harmonic mean of precision and recall. The harmonic mean is used instead of the arithmetic mean because the latter lets a high value for recall compensate for a low value for precision (and vice versa). With the harmonic mean, on the other hand, we will always get a low value if either precision or recall is low. For an interesting description of this issue, refer to the paper at http://www.cs.odu.edu/~mukka/cs795sum12dm/Lecturenotes/Day3/F-measure-YS-26Oct07.pdf
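A quick numeric comparison shows the difference between the two means. The precision and recall values below are invented, and the helper names are illustrative.

```python
def arithmetic_mean(a, b):
    return (a + b) / 2


def harmonic_mean(a, b):
    # Defined as 0 when either value is 0 to avoid dividing by zero.
    if a == 0 or b == 0:
        return 0.0
    return 2 * a * b / (a + b)


# Invented example: almost perfect recall, very poor precision.
example_precision, example_recall = 0.1, 0.9

print(arithmetic_mean(example_precision, example_recall))  # 0.5
print(harmonic_mean(example_precision, example_recall))    # about 0.18
```

The arithmetic mean reports a middling 0.5, while the harmonic mean stays close to the weaker of the two values.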
We can define these measures in terms of True and False, and Positives and Negatives:
| | Prediction: Positive | Prediction: Negative |
|---|---|---|
| Target class: Positive | True Positive (TP) | False Negative (FN) |
| Target class: Negative | False Positive (FP) | True Negative (TN) |
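With m being the sample size (that is, TP + TN + FP + FN), we have the following formulae:

$$\text{Accuracy} = \frac{TP + TN}{m}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$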
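As a sketch of how the four counts turn into the measures discussed here, the hypothetical helper `binary_metrics` below applies the standard definitions (precision as TP / (TP + FP), recall as TP / (TP + FN), F1 as their harmonic mean). The counts are invented, and the helper does not guard against zero denominators.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from the four confusion counts."""
    m = tp + fp + fn + tn                 # sample size
    accuracy = (tp + tn) / m
    precision = tp / (tp + fp)            # of the predicted positives, how many are right
    recall = tp / (tp + fn)               # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Invented counts for illustration only.
acc, prec, rec, f1 = binary_metrics(tp=40, fp=20, fn=10, tn=30)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.7 0.667 0.8 0.727
```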
Let's see it in practice:
>>> print(metrics.classification_report(y_test, y_pred, target_names=iris.target_names))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.43      0.27      0.33        11
  virginica       0.65      0.79      0.71        19

avg / total       0.66      0.68      0.66        38
We have computed precision, recall, and f1-score for each class and their average values. What we can see in this table is:

- The classifier obtained 1.0 precision and recall in the setosa class. This means that for precision, 100 percent of the instances that are classified as setosa are really setosa instances, and for recall, that 100 percent of the setosa instances were classified as setosa.
- In the versicolor class, the results are not as good: we have a precision of 0.43, that is, only 43 percent of the instances that are classified as versicolor are really versicolor instances. Also, for versicolor, we have a recall of 0.27, that is, only 27 percent of the versicolor instances are correctly classified.

Now we can see that our method (as we expected) is very good at predicting setosa, while it suffers when it has to separate the versicolor and virginica classes. The support value shows how many instances of each class we had in the testing set.
Another useful metric (especially for multi-class problems) is the confusion matrix: in its (i, j) cell, it shows the number of instances of class i that were predicted to be in class j. A good classifier will accumulate the values on the confusion matrix diagonal, where correctly classified instances belong.
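The counting behind a confusion matrix can be sketched in a few lines of plain Python. The helper `confusion_matrix_counts` and the toy labels are invented for illustration and assume classes are encoded as integers starting at 0; this is not scikit-learn's implementation.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes table: rows are true classes, columns are predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        # Cell (i, j) counts instances of class i predicted as class j.
        matrix[true_label][predicted_label] += 1
    return matrix


# Invented labels for three classes encoded as 0, 1, 2.
y_true_toy = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2, 2, 1, 2]

toy_matrix = confusion_matrix_counts(y_true_toy, y_pred_toy, 3)
for row in toy_matrix:
    print(row)
# [2, 0, 0]
# [0, 1, 2]
# [0, 1, 3]
```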
>>> print(metrics.confusion_matrix(y_test, y_pred))
[[ 8  0  0]
 [ 0  3  8]
 [ 0  4 15]]
Our classifier is never wrong in our evaluation set when it classifies class 0 (setosa) flowers. But when it faces class 1 and class 2 flowers (versicolor and virginica), it confuses them: 8 of the 11 versicolor instances are predicted as virginica, and 4 of the 19 virginica instances are predicted as versicolor. The confusion matrix gives us useful information about what types of errors the classifier is making.
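The per-class figures in the classification report can be recovered from this matrix: precision divides each diagonal entry by its column sum, and recall divides it by its row sum. A minimal sketch using the matrix values shown above (variable names are illustrative):

```python
# Confusion matrix from the listing: rows are true classes, columns are predictions.
confusion = [
    [8, 0, 0],    # setosa
    [0, 3, 8],    # versicolor
    [0, 4, 15],   # virginica
]
class_names = ["setosa", "versicolor", "virginica"]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))
overall_accuracy = correct / total

per_class = {}
for i, name in enumerate(class_names):
    true_positives = confusion[i][i]
    predicted_as_i = sum(confusion[r][i] for r in range(n_classes))  # column sum
    actually_i = sum(confusion[i])                                   # row sum
    per_class[name] = (
        round(true_positives / predicted_as_i, 2),  # precision
        round(true_positives / actually_i, 2),      # recall
    )

print(round(overall_accuracy, 3))  # 0.684
for name in class_names:
    print(name, per_class[name])
# setosa (1.0, 1.0)
# versicolor (0.43, 0.27)
# virginica (0.65, 0.79)
```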
To finish our evaluation process, we will introduce a very useful method known as cross-validation. As we explained before, we have to partition our dataset into a training set and a testing set. However, partitioning leaves fewer instances to train on, and, depending on the particular partition we make (usually at random), we can get either better or worse results. Cross-validation mitigates both problems, reducing the variance of the result and producing a more realistic score for our models. The usual steps for k-fold cross-validation are the following:

1. Partition the dataset into k different subsets.
2. Create k different models by training on k-1 subsets and testing on the remaining subset.
3. Measure the performance of each of the k models and take the average.
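The k-fold procedure (split the data into k subsets, train on k-1 of them, test on the remaining one, then average the k scores) can be sketched without any library. The helpers `k_fold_indices`, `cross_validate`, and `held_out_fraction` are hypothetical names, and the evaluation function is a placeholder where a real experiment would train and score a model.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index starting at position f.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, test_idx


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return scores, sum(scores) / len(scores)


def held_out_fraction(train_idx, test_idx):
    # Placeholder: a real evaluation would fit a model on train_idx and
    # return its accuracy on test_idx. Here we only report the split size.
    return len(test_idx) / (len(train_idx) + len(test_idx))


fold_scores, mean_fold_score = cross_validate(150, 5, held_out_fraction)
print(fold_scores)  # [0.2, 0.2, 0.2, 0.2, 0.2]
```

Every instance lands in exactly one test fold, so each one is used for testing once and for training k-1 times.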
Let's do that with our linear classifier. First, we will have to create a composite estimator made by a pipeline of the standardization and linear models. With this technique, we make sure that each iteration will standardize the data and then train/test on the transformed data. The Pipeline class is also useful to simplify the construction of more complex models that chain multiple transformations. We will choose to have k = 5 folds, so each time we will train on 80 percent of the data and test on the remaining 20 percent. Cross-validation, by default, uses accuracy as its performance measure, but we could select a different measure by passing any scorer function as an argument.
>>> # note: from scikit-learn 0.18 on, these helpers live in sklearn.model_selection
>>> from sklearn.cross_validation import cross_val_score, KFold
>>> from sklearn.pipeline import Pipeline
>>> from sklearn import preprocessing
>>> from sklearn.linear_model import SGDClassifier
>>> # create a composite estimator made by a pipeline of the standardization and the linear model
>>> clf = Pipeline([
...     ('scaler', preprocessing.StandardScaler()),
...     ('linear_model', SGDClassifier())
... ])
>>> # create a k-fold cross validation iterator of k=5 folds
>>> cv = KFold(X.shape[0], 5, shuffle=True, random_state=33)
>>> # by default the score used is the one returned by score method of the estimator (accuracy)
>>> scores = cross_val_score(clf, X, y, cv=cv)
>>> print(scores)
[ 0.66666667  0.93333333  0.66666667  0.7         0.6       ]
We obtained an array with the k scores. We can calculate the mean and the standard error to obtain a final figure:
>>> import numpy as np
>>> from scipy.stats import sem
>>> def mean_score(scores):
...     return "Mean score: {0:.3f} (+/- {1:.3f})".format(np.mean(scores), sem(scores))
...
>>> print(mean_score(scores))
Mean score: 0.713 (+/- 0.057)
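The standard error of the mean is the sample standard deviation of the fold scores divided by the square root of the number of folds. The plain-Python sketch below reproduces the figure from the five scores printed earlier; the variable names are illustrative.

```python
import math

# The five fold scores printed by cross_val_score in the listing.
cv_scores = [0.66666667, 0.93333333, 0.66666667, 0.7, 0.6]

k = len(cv_scores)
mean = sum(cv_scores) / k

# Sample variance with k - 1 in the denominator, which matches the
# default (ddof=1) used by scipy.stats.sem.
variance = sum((s - mean) ** 2 for s in cv_scores) / (k - 1)
standard_error = math.sqrt(variance) / math.sqrt(k)

summary = "Mean score: {0:.3f} (+/- {1:.3f})".format(mean, standard_error)
print(summary)  # Mean score: 0.713 (+/- 0.057)
```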
Our model has an average accuracy of 0.71, with a standard error of about 0.06 across the five folds.
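The cross-validation listing in this section uses the `sklearn.cross_validation` module, which was removed in scikit-learn 0.20. The following is a sketch of the same experiment against `sklearn.model_selection`, where `KFold` takes the number of splits instead of the number of samples. Two things here are assumptions rather than part of the original listing: that `X` holds the first two Iris features, and the fixed `random_state` on `SGDClassifier`. The exact scores will differ from those printed above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: X holds the first two Iris features; swap in your own X and y.
iris = load_iris()
X, y = iris.data[:, :2], iris.target

# Standardize inside the pipeline so each fold is scaled on its own training part.
clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_model", SGDClassifier(random_state=33)),  # seed added for repeatability
])

# KFold now takes the number of splits rather than the number of samples.
cv = KFold(n_splits=5, shuffle=True, random_state=33)

# With no scoring argument, the estimator's own score method (accuracy) is used.
scores = cross_val_score(clf, X, y, cv=cv)
print(scores)
```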