Multilabel classification

When your task is to predict more than a single label (for instance: What's the weather like today? Which flower is this? What's your job?), we call the problem multilabel classification. Multilabel classification is a very common task, and many performance metrics exist to evaluate classifiers on it. Of course, all of these measures can also be used in the case of binary classification. Now, let's see how they work by using a simple, real-world example:

In: from sklearn import datasets
    iris = datasets.load_iris()
    # No cross-validation for this simple example
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(iris.data,
                                                         iris.target,
                                                         test_size=0.50,
                                                         random_state=4)
    # Use a deliberately weak multiclass classifier
    from sklearn.tree import DecisionTreeClassifier
    classifier = DecisionTreeClassifier(max_depth=2)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    iris.target_names

Out: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Now, let's take a look at the measures that are commonly used in multilabel classification:

  • Confusion matrix: Before we describe the performance metrics for multilabel classification, let's take a look at the confusion matrix, a table that gives us an idea of what the misclassifications are for each class. Ideally, in a perfect classification, all the cells that are not on the diagonal should be 0s. In the following example, you will instead see that class 0 (setosa) is never misclassified, class 1 (versicolor) is misclassified as virginica three times, and class 2 (virginica) is misclassified as versicolor twice:
In: from sklearn import metrics
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)
    print(cm)

Out: [[30  0  0]
     [ 0 19  3]
     [ 0  2 21]]

In: import matplotlib.pyplot as plt
    img = plt.matshow(cm, cmap=plt.cm.autumn)
    plt.colorbar(img, fraction=0.045)
    # Annotate each cell with its count
    # (rows are true classes, columns are predicted classes)
    for row in range(cm.shape[0]):
        for col in range(cm.shape[1]):
            plt.text(col, row, "%d" % cm[row, col],
                     size=12, color='black', ha="center", va="center")
    plt.show()

The confusion matrix is represented graphically in this way:

  • Accuracy: Accuracy is the proportion of predicted labels that exactly match the real ones; in other words, it's the percentage of correctly classified labels overall:
In: print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Out: Accuracy: 0.933333333333
  • Precision: This is a measure taken from the information retrieval world. It counts the proportion of relevant results in the result set; equivalently, in a classification task, it is, for each class, the fraction of the samples predicted as that class that actually belong to it. The per-class results are then averaged over all of the classes:
In: print("Precision:", metrics.precision_score(y_test, y_pred, average='weighted'))

Out: Precision: 0.933333333333
  • Recall: This is another concept taken from information retrieval. It counts the proportion of relevant results in the result set, compared to all of the relevant items in the dataset; in a classification task, it is, for each class, the fraction of samples of that class that are correctly classified. Finally, the per-class results are averaged, just like in the following code:
In: print("Recall:", metrics.recall_score(y_test, y_pred, average='weighted'))

Out: Recall: 0.933333333333
  • F1 Score: This is the harmonic mean of precision and recall, and it is mostly used when dealing with unbalanced datasets, in order to reveal whether the classifier is performing well on all the classes (the sketch after this list shows how all of these measures can be recomputed directly from the confusion matrix):
In: print("F1 score:", metrics.f1_score(y_test, y_pred, average='weighted'))

Out: F1 score: 0.933267359393
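
To make the link with the confusion matrix explicit, here is a minimal sketch (not part of the original example; the variable names are only illustrative) that recomputes accuracy and the per-class precision, recall, and F1 directly from the cm array obtained earlier. Notice, for instance, that the diagonal of cm sums to 70 correct predictions out of 75 test samples, which matches the accuracy of about 0.933 printed above; the same per-class values can also be obtained from scikit-learn by passing average=None to the score functions.

In: import numpy as np
    # Rows of cm are true classes, columns are predicted classes
    accuracy = np.trace(cm) / np.sum(cm)                 # correct predictions / all samples
    per_class_precision = np.diag(cm) / cm.sum(axis=0)   # correct / all predicted as that class
    per_class_recall = np.diag(cm) / cm.sum(axis=1)      # correct / all truly of that class
    per_class_f1 = (2 * per_class_precision * per_class_recall /
                    (per_class_precision + per_class_recall))
    print("Accuracy:", accuracy)
    print("Per-class precision:", per_class_precision)
    print("Per-class recall:", per_class_recall)
    print("Per-class F1:", per_class_f1)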

These are the most commonly used measures in multilabel classification. A convenient function, classification_report, provides a report on all of these measures. Support is simply the number of observations with that label; it's quite useful for understanding whether a dataset is balanced (that is, whether it has the same share of examples for every class) or not:

In: from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred,
                                target_names=iris.target_names))

The resulting report lists precision, recall, f1-score, and support (the number of cases for each class) for every class, along with the overall averages.
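
If you prefer to consume these numbers programmatically rather than as printed text, recent versions of scikit-learn (0.20 and later) let classification_report return a dictionary instead; this is an optional variation on the call above, not part of the original example:

In: report = classification_report(y_test, y_pred,
                                    target_names=iris.target_names,
                                    output_dict=True)
    # Per-class precision, recall, f1-score, and support, keyed by class name
    print(report['versicolor'])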

In data science practice, precision and recall are used more extensively than accuracy, as most real-world datasets tend to be unbalanced. To account for this imbalance, data scientists often present their results in terms of precision, recall, and f1-score. In addition, note that accuracy, precision, recall, and f1-score all take values in the [0.0, 1.0] range. A perfect classifier achieves a score of 1.0 on all of these measures (but be wary of any perfect classification that looks too good to be true, as it usually means that something has gone wrong; real-world data problems never have a perfect solution).
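
To see why the choice of averaging matters on unbalanced data, here is a small synthetic sketch (the labels are made up for illustration and are not part of the iris example): a classifier that almost always predicts the majority class still obtains a high weighted F1 score, while the macro-averaged F1, which treats every class equally, drops noticeably and exposes the poor performance on the minority class.

In: import numpy as np
    from sklearn.metrics import f1_score
    # Synthetic, unbalanced ground truth: 90 samples of class 0, 10 of class 1
    y_true = np.array([0] * 90 + [1] * 10)
    # A lazy classifier that predicts the majority class almost everywhere
    y_lazy = np.array([0] * 98 + [1] * 2)
    print("Weighted F1:", f1_score(y_true, y_lazy, average='weighted'))
    print("Macro F1:", f1_score(y_true, y_lazy, average='macro'))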
