Using a confusion matrix to measure accuracy in multiclass problems

With multiclass problems, we shouldn't just be interested in how well we manage to correctly classify the genres. We should also look into which genres we confuse with each other. This can be done with the appropriately named confusion matrix, which you may have noticed is part of the training procedure:

>>> from sklearn.metrics import confusion_matrix
>>> cm = confusion_matrix(y_test, y_pred)

If we print out the confusion matrix, we would see something like the following:

[[26  1  2  0  0  2]
 [ 4  7  5  0  5  3]
 [ 1  2 14  2  8  3]
 [ 5  4  7  3  7  5]
 [ 0  0 10  2 10 12]
 [ 1  0  4  0 13 12]]

This is the distribution of labels that the classifier predicted on the test set for every genre. Since we have six genres, we have a six-by-six matrix. Each row corresponds to the true genre, each column to the predicted one, and the diagonal holds the correct classifications. The first row says that of the 31 classical songs (the sum of the first row), the classifier predicted 26 to belong to the classical genre, 1 to be jazz, 2 to be country, and 2 to be metal. So, out of (26+1+2+2)=31 songs, 26 have been correctly classified as classical and 5 were misclassifications. This is actually not that bad. The second row is more sobering: only 7 out of 24 jazz songs have been correctly classified; that is, only 29 percent.
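We can also read these numbers off programmatically: dividing the diagonal by the row sums gives the fraction of correctly classified songs per genre (the per-genre recall). A minimal sketch, assuming cm is the NumPy array printed above:

import numpy as np

# diagonal = correct predictions, row sums = number of true songs per genre
recall_per_genre = cm.diagonal() / cm.sum(axis=1)
print(recall_per_genre)  # e.g. 26/31 ~ 0.84 for classical, 7/24 ~ 0.29 for jazz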

Of course, we follow the train/test split setup from the previous chapters, so we actually have to record one confusion matrix per cross-validation fold. We then have to average and normalize the matrices later on, so that every cell holds a value between 0 (total failure) and 1 (everything classified correctly).
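A minimal sketch of this bookkeeping might look as follows. The names cv_splits, clf, X, and y are illustrative (cv_splits could, for example, come from scikit-learn's StratifiedKFold); they are not taken from the book's code:

import numpy as np
from sklearn.metrics import confusion_matrix

labels = np.unique(y)  # fix the label order so the fold matrices align
cms = []
for train_idx, test_idx in cv_splits:
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    cms.append(confusion_matrix(y[test_idx], y_pred, labels=labels))

# sum over the folds, then normalize each row to sum to 1, so that every
# cell lies between 0 (total failure) and 1 (everything classified correctly)
cm_total = np.sum(cms, axis=0)
cm_norm = cm_total / cm_total.sum(axis=1, keepdims=True)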

A graphical visualization is often much easier to read than NumPy arrays. The matshow() function of matplotlib is our friend:

from matplotlib import pyplot as plt

def plot_confusion_matrix(cm, genre_list, name, title):
    plt.clf()
    # draw the matrix as a heat map; vmin/vmax assume cm is normalized to [0, 1]
    plt.matshow(cm, fignum=False, cmap='Blues', vmin=0, vmax=1.0)
    ax = plt.gca()
    ax.set_xticks(range(len(genre_list)))
    ax.set_xticklabels(genre_list)
    ax.xaxis.set_ticks_position("bottom")
    ax.set_yticks(range(len(genre_list)))
    ax.set_yticklabels(genre_list)
    ax.tick_params(axis='both', which='both', bottom=False, left=False)
    plt.title(title)
    plt.colorbar()
    plt.grid(False)
    plt.xlabel('Predicted class')
    plt.ylabel('True class')
    plt.savefig(name, bbox_inches="tight")  # we take 'name' to be the output file path
    plt.show()
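Calling the function might then look like this; a sketch, assuming genre_list holds our six genre names in the order of the matrix rows and cm_norm is the averaged, normalized matrix from the earlier snippet:

genre_list = ["classical", "jazz", "country", "pop", "rock", "metal"]
plot_confusion_matrix(cm_norm, genre_list, "confusion_fft.png",
                      "Confusion matrix of an FFT-based classifier")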

When you create a confusion matrix, be sure to choose a color map (the cmap parameter of matshow()) with an appropriate color ordering, so that it is immediately visible what a lighter or darker color means. Especially discouraged for these kinds of graphs are rainbow color maps, such as matplotlib's classic default jet, or even the Paired color map.

The final graph looks like the following:

[Figure: the normalized confusion matrix of the FFT-based classifier, with the six genres on both axes and a color bar ranging from 0 to 1]

For a perfect classifier, we would expect a diagonal of dark squares from the upper-left corner to the lower-right one, and light colors for the remaining areas. In the preceding graph, we immediately see that our FFT-based classifier is far from perfect. It only predicts classical songs reliably (the one dark square). For rock, for instance, it preferred the label metal most of the time.

Obviously, using the FFT points us in the right direction (the classical genre was not that bad), but it is not enough to get a decent classifier. Surely, we can play with the number of FFT components (currently fixed at 1,000). But before we dive into parameter tuning, we should do our research. There, we find that the FFT is indeed not a bad feature for genre classification; it is just not refined enough. Shortly, we will see how we can boost our classification performance by using a processed version of it.

Before we do that, however, we will learn another method of measuring classification performance.
