Evaluating the classifier performance

It is now time to put our model to the test, literally. We will use the test dataset to make predictions with our model and then evaluate them against the ground truth labels. For this, we first need to get our model's predictions on the test data and do a reverse mapping from the numeric labels to the actual text labels, using the following snippet:

predictions = model.predict_classes(test_features) 
class_map = {'0' : 'air_conditioner', '1' : 'car_horn',  
             '2' : 'children_playing', '3' : 'dog_bark',  
             '4' : 'drilling', '5' : 'engine_idling',  
             '6' : 'gun_shot', '7' : 'jackhammer',  
             '8' : 'siren', '9' : 'street_music'} 
test_labels_categories = [class_map[str(label)] for label in test_labels]
prediction_labels_categories = [class_map[str(label)] for label in predictions]
category_names = list(class_map.values())
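
Note that predict_classes() is only available on Keras Sequential models and has been removed in newer TensorFlow/Keras releases. If it is not available in your environment, a minimal equivalent (assuming a softmax output layer) is to take the argmax over the predicted probabilities:

import numpy as np

# Equivalent to predict_classes(): pick the most probable class index
# from the softmax probabilities for each test sample
probabilities = model.predict(test_features)
predictions = np.argmax(probabilities, axis=1)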

Let's use our model_evaluation_utils module now to evaluate our model's performance on the test data. We start by getting the overall performance metrics:

meu.get_metrics(true_labels=test_labels_categories,  
                predicted_labels=prediction_labels_categories) 

Accuracy: 0.8869
Precision: 0.8864
Recall: 0.8869
F1 Score: 0.8861
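
The model_evaluation_utils module is a thin convenience wrapper; if you don't have it handy, a minimal sketch of the same overall metrics using scikit-learn (an assumption here, not the module's exact implementation) looks like this:

import numpy as np
from sklearn import metrics

# Overall metrics with weighted averaging to account for class imbalance
print('Accuracy:', np.round(metrics.accuracy_score(test_labels_categories,
                                prediction_labels_categories), 4))
print('F1 Score:', np.round(metrics.f1_score(test_labels_categories,
                                prediction_labels_categories,
                                average='weighted'), 4))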

We get an overall model accuracy and F1 score of close to 89%, which is excellent and consistent with what we got on our validation dataset. Let's look at the per-class model performance next:

meu.display_classification_report(true_labels=test_labels_categories,
                                  predicted_labels=prediction_labels_categories,
                                  classes=category_names)

                  precision    recall  f1-score   support

        car_horn       0.87      0.73      0.79       188
           siren       0.95      0.94      0.94       750
        drilling       0.88      0.93      0.90       697
        gun_shot       0.94      0.94      0.94        71
children_playing       0.83      0.79      0.81       750
 air_conditioner       0.89      0.94      0.92       813
      jackhammer       0.92      0.93      0.92       735
   engine_idling       0.94      0.95      0.95       745
        dog_bark       0.87      0.83      0.85       543
    street_music       0.81      0.81      0.81       808

     avg / total       0.89      0.89      0.89      6100
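
If you prefer not to use the helper module, a comparable per-class breakdown can be produced directly with scikit-learn's classification_report (a sketch, assuming scikit-learn is installed):

from sklearn.metrics import classification_report

# Per-class precision, recall, and f1-score for the same predictions
print(classification_report(test_labels_categories,
                            prediction_labels_categories,
                            labels=category_names))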

This gives us a clearer perspective on exactly which classes the model handles well and which ones give it trouble. Most of the classes seem to perform quite well, especially device sounds like gun_shot, jackhammer, engine_idling, and so on. It seems to have the most trouble with street_music and children_playing.

The confusion matrix can help us see exactly where most of the misclassifications are happening and understand this even better:

meu.display_confusion_matrix_pretty(true_labels=test_labels_categories,
                                    predicted_labels=prediction_labels_categories,
                                    classes=category_names)

The matrix will appear as follows:

Looking at the diagonal of the matrix, we can see that most of the model's predictions are correct, which is excellent. With regard to misclassifications, quite a few samples belonging to street_music, dog_bark, and children_playing have been confused with one another, which is somewhat expected considering that all of these events happen outdoors in the open and could well occur together. The same holds true for drilling and jackhammer. Thankfully, there is very little overlap in misclassifications between gun_shot and children_playing.
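
If you want to inspect the raw counts behind the plot, a quick sketch using scikit-learn and pandas (again an assumption, not the helper module's internals) is:

import pandas as pd
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(test_labels_categories,
                      prediction_labels_categories,
                      labels=category_names)
cm_df = pd.DataFrame(cm, index=category_names, columns=category_names)
print(cm_df)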

Thus, we can see how effectively transfer learning works in this complex case study, where we leveraged an image classifier to help us build a robust and effective audio event classifier. We can now save this model for future use with the following code:

model.save('sound_classification_model.h5') 
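
When we need the classifier again, we can reload it with the standard Keras API; assuming the tensorflow.keras namespace (with standalone Keras, use keras.models.load_model instead), something like the following works:

from tensorflow.keras.models import load_model

# Restore the trained audio event classifier from disk
model = load_model('sound_classification_model.h5')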

You might be thinking that this is all well and good, but so far we have done everything on a static dataset. How would we use this model in the real world for audio event identification and classification? We will talk about a strategy for that in the next section.
