Evaluating the classifier performance

It is now time to put our model to the test, literally. We will use the test dataset to make predictions with our model and then evaluate them against the ground truth labels. For this, we first need to get our model's predictions on the test data and do a reverse mapping from the numeric labels to the actual text labels, using the following snippet:

predictions = model.predict_classes(test_features) 
class_map = {'0' : 'air_conditioner', '1' : 'car_horn',  
             '2' : 'children_playing', '3' : 'dog_bark',  
             '4' : 'drilling', '5' : 'engine_idling',  
             '6' : 'gun_shot', '7' : 'jackhammer',  
             '8' : 'siren', '9' : 'street_music'} 
test_labels_categories = [class_map[str(label)] for label in test_labels]
prediction_labels_categories = [class_map[str(label)] for label in predictions]
category_names = list(class_map.values())
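
Note that predict_classes() is only available on Keras Sequential models and has been removed in newer TensorFlow/Keras releases. If it is not available in your environment, a minimal equivalent (assuming a softmax output layer) is to take the argmax over the predicted probabilities:

import numpy as np

# Equivalent to predict_classes(): pick the most probable class index
# from the softmax probabilities for each test sample
probabilities = model.predict(test_features)
predictions = np.argmax(probabilities, axis=1)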

Let's use our model_evaluation_utils module now to evaluate our model's performance on the test data. We start by getting the overall performance metrics:

meu.get_metrics(true_labels=test_labels_categories,  
                predicted_labels=prediction_labels_categories) 

Accuracy: 0.8869
Precision: 0.8864
Recall: 0.8869
F1 Score: 0.8861
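
The model_evaluation_utils module is a thin convenience wrapper; if you don't have it handy, a minimal sketch of the same overall metrics using scikit-learn (an assumption here, not the module's exact implementation) looks like this:

import numpy as np
from sklearn import metrics

# Overall metrics with weighted averaging to account for class imbalance
print('Accuracy:', np.round(metrics.accuracy_score(test_labels_categories,
                                prediction_labels_categories), 4))
print('F1 Score:', np.round(metrics.f1_score(test_labels_categories,
                                prediction_labels_categories,
                                average='weighted'), 4))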

We get an overall model accuracy and F1 score of close to 89%, which is excellent and consistent with what we got on our validation dataset. Let's look at the per-class model performance next:

meu.display_classification_report(true_labels=test_labels_categories,
                                  predicted_labels=prediction_labels_categories,
                                  classes=category_names)

                  precision    recall  f1-score   support

        car_horn       0.87      0.73      0.79       188
           siren       0.95      0.94      0.94       750
        drilling       0.88      0.93      0.90       697
        gun_shot       0.94      0.94      0.94        71
children_playing       0.83      0.79      0.81       750
 air_conditioner       0.89      0.94      0.92       813
      jackhammer       0.92      0.93      0.92       735
   engine_idling       0.94      0.95      0.95       745
        dog_bark       0.87      0.83      0.85       543
    street_music       0.81      0.81      0.81       808

     avg / total       0.89      0.89      0.89      6100
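
If you prefer not to use the helper module, a comparable per-class breakdown can be produced directly with scikit-learn's classification_report (a sketch, assuming scikit-learn is installed):

from sklearn.metrics import classification_report

# Per-class precision, recall, and f1-score for the same predictions
print(classification_report(test_labels_categories,
                            prediction_labels_categories,
                            labels=category_names))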

This gives us a clearer perspective on exactly which classes the model handles well and which ones give it trouble. Most of the classes seem to perform quite well, especially device sounds like gun_shot, jackhammer, engine_idling, and so on. It seems to have the most trouble with street_music and children_playing.

The confusion matrix can help us see exactly where most of the misclassifications are happening and understand this even better:

meu.display_confusion_matrix_pretty(true_labels=test_labels_categories,
                                    predicted_labels=prediction_labels_categories,
                                    classes=category_names)

The matrix will appear as follows:

Looking at the diagonal of the matrix, we can see that most of the model's predictions are correct, which is excellent. With regard to misclassifications, quite a few samples belonging to street_music, dog_bark, and children_playing have been confused with one another, which is somewhat expected considering that all of these events happen outdoors in the open and could well occur together. The same holds true for drilling and jackhammer. Thankfully, there is very little overlap in misclassifications between gun_shot and children_playing.
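
If you want to inspect the raw counts behind the plot, a quick sketch using scikit-learn and pandas (again an assumption, not the helper module's internals) is:

import pandas as pd
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(test_labels_categories,
                      prediction_labels_categories,
                      labels=category_names)
cm_df = pd.DataFrame(cm, index=category_names, columns=category_names)
print(cm_df)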

Thus, we can see how effectively transfer learning works in this complex case study, where we leveraged an image classifier to help us build a robust and effective audio event classifier. We can now save this model for future use with the following code:

model.save('sound_classification_model.h5') 
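
When we need the classifier again, we can reload it with the standard Keras API; assuming the tensorflow.keras namespace (with standalone Keras, use keras.models.load_model instead), something like the following works:

from tensorflow.keras.models import load_model

# Restore the trained audio event classifier from disk
model = load_model('sound_classification_model.h5')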

You might be thinking that this is all well and good, but so far we have done everything on a static dataset. How would we use this model in the real world for audio event identification and classification? We will talk about a strategy for that in the next section.
