Evaluating model performance on test data

All the components for the model's performance evaluation are now ready. To evaluate our model's performance on the test dataset, we first load the image features we extracted earlier using transfer learning, which will serve as input to our models. We also load the captions, preprocess them, and group them into lists of reference captions for each image, as depicted here:

# unique test images and their pre-extracted transfer learning features
test_images = list(test_df['image'].unique())
test_img_features = [tl_img_feature_map[img_name]
                     for img_name in test_images]

# preprocess captions and group them into lists of 5 references per image
actual_captions = list(test_df['caption'])
actual_captions = preprocess_captions(actual_captions)
actual_captions = [actual_captions[x:x+5]
                   for x in range(0, len(actual_captions), 5)]
actual_captions[:2]
 
[['the dogs are in the snow in front of a fence', 
  'the dogs play on the snow', 
  'two brown dogs playfully fight in the snow', 
  'two brown dogs wrestle in the snow', 
  'two dogs playing in the snow'], 
 ['a brown and white dog swimming towards some in the pool', 
  'a dog in a swimming pool swims toward sombody we cannot see', 
  'a dog swims in a pool near a person', 
  'small dog is paddling through the water in a pool', 
  'the small brown and white dog is in the pool']] 

You can clearly see that the captions for each image are now grouped into neat, separate lists, which will form our reference set during the computation of the BLEU scores. We can now generate BLEU scores and test the performance of our models using different beam sizes. A few examples are depicted here:

# Beam Size 1 - Model 1 with 30 epochs
predicted_captions_ep30bs1 = [generate_image_caption(
                                  model=model1,
                                  word_to_index_map=word_to_index,
                                  index_to_word_map=index_to_word,
                                  image_features=img_feat,
                                  max_caption_size=max_caption_size,
                                  beam_size=1)[0]
                              for img_feat in test_img_features]
ep30bs1_bleu = compute_bleu_evaluation(
                   reference_captions=actual_captions,
                   predicted_captions=predicted_captions_ep30bs1)
 
BLEU-1: 0.5049574449416513 
BLEU-2: 0.3224643449851107 
BLEU-3: 0.22962263359362023 
BLEU-4: 0.1201459697546317 
 
 
# Beam Size 1 - Model 2 with 50 epochs
predicted_captions_ep50bs1 = [generate_image_caption(
                                  model=model2,
                                  word_to_index_map=word_to_index,
                                  index_to_word_map=index_to_word,
                                  image_features=img_feat,
                                  max_caption_size=max_caption_size,
                                  beam_size=1)[0]
                              for img_feat in test_img_features]
ep50bs1_bleu = compute_bleu_evaluation(
                   reference_captions=actual_captions,
                   predicted_captions=predicted_captions_ep50bs1)
 
 
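For reference, compute_bleu_evaluation was built earlier in the chapter; its exact implementation lives there, but a minimal sketch of how such a helper could wrap NLTK's corpus_bleu to report BLEU-1 through BLEU-4 (the uniform n-gram weighting and the returned list of scores are assumptions here) looks as follows:

from nltk.translate.bleu_score import corpus_bleu

def compute_bleu_evaluation(reference_captions, predicted_captions):
    """Sketch: corpus-level BLEU-1..4 from grouped references and predictions."""
    # corpus_bleu expects a list of reference token-list groups and
    # a parallel list of hypothesis token lists
    references = [[ref.split() for ref in refs]
                  for refs in reference_captions]
    hypotheses = [pred.split() for pred in predicted_captions]
    scores = []
    for n in range(1, 5):
        weights = tuple([1.0 / n] * n)  # uniform weights over 1..n-grams
        score = corpus_bleu(references, hypotheses, weights=weights)
        print('BLEU-{}: {}'.format(n, score))
        scores.append(score)
    return scores

Corpus-level BLEU aggregates n-gram statistics across the whole test set before computing the final score, which generally gives more stable numbers than averaging per-sentence BLEU values.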

You can clearly see that the scores start dropping as we consider higher-order n-grams. Overall, running this process is extremely time-consuming, and it gets considerably slower for larger beam sizes. We ran experiments with beam sizes of 1, 3, 5, and 10. The following table depicts the model's performance for each of these experiments:
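
That table comes from running every model and beam size combination; rather than rerunning the earlier snippet by hand for each one, the whole grid can be scripted. The following is only a sketch, assuming compute_bleu_evaluation returns the four BLEU scores as a list:

import pandas as pd

results = []
for model_name, model in [('model1_ep30', model1), ('model2_ep50', model2)]:
    for beam_size in [1, 3, 5, 10]:
        predictions = [generate_image_caption(
                           model=model,
                           word_to_index_map=word_to_index,
                           index_to_word_map=index_to_word,
                           image_features=img_feat,
                           max_caption_size=max_caption_size,
                           beam_size=beam_size)[0]
                       for img_feat in test_img_features]
        bleu_scores = compute_bleu_evaluation(
                          reference_captions=actual_captions,
                          predicted_captions=predictions)
        # one row per (model, beam size) combination
        results.append([model_name, beam_size] + list(bleu_scores))

results_df = pd.DataFrame(results,
                          columns=['model', 'beam_size', 'BLEU-1',
                                   'BLEU-2', 'BLEU-3', 'BLEU-4'])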

We can also easily visualize this as a chart to see which combination of model parameters gives us the best model with the highest BLEU scores:
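
The original chart is not reproduced here, but a minimal matplotlib sketch that could build a comparable visualization from the results_df assembled in the previous sketch looks like this:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
for model_name, group in results_df.groupby('model'):
    for metric in ['BLEU-1', 'BLEU-2', 'BLEU-3', 'BLEU-4']:
        # one line per model/metric pair, plotted against beam size
        ax.plot(group['beam_size'], group[metric], marker='o',
                label='{} {}'.format(model_name, metric))
ax.set_xlabel('Beam size')
ax.set_ylabel('BLEU score')
ax.set_title('BLEU scores by model and beam size')
ax.legend(ncol=2, fontsize='small')
plt.show()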

From the previous graph, it is quite clear that the model trained for 50 epochs, with a beam size of 10 during beam search, gives us the best performance based on the BLEU metrics.
