Evaluating the model on our test dataset gave us a measure of its performance, but how do we use the model in the real world to caption completely new photos? This is where we need some knowledge of building an end-to-end system, which takes any image as input and gives us a free-text natural-language caption as output.
Here are the major components and functions for our automated caption generator:
- Caption model and metadata initializer
- Image feature extraction model initializer
- Transfer learning-based feature extractor
- Caption generator
To make this generic, we built a class that makes use of several utility functions we mentioned in the previous sections:
```python
import numpy as np

from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input as preprocess_vgg16_input
from keras.applications import vgg16
from keras.models import Model


class CaptionGenerator:

    def __init__(self, image_locations=[], word_to_index_map=None,
                 index_to_word_map=None, max_caption_size=None,
                 caption_model=None, beam_size=1):
        self.image_locs = image_locations
        self.captions = []
        self.image_feats = []
        self.word2index = word_to_index_map
        self.index2word = index_to_word_map
        self.max_caption_size = max_caption_size
        self.vision_model = None
        self.caption_model = caption_model
        self.beam_size = beam_size

    def process_image2arr(self, path, img_dims=(224, 224)):
        # load an image from disk and preprocess it for VGG16
        img = image.load_img(path, target_size=img_dims)
        img_arr = image.img_to_array(img)
        img_arr = np.expand_dims(img_arr, axis=0)
        img_arr = preprocess_vgg16_input(img_arr)
        return img_arr

    def initialize_model(self):
        # build the feature extractor: VGG16 with its final
        # prediction layer removed, frozen for transfer learning
        vgg_model = vgg16.VGG16(include_top=True, weights='imagenet',
                                input_shape=(224, 224, 3))
        vgg_model.layers.pop()
        output = vgg_model.layers[-1].output
        vgg_model = Model(vgg_model.input, output)
        vgg_model.trainable = False
        self.vision_model = vgg_model

    def process_images(self):
        # extract and flatten a feature vector per image
        if self.image_locs:
            image_feats = [self.vision_model.predict(
                               self.process_image2arr(path=img_path))
                           for img_path in self.image_locs]
            image_feats = [np.reshape(img_feat, img_feat.shape[1])
                           for img_feat in image_feats]
            self.image_feats = image_feats
        else:
            print('No images specified')

    def generate_captions(self):
        # decode a caption for each extracted feature vector
        captions = [generate_image_caption(
                        model=self.caption_model,
                        word_to_index_map=self.word2index,
                        index_to_word_map=self.index2word,
                        image_features=img_feat,
                        max_caption_size=self.max_caption_size,
                        beam_size=self.beam_size)[0]
                    for img_feat in self.image_feats]
        self.captions = captions
```
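To make the tensor shapes in `process_images` concrete, the short sketch below reproduces its reshaping step with a random NumPy array standing in for real VGG16 output. With the final prediction layer popped off, the truncated model's last fully connected layer emits a 4,096-dimensional vector per image, so `predict()` on a single preprocessed image returns a `(1, 4096)` batch:

```python
import numpy as np

# Stand-in for the truncated VGG16 model's output on one image:
# a batch containing a single 4,096-dim feature vector.
img_feat = np.random.rand(1, 4096)

# process_images() flattens the (1, 4096) batch into a plain
# (4096,) vector before handing it to the caption model.
flat_feat = np.reshape(img_feat, img_feat.shape[1])

print(img_feat.shape)   # (1, 4096)
print(flat_feat.shape)  # (4096,)
```

The caption model expects one flat feature vector per image, which is why the batch dimension is dropped here rather than carried through.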
Now that our caption generator has been implemented, it is time to put it into action! To test it, we downloaded several images that are completely new and not present in the Flickr8K dataset. We specifically chose images from Flickr whose licenses permit commercial use, so that we can reproduce them in this book. We'll show some demonstrations in the next section.