The first step for our model is to leverage a pretrained DCNN model, using principles of transfer learning to extract the right features from our source images. To keep things simple, we will not be fine-tuning or connecting the VGG-16 model to the rest of our model architecture. We will be extracting the bottleneck features from all our images beforehand to speed up training later, since building a sequence model with several LSTMs will take a lot of training time even on GPUs, as we will see shortly.
To get started, we will load all the source image filenames and their corresponding captions from the Flickr8k_text folder in the source dataset. We will also combine the dev and train dataset images, as we mentioned before:
import pandas as pd
import numpy as np

# read train image file names
with open('../Flickr8k_text/Flickr_8k.trainImages.txt','r') as tr_imgs:
    train_imgs = tr_imgs.read().splitlines()

# read dev image file names
with open('../Flickr8k_text/Flickr_8k.devImages.txt','r') as dv_imgs:
    dev_imgs = dv_imgs.read().splitlines()

# read test image file names
with open('../Flickr8k_text/Flickr_8k.testImages.txt','r') as ts_imgs:
    test_imgs = ts_imgs.read().splitlines()

# read image captions
with open('../Flickr8k_text/Flickr8k.token.txt','r') as img_tkns:
    captions = img_tkns.read().splitlines()

# combine dev and train image names into one set
train_imgs = train_imgs + dev_imgs
Now that we have the input image filenames sorted out and the corresponding captions loaded, we need to build a dictionary that maps each source image to its captions. As we mentioned earlier, each image was captioned by five different people, so we will have a list of five captions per image. The following code helps us do this:
from collections import defaultdict

caption_map = defaultdict(list)

# store five captions in a list for each image
for record in captions:
    # each record is '<image name>#<caption index>', a tab, then the caption
    record = record.split('\t')
    img_name = record[0][:-2]   # drop the trailing '#0' ... '#4'
    img_caption = record[1].strip()
    caption_map[img_name].append(img_caption)
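To make the parsing concrete, here is a minimal sketch that runs the same logic on a few made-up records. It assumes each line of Flickr8k.token.txt has the form image name, `#`, caption index, a tab, and then the caption; the file names and captions below are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical records in the assumed "<image>#<index>\t<caption>" layout
sample_captions = [
    "img_001.jpg#0\tA dog runs across a field .",
    "img_001.jpg#1\tA brown dog is running outside .",
    "img_002.jpg#0\tTwo children play on a beach .",
]

sample_map = defaultdict(list)
for record in sample_captions:
    img_id, caption = record.split("\t")
    img_name = img_id[:-2]          # drop the trailing "#0" ... "#4"
    sample_map[img_name].append(caption.strip())

print(dict(sample_map))
```

Slicing off the last two characters only works because the caption index is a single digit, which holds when every image has at most ten captions.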
We will be leveraging this caption map later on when we build our datasets for training and testing. Let's now focus on feature extraction. Before extracting image features, we need to resize the raw input images to the dimensions the model expects and normalize the pixel values in the way the model requires. The following code will help us with the necessary image preprocessing steps:
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input as preprocess_vgg16_input

def process_image2arr(path, img_dims=(224, 224)):
    # load the image, resized to the input size expected by VGG-16
    img = image.load_img(path, target_size=img_dims)
    img_arr = image.img_to_array(img)
    # add a leading batch dimension: (224, 224, 3) -> (1, 224, 224, 3)
    img_arr = np.expand_dims(img_arr, axis=0)
    img_arr = preprocess_vgg16_input(img_arr)
    return img_arr
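For intuition about what preprocess_vgg16_input does to the array, the following NumPy-only sketch reproduces the transformation as we understand it for VGG-16 in Keras: the channels are reordered from RGB to BGR and the ImageNet per-channel means are subtracted, with no further scaling. The function name manual_vgg16_preprocess and the mean values written out here are our own illustration, so treat this as an approximation and rely on the Keras function in practice.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed to match what Keras uses)
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def manual_vgg16_preprocess(img_batch):
    """Approximate VGG-16 preprocessing for an (n, h, w, 3) RGB batch."""
    x = img_batch.astype(np.float32)   # astype returns a copy; input untouched
    x = x[..., ::-1]                   # RGB -> BGR
    return x - IMAGENET_BGR_MEANS      # zero-center each channel

# A dummy all-white 224 x 224 "image" with a leading batch dimension
dummy = np.full((1, 224, 224, 3), 255.0, dtype=np.float32)
out = manual_vgg16_preprocess(dummy)
print(out.shape, out[0, 0, 0])
```

Note that the pixel values are centered rather than squashed into the 0 to 1 range, so the preprocessed array contains negative values.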
We will also need to load up the pretrained VGG-16 model to leverage transfer learning. This is achieved with the following code snippet:
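from keras.applications import vgg16
from keras.models import Model

vgg_model = vgg16.VGG16(include_top=True, weights='imagenet',
                        input_shape=(224, 224, 3))

# drop the softmax classifier by taking the output of fc2, the last
# dense layer before it (selecting the layer by name avoids relying on
# in-place edits to vgg_model.layers)
output = vgg_model.get_layer('fc2').output
vgg_model = Model(vgg_model.input, output)
vgg_model.trainable = False

vgg_model.summary()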
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792
_________________________________________________________________
...
...
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312
=================================================================
Total params: 134,260,544
Trainable params: 0
Non-trainable params: 134,260,544
_________________________________________________________________
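The parameter counts in the summary can be checked by hand. The sketch below assumes the standard VGG-16 layout of thirteen 3 x 3 convolutions (only three of which appear in the truncated summary above) followed by the two dense layers; the helper names are ours.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases for a square-kernel Conv2D layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    """Weights plus biases for a Dense layer."""
    return in_units * out_units + out_units

# (in_channels, out_channels) for the 13 conv layers (assumed standard layout)
conv_layers = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]

conv_total = sum(conv_params(3, i, o) for i, o in conv_layers)
fc1 = dense_params(7 * 7 * 512, 4096)   # flatten output is 25088
fc2 = dense_params(4096, 4096)
total = conv_total + fc1 + fc2

print(conv_total, fc1, fc2, total)
```

The first dense layer alone accounts for roughly three quarters of all parameters, which is one reason extracting features once and caching them is attractive.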
As the model summary shows, the softmax layer has been removed and the model has no trainable parameters, since we are only interested in extracting dense feature vectors from the input images. We will now build a function that leverages our utility functions and helps extract the right features from input images:
def extract_tl_features_vgg(model, image_file_name,
                            image_dir='../Flickr8k_imgs/'):
    pr_img = process_image2arr(image_dir+image_file_name)
    tl_features = model.predict(pr_img)
    # drop the batch dimension: (1, 4096) -> (4096,)
    tl_features = np.reshape(tl_features, tl_features.shape[1])
    return tl_features
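The reshape step is needed because model.predict returns a batch even for a single image. The sketch below uses a random array as a stand-in for the network output (no model is loaded) to show that the call simply drops the batch dimension of size one.

```python
import numpy as np

# Stand-in for model.predict(pr_img): one image, 4096 features
rng = np.random.default_rng(0)
fake_prediction = rng.random((1, 4096), dtype=np.float32)

flat = np.reshape(fake_prediction, fake_prediction.shape[1])

print(fake_prediction.shape, '->', flat.shape)
# For a batch of one, this is the same as taking the first row
print(np.array_equal(flat, fake_prediction[0]))
```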
Let's now put all our previous functions and our pretrained model to the test by extracting image features and building out our train and test datasets:
img_tl_featureset = dict()

train_img_names = []
train_img_captions = []
test_img_names = []
test_img_captions = []

# extract features once per image, then add one record per caption
for img in train_imgs:
    img_tl_featureset[img] = extract_tl_features_vgg(model=vgg_model,
                                                     image_file_name=img)
    for caption in caption_map[img]:
        train_img_names.append(img)
        train_img_captions.append(caption)

for img in test_imgs:
    img_tl_featureset[img] = extract_tl_features_vgg(model=vgg_model,
                                                     image_file_name=img)
    for caption in caption_map[img]:
        test_img_names.append(img)
        test_img_captions.append(caption)

train_dataset = pd.DataFrame({'image': train_img_names,
                              'caption': train_img_captions})
test_dataset = pd.DataFrame({'image': test_img_names,
                             'caption': test_img_captions})

print('Train Dataset Size:', len(train_dataset),
      ' Test Dataset Size:', len(test_dataset))

Train Dataset Size: 35000 Test Dataset Size: 5000
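The dataset sizes follow directly from the one-row-per-caption layout. The following sketch repeats the inner loop on a toy caption map (the image names and captions are placeholders) and then works backward from the reported sizes to the number of images.

```python
# Toy caption map: 2 hypothetical images, 5 captions each
toy_caption_map = {
    'img_a.jpg': ['caption a%d' % i for i in range(5)],
    'img_b.jpg': ['caption b%d' % i for i in range(5)],
}

names, caps = [], []
for img, img_caps in toy_caption_map.items():
    for cap in img_caps:
        names.append(img)   # image name repeated once per caption
        caps.append(cap)

print(len(names), names[:6])

# Same arithmetic at full scale: rows / captions per image = images
train_images = 35000 // 5
test_images = 5000 // 5
print(train_images, test_images)
```

Because each image name appears five times in the dataset but only once in img_tl_featureset, the expensive feature extraction runs once per image rather than once per caption.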
We can also see what the train dataset looks like by using the following code:
train_dataset.head(10)
The output of the preceding code is as follows:
As expected, each input image has five captions, and we preserve that in our datasets. We will now save these dataset records and the image features obtained through transfer learning to disk, so that we can load them into memory during model training instead of extracting the features every time we want to run our model:
# save dataset records
train_dataset = train_dataset[['image', 'caption']]
test_dataset = test_dataset[['image', 'caption']]

train_dataset.to_csv('image_train_dataset.tsv', sep='\t', index=False)
test_dataset.to_csv('image_test_dataset.tsv', sep='\t', index=False)

# save transfer learning image features
# (joblib is a standalone package; older scikit-learn versions also
# exposed it as sklearn.externals.joblib)
import joblib
joblib.dump(img_tl_featureset, 'transfer_learn_img_features.pkl')

['transfer_learn_img_features.pkl']
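Before relying on the saved files, it can be worth confirming that a tab-separated round trip preserves captions containing commas or quotes. The sketch below writes a toy DataFrame (with made-up rows) to a temporary directory and reads it back; the file name toy_dataset.tsv is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Toy records standing in for train_dataset (names and captions are made up)
toy = pd.DataFrame({'image': ['img_a.jpg', 'img_a.jpg'],
                    'caption': ['A dog runs , fast .', 'He said "go" .']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_dataset.tsv')
    toy.to_csv(path, sep='\t', index=False)
    restored = pd.read_csv(path, sep='\t')

print(restored)
```

The saved feature dictionary can be reloaded in a similar way with joblib.load('transfer_learn_img_features.pkl').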
Also, if needed, you can validate how the image features look by using the following code snippets for some initial checks:
[(key, value.shape) for key, value in img_tl_featureset.items()][:5]

[('3079787482_0757e9d167.jpg', (4096,)),
 ('3284955091_59317073f0.jpg', (4096,)),
 ('1795151944_d69b82f942.jpg', (4096,)),
 ('3532192208_64b069d05d.jpg', (4096,)),
 ('454709143_9c513f095c.jpg', (4096,))]

[(k, np.round(v, 3)) for k, v in img_tl_featureset.items()][:5]

[('3079787482_0757e9d167.jpg',
  array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)),
 ('3284955091_59317073f0.jpg',
  array([0.615, 0.   , 0.653, ..., 0.   , 1.559, 2.614], dtype=float32)),
 ('1795151944_d69b82f942.jpg',
  array([0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.538], dtype=float32)),
 ('3532192208_64b069d05d.jpg',
  array([0.   , 0.   , 0.   , ..., 0.   , 0.   , 2.293], dtype=float32)),
 ('454709143_9c513f095c.jpg',
  array([0.   , 0.   , 0.131, ..., 0.833, 4.263, 0.   ], dtype=float32))]
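The same checks can be automated across the whole feature set. The sketch below is one possible sanity check, assuming the fc2 layer uses a ReLU activation (which the many zeros and the absence of negative values above suggest); the helper summarize_features is our own and is demonstrated on toy vectors rather than on real extracted features.

```python
import numpy as np

def summarize_features(featureset, expected_dim=4096):
    """Return per-image (shape_ok, all_finite, non_negative, zero_fraction)."""
    report = {}
    for name, vec in featureset.items():
        report[name] = (
            vec.shape == (expected_dim,),
            bool(np.isfinite(vec).all()),
            bool((vec >= 0).all()),
            float(np.mean(vec == 0)),
        )
    return report

# Toy stand-in for img_tl_featureset: clipped random values mimic ReLU output
rng = np.random.default_rng(42)
toy_features = {
    'img_a.jpg': np.maximum(rng.standard_normal(4096), 0).astype(np.float32),
    'img_b.jpg': np.zeros(4096, dtype=np.float32),
}

report = summarize_features(toy_features)
for name, stats in report.items():
    print(name, stats)
```

A vector that is entirely zero, or a zero fraction far from that of the other images, would be a reason to look at the corresponding source image more closely.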
We will be using these features in the next part of our modeling.