Image feature extraction with transfer learning

The first step is to use a pretrained DCNN model, applying the principles of transfer learning to extract useful features from our source images. To keep things simple, we will not fine-tune the VGG-16 model or connect it to the rest of our model architecture. Instead, we will extract the bottleneck features from all our images beforehand to speed up training later, since building a sequence model with several LSTMs takes a lot of training time even on GPUs, as we will see shortly.

To get started, we will load all the source image filenames and their corresponding captions from the Flickr8k_text folder in the source dataset. We will also combine the dev and train dataset images, as we mentioned before:

import pandas as pd 
import numpy as np 
 
# read train image file names 
with open('../Flickr8k_text/Flickr_8k.trainImages.txt','r') as tr_imgs: 
    train_imgs = tr_imgs.read().splitlines() 
 
# read dev image file names     
with open('../Flickr8k_text/Flickr_8k.devImages.txt','r') as dv_imgs: 
    dev_imgs = dv_imgs.read().splitlines() 
 
# read test image file names     
with open('../Flickr8k_text/Flickr_8k.testImages.txt','r') as ts_imgs: 
    test_imgs = ts_imgs.read().splitlines() 
 
# read image captions     
with open('../Flickr8k_text/Flickr8k.token.txt','r') as img_tkns: 
    captions = img_tkns.read().splitlines() 
 
# combine dev and train image names into one list 
train_imgs = train_imgs + dev_imgs
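Because the splits are read from separate files, it can be worth confirming that no image name appears in more than one of them, for example in both the combined training list and the test list. The following is a minimal sketch of such a check; the helper name check_disjoint and the toy file names are hypothetical and not part of the original code:

```python
from itertools import combinations

def check_disjoint(**splits):
    """Return the overlapping file names for every pair of splits."""
    overlaps = {}
    for (name_a, files_a), (name_b, files_b) in combinations(splits.items(), 2):
        common = set(files_a) & set(files_b)
        if common:
            overlaps[(name_a, name_b)] = sorted(common)
    return overlaps

# Toy file names (hypothetical), standing in for the lists read above
toy_train = ['a.jpg', 'b.jpg', 'c.jpg']
toy_dev = ['d.jpg']
toy_test = ['e.jpg', 'c.jpg']   # 'c.jpg' deliberately leaks into the test split

print(check_disjoint(train=toy_train, dev=toy_dev, test=toy_test))
```

With the real data, a call such as check_disjoint(train=train_imgs, test=test_imgs) should return an empty dictionary if the splits do not overlap.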

Now that we have the input image filenames sorted out and the corresponding captions loaded, we need to build a dictionary that maps each source image to its captions. As we mentioned earlier, each image was captioned by five different people, so we will have a list of five captions for each image. The following code helps us do this:

from collections import defaultdict 
 
caption_map = defaultdict(list) 
# store five captions in a list for each image 
for record in captions: 
    record = record.split('\t') 
    # drop the trailing '#<index>' caption suffix from the image file name 
    img_name = record[0][:-2] 
    img_caption = record[1].strip() 
    caption_map[img_name].append(img_caption)
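To make the parsing step concrete, here is a small self-contained sketch that runs the same logic on a few made-up records. The record layout (file name, a '#' followed by a caption index, a tab, then the caption) is inferred from the code above; the sample records and the function name build_caption_map are hypothetical:

```python
from collections import defaultdict

# Hypothetical records in the "<file>#<index><TAB><caption>" layout
sample_records = [
    'example_1.jpg#0\tA dog runs across a field .',
    'example_1.jpg#1\tA brown dog is running outside .',
    'example_2.jpg#0\tTwo children play on a beach .',
]

def build_caption_map(records):
    """Group captions by image file name."""
    caption_map = defaultdict(list)
    for record in records:
        # split only on the first tab, in case a caption contains one
        img_id, caption = record.split('\t', 1)
        # strip the '#<index>' suffix without assuming its length
        img_name = img_id.rsplit('#', 1)[0]
        caption_map[img_name].append(caption.strip())
    return caption_map

toy_map = build_caption_map(sample_records)
print(dict(toy_map))
```

Using rsplit('#', 1) instead of slicing off the last two characters is a design choice of this sketch: it would keep working if a caption index ever had more than one digit.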

We will use this caption map later on when we build our datasets for training and testing. Let's now focus on feature extraction. Before extracting image features, we need to resize the raw input images to the size the model expects and transform the pixel values in the way the model was trained on. The following code will help us with the necessary image preprocessing steps:

from keras.preprocessing import image 
from keras.applications.vgg16 import preprocess_input as preprocess_vgg16_input 
 
def process_image2arr(path, img_dims=(224, 224)): 
    img = image.load_img(path, target_size=img_dims) 
    img_arr = image.img_to_array(img) 
    img_arr = np.expand_dims(img_arr, axis=0) 
    img_arr = preprocess_vgg16_input(img_arr) 
    return img_arr
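For intuition about what the VGG-16 preprocess_input call does to the pixel values, the following NumPy sketch approximates it: the channels are reordered from RGB to BGR and the ImageNet per-channel means are subtracted, with no further scaling. This is an illustrative approximation rather than the Keras implementation itself, and the function name vgg16_preprocess_sketch is hypothetical:

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by the Caffe-style
# preprocessing that Keras applies for VGG-16
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg16_preprocess_sketch(rgb_batch):
    """Approximate VGG-16 preprocessing with NumPy.

    rgb_batch: array of shape (n, height, width, 3), RGB, values in 0-255.
    """
    bgr = rgb_batch[..., ::-1].astype(np.float32)  # RGB -> BGR
    return bgr - IMAGENET_BGR_MEAN                 # zero-center each channel

# A toy batch with a single pure-red image
toy = np.zeros((1, 224, 224, 3), dtype=np.float32)
toy[..., 0] = 255.0
out = vgg16_preprocess_sketch(toy)
print(out.shape, out[0, 0, 0])
```

Note that the pixel values are centered rather than rescaled to the 0-1 range, which is why the same preprocessing function should be applied to every image fed to the model.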

We will also need to load up the pretrained VGG-16 model to leverage transfer learning. This is achieved with the following code snippet:

from keras.applications import vgg16 
from keras.models import Model 
 
 
vgg_model = vgg16.VGG16(include_top=True, weights='imagenet',  
                        input_shape=(224, 224, 3)) 
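# use the output of the second-to-last layer (fc2), which skips the final 
# softmax layer; calling layers.pop() has no effect in newer Keras versions 
output = vgg_model.layers[-2].output 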
vgg_model = Model(vgg_model.input, output) 
vgg_model.trainable = False 
 
vgg_model.summary() 
 


_________________________________________________________________ 
Layer (type)                 Output Shape              Param #    
================================================================= 
input_1 (InputLayer)         (None, 224, 224, 3)       0          
_________________________________________________________________ 
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792       
_________________________________________________________________ 
... 
... 
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808    
_________________________________________________________________ 
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0          
_________________________________________________________________ 
flatten (Flatten)            (None, 25088)             0          
_________________________________________________________________ 
fc1 (Dense)                  (None, 4096)              102764544  
_________________________________________________________________ 
fc2 (Dense)                  (None, 4096)              16781312   
================================================================= 
Total params: 134,260,544 
Trainable params: 0 
Non-trainable params: 134,260,544 
_________________________________________________________________
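The parameter counts in this summary can be reproduced by hand, which is a useful sanity check that the truncated model ends where we expect. The sketch below assumes the standard formulas for convolutional and dense layers with biases; the helper names are hypothetical:

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units

flatten_units = 7 * 7 * 512                       # block5_pool output, flattened

print('block1_conv1:', conv2d_params(3, 3, 3, 64))
print('block5_conv3:', conv2d_params(3, 3, 512, 512))
print('flatten     :', flatten_units)
print('fc1         :', dense_params(flatten_units, 4096))
print('fc2         :', dense_params(4096, 4096))
```

The two dense layers account for the large majority of the 134,260,544 parameters listed above.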

As the summary shows, we have dropped the softmax layer and made the model non-trainable, since we are only interested in extracting dense feature vectors from the input images. We will now build a function that uses our utility functions to extract these features from an input image:

def extract_tl_features_vgg(model, image_file_name, 
                            image_dir='../Flickr8k_imgs/'): 
    pr_img = process_image2arr(image_dir+image_file_name) 
    tl_features = model.predict(pr_img) 
    tl_features = np.reshape(tl_features, tl_features.shape[1]) 
    return tl_features
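The reshape step in this function simply drops the batch dimension: predicting on a single image returns an array of shape (1, 4096), and we want a flat vector of 4,096 values. The following sketch illustrates this with a stand-in array instead of a real model prediction (the array fake_prediction is hypothetical):

```python
import numpy as np

# Stand-in for model.predict(pr_img): one image, 4096 features
fake_prediction = np.arange(4096, dtype=np.float32).reshape(1, 4096)

# Same reshape as in extract_tl_features_vgg
flat = np.reshape(fake_prediction, fake_prediction.shape[1])

# An equivalent and slightly more explicit alternative
alt = np.squeeze(fake_prediction, axis=0)

print(fake_prediction.shape, flat.shape, alt.shape)
```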

Let's now put all our previous functions and our pretrained model to the test by extracting image features and building out our train and test datasets:

img_tl_featureset = dict() 
train_img_names = [] 
train_img_captions = [] 
test_img_names = [] 
test_img_captions = [] 
 
for img in train_imgs: 
    img_tl_featureset[img] = extract_tl_features_vgg(model=vgg_model, 
                              image_file_name=img) 
    for caption in caption_map[img]: 
        train_img_names.append(img) 
        train_img_captions.append(caption) 
         
for img in test_imgs: 
    img_tl_featureset[img] = extract_tl_features_vgg(model=vgg_model, 
                              image_file_name=img) 
    for caption in caption_map[img]: 
        test_img_names.append(img) 
        test_img_captions.append(caption) 
         
train_dataset = pd.DataFrame({'image': train_img_names, 'caption': 
                               train_img_captions}) 
test_dataset = pd.DataFrame({'image': test_img_names, 'caption': 
                              test_img_captions}) 
print('Train Dataset Size:', len(train_dataset), '\tTest Dataset Size:', len(test_dataset)) 
 
Train Dataset Size: 35000  Test Dataset Size: 5000
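These sizes are consistent with five captions per image. One way to verify that no image is missing captions is to count the rows per image, as in the following sketch on toy data (the DataFrame toy_dataset and its contents are hypothetical):

```python
import pandas as pd

# Toy stand-in for train_dataset: one complete image, one with a caption missing
toy_dataset = pd.DataFrame({
    'image': ['img_a.jpg'] * 5 + ['img_b.jpg'] * 4,
    'caption': ['caption %d' % i for i in range(9)],
})

captions_per_image = toy_dataset.groupby('image')['caption'].count()
incomplete = captions_per_image[captions_per_image != 5]

print(incomplete.to_dict())
```

Applied to the real train_dataset or test_dataset, an empty result would indicate that every image has exactly five captions.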

We can also see what the train dataset looks like by using the following code:

train_dataset.head(10)

The output of the preceding code is as follows:

Each input image has five captions, and we maintain that in our datasets. We will now save these dataset records and the image features extracted through transfer learning to disk, so that we can load them into memory during model training instead of extracting the features again every time we want to run our model:

# save dataset records 
train_dataset = train_dataset[['image', 'caption']] 
test_dataset = test_dataset[['image', 'caption']] 
 
train_dataset.to_csv('image_train_dataset.tsv', sep='\t', index=False) 
test_dataset.to_csv('image_test_dataset.tsv', sep='\t', index=False) 
 
# save transfer learning image features 
import joblib 
joblib.dump(img_tl_featureset, 'transfer_learn_img_features.pkl') 
 
['transfer_learn_img_features.pkl']
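Captions often contain commas, which is one reason a tab separator is convenient for these files. The sketch below writes a toy dataset to an in-memory buffer and reads it back with the same separator to confirm that the captions survive unchanged; the sample rows are hypothetical, and an in-memory buffer is used here only to avoid touching the disk:

```python
import io
import pandas as pd

# Toy stand-in for train_dataset, with commas inside the captions
toy = pd.DataFrame({
    'image': ['img_a.jpg', 'img_b.jpg'],
    'caption': ['A dog , a ball and a child .', 'Two men , one in red , talk .'],
})

buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index=False)   # same options as used above

buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t')    # must use the same separator

print(restored)
```

The saved feature dictionary can be restored in a similar spirit with joblib.load on the pickle file written above.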

If needed, you can also inspect the image features with the following code snippets as an initial check:

[(key, value.shape) for key, value in  
                         img_tl_featureset.items()][:5] 
 
[('3079787482_0757e9d167.jpg', (4096,)), 
 ('3284955091_59317073f0.jpg', (4096,)), 
 ('1795151944_d69b82f942.jpg', (4096,)), 
 ('3532192208_64b069d05d.jpg', (4096,)), 
 ('454709143_9c513f095c.jpg', (4096,))] 
 
 
[(k, np.round(v, 3)) for k, v in img_tl_featureset.items()][:5] 
 
[('3079787482_0757e9d167.jpg', 
  array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)), 
 ('3284955091_59317073f0.jpg', 
  array([0.615, 0.   , 0.653, ..., 0.   , 1.559, 2.614], dtype=float32)), 
 ('1795151944_d69b82f942.jpg', 
  array([0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.538], dtype=float32)), 
 ('3532192208_64b069d05d.jpg', 
  array([0.   , 0.   , 0.   , ..., 0.   , 0.   , 2.293], dtype=float32)), 
 ('454709143_9c513f095c.jpg', 
  array([0.   , 0.   , 0.131, ..., 0.833, 4.263, 0.   ], dtype=float32))]
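Since the fc2 layer of VGG-16 uses a ReLU activation, the extracted features should be non-negative and typically contain many zeros, as the output above suggests. A quick way to quantify this is to compute the minimum value and the fraction of zeros per feature vector. The following is a minimal sketch on toy vectors; the helper summarize_features and the toy data are hypothetical:

```python
import numpy as np

def summarize_features(featureset):
    """Report the minimum value and fraction of zeros for each feature vector."""
    summary = {}
    for name, vec in featureset.items():
        summary[name] = {
            'min': float(vec.min()),
            'zero_fraction': float(np.mean(vec == 0)),
        }
    return summary

# Toy stand-in for img_tl_featureset (real vectors have 4096 entries)
toy_featureset = {
    'img_a.jpg': np.array([0.0, 0.0, 0.5, 2.0], dtype=np.float32),
    'img_b.jpg': np.zeros(4, dtype=np.float32),
}

summary = summarize_features(toy_featureset)
print(summary)
```

A vector with a zero fraction of 1.0 carries no information about its image, so any such cases in the real feature set may be worth inspecting before training.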

We will be using these features in the next part of our modeling.
