The next step is to preprocess our caption data and build a vocabulary, or metadata dictionary, for our captions. We start by reading in our training dataset records and writing a function to preprocess the text captions:
import pandas as pd

train_df = pd.read_csv('image_train_dataset.tsv', delimiter='\t')
total_samples = train_df.shape[0]
total_samples

35000

# function to pre-process text captions
def preprocess_captions(caption_list):
    pc = []
    for caption in caption_list:
        caption = caption.strip().lower()
        caption = caption.replace('.', '').replace(',', '').replace("'", "").replace('"', '')
        caption = caption.replace('&', 'and').replace('(', '').replace(')', '').replace('-', ' ')
        caption = ' '.join(caption.split())
        caption = '<START> ' + caption + ' <END>'
        pc.append(caption)
    return pc
We will now preprocess our captions and build some basic vocabulary metadata, including utilities for converting unique words into numeric representations and vice versa:
import numpy as np
from collections import Counter

# pre-process caption data
train_captions = train_df.caption.tolist()
processed_train_captions = preprocess_captions(train_captions)

tc_tokens = [caption.split() for caption in processed_train_captions]
tc_tokens_length = [len(tokenized_caption) for tokenized_caption in tc_tokens]

# build vocabulary metadata
tc_words = [word.strip() for word_list in tc_tokens for word in word_list]
unique_words = list(set(tc_words))
token_counter = Counter(unique_words)

word_to_index = {item[0]: index+1 for index, item in enumerate(dict(token_counter).items())}
word_to_index['<PAD>'] = 0
index_to_word = {index: word for word, index in word_to_index.items()}
vocab_size = len(word_to_index)
max_caption_size = np.max(tc_tokens_length)
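As a quick sanity check, we can verify that the two mappings round-trip cleanly by encoding a processed caption into indices and decoding it back (the exact index values will differ across runs, since they depend on the ordering of the unique-word set):

# sanity check the word <-> index mappings (index values vary across runs)
sample_caption = processed_train_captions[0]
encoded = [word_to_index[word] for word in sample_caption.split()]
decoded = ' '.join(index_to_word[idx] for idx in encoded)
assert decoded == sample_caption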
It is important to save this vocabulary metadata to disk so we can reuse it in the future for both model training and predictions. Otherwise, if we regenerate the vocabulary, a model trained with an earlier version could end up paired with different word-to-number mappings, which would give us the wrong results and cost us valuable time:
import joblib  # sklearn.externals.joblib has been removed in recent scikit-learn versions

vocab_metadata = dict()
vocab_metadata['word2index'] = word_to_index
vocab_metadata['index2word'] = index_to_word
vocab_metadata['max_caption_size'] = max_caption_size
vocab_metadata['vocab_size'] = vocab_size
joblib.dump(vocab_metadata, 'vocabulary_metadata.pkl')

['vocabulary_metadata.pkl']
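When we need the metadata again, for example in a later training session or at prediction time, we can load it straight back from disk. A minimal sketch of that reload step looks as follows:

# reload the saved vocabulary metadata in a later session
vocab_metadata = joblib.load('vocabulary_metadata.pkl')
word_to_index = vocab_metadata['word2index']
index_to_word = vocab_metadata['index2word']
max_caption_size = vocab_metadata['max_caption_size']
vocab_size = vocab_metadata['vocab_size']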
If needed, you can check the contents of our vocabulary metadata using the following code snippet, and also see what a typical preprocessed text caption looks like for one of the images:
# check vocabulary metadata
{k: v if type(v) is not dict else list(v.items())[:5] for k, v in vocab_metadata.items()}

{'index2word': [(0, '<PAD>'),
                (1, 'nearby'),
                (2, 'flooded'),
                (3, 'fundraising'),
                (4, 'snowboarder')],
 'max_caption_size': 39,
 'vocab_size': 7927,
 'word2index': [('reflections', 4122),
                ('flakes', 1829),
                ('flexing', 7684),
                ('scaling', 1057),
                ('pretend', 6788)]}

# check pre-processed caption
processed_train_captions[0]

'<START> a black dog is running after a white dog in the snow <END>'
We will leverage this metadata shortly when we build a data generator function that will feed our deep learning model during training, as previewed in the sketch below.
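As a small preview (not the final generator, which we build later), the following is a minimal sketch, assuming the metadata built above, of the encode-and-pad step such a generator needs in order to turn a caption into a fixed-length integer sequence. The helper name encode_caption is hypothetical and used here only for illustration:

# hypothetical helper: turn one processed caption into a fixed-length integer
# sequence, padding with the <PAD> index up to max_caption_size
def encode_caption(caption, word_to_index, max_caption_size):
    token_ids = [word_to_index[word] for word in caption.split()]
    padding = [word_to_index['<PAD>']] * (max_caption_size - len(token_ids))
    return token_ids + padding

encoded_sample = encode_caption(processed_train_captions[0],
                                word_to_index, max_caption_size)
len(encoded_sample)  # equals max_caption_size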