The next step is to preprocess our caption data and build a vocabulary, or metadata dictionary, for our captions. We start by reading in our training dataset records and writing a function to preprocess the text captions:
import pandas as pd

train_df = pd.read_csv('image_train_dataset.tsv', delimiter='\t')
total_samples = train_df.shape[0]
total_samples

35000

# function to pre-process text captions
def preprocess_captions(caption_list):
    pc = []
    for caption in caption_list:
        caption = caption.strip().lower()
        caption = caption.replace('.', '').replace(',', '').replace("'", "").replace('"', '')
        caption = caption.replace('&', 'and').replace('(', '').replace(')', '').replace('-', ' ')
        caption = ' '.join(caption.split())
        caption = '<START> ' + caption + ' <END>'
        pc.append(caption)
    return pc
We will now preprocess our captions and build some basic vocabulary metadata, including utilities for converting unique words into numeric representations and vice versa:
import numpy as np
from collections import Counter

# pre-process caption data
train_captions = train_df.caption.tolist()
processed_train_captions = preprocess_captions(train_captions)

tc_tokens = [caption.split() for caption in processed_train_captions]
tc_tokens_length = [len(tokenized_caption) for tokenized_caption in tc_tokens]

# build vocabulary metadata
tc_words = [word.strip() for word_list in tc_tokens for word in word_list]
unique_words = list(set(tc_words))
token_counter = Counter(unique_words)

word_to_index = {item[0]: index+1 for index, item in enumerate(dict(token_counter).items())}
word_to_index['<PAD>'] = 0
index_to_word = {index: word for word, index in word_to_index.items()}
vocab_size = len(word_to_index)
max_caption_size = np.max(tc_tokens_length)
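As a quick sanity check, we can verify that the two mappings round-trip cleanly by encoding a processed caption into indices and decoding it back (the exact index values will differ across runs, since they depend on the ordering of the unique-word set):

# sanity check the word <-> index mappings (index values vary across runs)
sample_caption = processed_train_captions[0]
encoded = [word_to_index[word] for word in sample_caption.split()]
decoded = ' '.join(index_to_word[idx] for idx in encoded)
assert decoded == sample_caption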
It is important to save this vocabulary metadata to disk so we can reuse it in the future for both model training and predictions. Otherwise, if we regenerate the vocabulary, a model trained with an earlier version could end up paired with different word-to-number mappings, which would give us the wrong results and cost us valuable time:
import joblib  # sklearn.externals.joblib has been removed in recent scikit-learn versions

vocab_metadata = dict()
vocab_metadata['word2index'] = word_to_index
vocab_metadata['index2word'] = index_to_word
vocab_metadata['max_caption_size'] = max_caption_size
vocab_metadata['vocab_size'] = vocab_size
joblib.dump(vocab_metadata, 'vocabulary_metadata.pkl')

['vocabulary_metadata.pkl']
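When we need the metadata again, for example in a later training session or at prediction time, we can load it straight back from disk. A minimal sketch of that reload step looks as follows:

# reload the saved vocabulary metadata in a later session
vocab_metadata = joblib.load('vocabulary_metadata.pkl')
word_to_index = vocab_metadata['word2index']
index_to_word = vocab_metadata['index2word']
max_caption_size = vocab_metadata['max_caption_size']
vocab_size = vocab_metadata['vocab_size']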
If needed, you can check the contents of our vocabulary metadata using the following code snippet, and also see what a typical preprocessed text caption looks like for one of the images:
# check vocabulary metadata
{k: v if type(v) is not dict else list(v.items())[:5] for k, v in vocab_metadata.items()}

{'index2word': [(0, '<PAD>'),
                (1, 'nearby'),
                (2, 'flooded'),
                (3, 'fundraising'),
                (4, 'snowboarder')],
 'max_caption_size': 39,
 'vocab_size': 7927,
 'word2index': [('reflections', 4122),
                ('flakes', 1829),
                ('flexing', 7684),
                ('scaling', 1057),
                ('pretend', 6788)]}

# check pre-processed caption
processed_train_captions[0]

'<START> a black dog is running after a white dog in the snow <END>'
We will leverage this metadata shortly when we build a data generator function that will feed our deep learning model during training, as previewed in the sketch below.
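As a small preview (not the final generator, which we build later), the following is a minimal sketch, assuming the metadata built above, of the encode-and-pad step such a generator needs in order to turn a caption into a fixed-length integer sequence. The helper name encode_caption is hypothetical and used here only for illustration:

# hypothetical helper: turn one processed caption into a fixed-length integer
# sequence, padding with the <PAD> index up to max_caption_size
def encode_caption(caption, word_to_index, max_caption_size):
    token_ids = [word_to_index[word] for word in caption.split()]
    padding = [word_to_index['<PAD>']] * (max_caption_size - len(token_ids))
    return token_ids + padding

encoded_sample = encode_caption(processed_train_captions[0],
                                word_to_index, max_caption_size)
len(encoded_sample)  # equals max_caption_size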