Conceptual approach

A successful image captioning system needs a way to translate a given image into a sequence of words. To extract the right, relevant features from images, we can leverage a DCNN and, coupled with a recurrent sequence model such as an RNN or LSTM, build a hybrid generative model that generates a sequence of words as a caption, given a source image.
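To make this feature-extraction step concrete, the following is a minimal sketch of how a pretrained CNN can turn an image into a dense, fixed-length feature vector; the choice of VGG-16 and the image file name here are just assumptions for illustration.

```python
import numpy as np
from tensorflow.keras.applications import vgg16
from tensorflow.keras.preprocessing import image

# Pretrained VGG-16 without its classification head; global average pooling
# over the last convolutional block yields a 512-dimensional feature vector
feature_extractor = vgg16.VGG16(weights='imagenet', include_top=False, pooling='avg')

img = image.load_img('sample_image.jpg', target_size=(224, 224))  # hypothetical image file
x = vgg16.preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

img_features = feature_extractor.predict(x)
print(img_features.shape)  # (1, 512)
```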

Thus, conceptually, the idea is to build a single hybrid model that can take a source image I as input and can be trained to maximize the likelihood, P(S|I), where S is our target output, a sequence of words that can be represented by S = {S1, S2, ..., Sn}, such that each word Sw comes from a given dictionary, which is our vocabulary. This caption S should be able to give a decent description of the input image.

Neural machine translation is an excellent inspiration for building such a system. Typically used for language translation, its model architecture is an encoder-decoder built using RNNs or LSTMs. The encoder is usually an LSTM model that reads an input sentence from the source language and transforms it into a dense, fixed-length vector. This vector is then used as the initial hidden state of the decoder LSTM model, which ultimately generates an output sentence in the target language.
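To make this encoder-decoder pattern concrete, here is a minimal Keras sketch of such a translation architecture; the vocabulary sizes and dimensions are assumed values used purely for illustration.

```python
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, latent_dim = 5000, 6000, 256  # assumed sizes

# Encoder: reads the source sentence and compresses it into fixed-length states
enc_inputs = layers.Input(shape=(None,), name='source_tokens')
enc_emb = layers.Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generates the target sentence, initialized with the encoder's states
dec_inputs = layers.Input(shape=(None,), name='target_tokens')
dec_emb = layers.Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_out = layers.LSTM(latent_dim, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
next_word = layers.Dense(tgt_vocab, activation='softmax')(dec_out)

nmt_model = Model([enc_inputs, dec_inputs], next_word)
nmt_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```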

For image captioning, we will leverage a similar strategy, where the encoder that processes the input will be a DCNN model, since our source data is images. By now, we have already seen the strength of CNN-based models for effective and rich feature extraction from images. Thus, the source image data will be converted into a dense, fixed-length numeric vector; typically, a pretrained model leveraging a transfer learning approach will be the most effective here. This vector will serve as the input to our decoder LSTM model, which will generate the caption description as a sequence of words. Taking inspiration from the original paper, the objective to be maximized can be represented mathematically, as follows:

Θ* = arg maxΘ Σ(I, S) log P(S | I; Θ)
Here, Θ signifies the model parameters, I signifies the input image, and S is its corresponding caption description consisting of a sequence of words. Considering a caption description of length N, that is, a total of N words, we can model the joint probability over {S0, S1, ..., SN} using the chain rule, as follows:

log P(S | I) = Σt=0 to N log P(St | I, S0, S1, ..., St-1)
Thus, during model training, we have pairs (I, S) of images and their captions as inputs, and the idea is to optimize the sum of the log probabilities in the preceding equation over the entire training data, using an efficient algorithm such as stochastic gradient descent. Considering the sequence of terms on the RHS of the preceding equation, an RNN-based model is the apt choice, such that the variable number of words we condition upon, up to t-1, is expressed by a memory state ht. This memory state is updated at every step, based on the previous state ht and the current input xt (image features and the next word), using a non-linear function f(...), as follows:

ht+1 = f(ht, xt)

Typically, xt represents our inputs, that is, our image features and words. For image features, we leverage DCNNs, as mentioned before. For the function f, we choose to use LSTMs, since they are very effective at dealing with problems such as vanishing and exploding gradients, which we discussed in the initial chapters of this book. As a brief refresher on the LSTM memory block, let us refer to the following diagram from the Show and Tell research paper:

The memory block contains the LSTM cell c, which is controlled by the input, output, and forget gates. The cell c encodes the knowledge of the inputs observed up to the current time-step. The three gates are layers that are applied multiplicatively, so a value from the gated layer is kept if the gate is 1 and rejected if the gate is 0. The recurrent connections are shown in blue in the preceding diagram. We generally have multiple LSTM copies in the model, and the output mt-1 of the LSTM at time t-1 is fed to the next LSTM at time t. Thus, this output mt-1 from time t-1 is fed back into the memory block at time t through the three gates we discussed earlier. The cell value is also fed back through the forget gate. The memory output at time t, mt, is typically fed to a softmax layer to predict the next word.

This prediction is usually obtained from the output gate ot and the current cell state ct. Some of these definitions and operations are depicted in the following equations from the paper:

it = σ(Wix xt + Wim mt-1)
ft = σ(Wfx xt + Wfm mt-1)
ot = σ(Wox xt + Wom mt-1)
ct = ft ⊙ ct-1 + it ⊙ h(Wcx xt + Wcm mt-1)
mt = ot ⊙ ct
pt+1 = Softmax(mt)
Here, ⊙ is the product operator, used in particular with the current gate states and values. The W matrices are the trainable parameters of the network. These gates help in dealing with problems such as exploding and vanishing gradients. The non-linearity in the network is introduced by our regular sigmoid σ and hyperbolic tangent h functions. As we discussed earlier, the memory output mt is fed to the softmax to predict the next word, where the output pt+1 is a probability distribution over all the words.
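To connect these equations back to code, the following is a small NumPy sketch of a single LSTM memory-block update following the preceding gate equations; the weight-matrix names are hypothetical and bias terms are omitted, matching the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, W):
    """One memory-block update; W is a dict of (hypothetical) weight matrices,
    for example W['ix'] and W['im'] act on x_t and m_{t-1} for the input gate."""
    i_t = sigmoid(W['ix'] @ x_t + W['im'] @ m_prev)            # input gate
    f_t = sigmoid(W['fx'] @ x_t + W['fm'] @ m_prev)            # forget gate
    o_t = sigmoid(W['ox'] @ x_t + W['om'] @ m_prev)            # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W['cx'] @ x_t + W['cm'] @ m_prev)
    m_t = o_t * c_t                                            # memory output m_t
    return m_t, c_t
```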

Thus, armed with this knowledge, you can see that the LSTM-based sequence model needs to be combined with the necessary word embedding layer and the CNN-based model that produces dense features from the source image. The LSTM model's objective is thus to predict each word of the caption text, based on all the previously predicted words and the input image, as defined by our earlier term p(St | I, S0, S1, ..., St-1). To simplify the recurrent connections in the LSTM, we can represent it in its unrolled form, as a series of LSTM copies that share the same parameters, as depicted in the following diagram:

From the preceding diagram, it is pretty evident that the recurrent connections, represented by the blue horizontal arrows in the unrolled LSTM architecture, have been transformed into feed-forward connections. Also, as is evident, the output mt-1 of the LSTM at time t-1 is fed to the next LSTM at time t, and so on. Considering the source input image I and its caption S = {S0, S1, ..., SN}, the following equations describe the major operations involved in the unrolled architecture depicted in the preceding diagram:

x-1 = CNN(I)
xt = We St, for t in {0, ..., N-1}
pt+1 = LSTM(xt), for t in {0, ..., N-1}
Here, each text word in the caption is represented by a one-hot vector St, such that its dimension is equal to the size of our vocabulary (the number of unique words). Also, note that we use special marker or delimiter words for S0 and SN, denoted by <START> and <END> respectively, to mark the start and end of a caption. This helps the LSTM understand when the caption has been generated completely.
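As a quick illustration of these markers and one-hot vectors, consider the following toy example; the vocabulary here is made up and far smaller than a real one.

```python
import numpy as np

# Toy vocabulary (hypothetical); real vocabularies contain thousands of words
vocab = ['<START>', '<END>', 'a', 'dog', 'plays', 'in', 'the', 'park']
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# A caption wrapped with the special delimiter words S0 = <START> and SN = <END>
caption = ['<START>', 'a', 'dog', 'plays', 'in', 'the', 'park', '<END>']
indices = [word_to_idx[word] for word in caption]

# Each St becomes a one-hot vector whose dimension equals the vocabulary size
one_hot_caption = np.eye(len(vocab))[indices]
print(one_hot_caption.shape)  # (8, 8) -> (caption length, vocabulary size)
```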

The input image I is fed to our DCNN model, which generates the dense feature vector, and the words are transformed into dense word embeddings based on the embedding layer We. The overall loss function to be minimized is thus the negative log-likelihood of the right word at every step, as depicted in the following equation:

L(I, S) = - Σt=1 to N log pt(St)
This loss is hence minimized during model training with respect to all the parameters of our model, including the DCNN, the LSTM, and the word embeddings. Let's now look at how we can put this into action.
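Before we do, the following minimal Keras sketch shows one way these pieces can be wired together end to end, with the image features injected as the first time-step of the LSTM decoder; the dimensions and layer choices are assumptions for illustration, not the exact model we will build.

```python
from tensorflow.keras import layers, Model

vocab_size, embed_dim, lstm_units = 8000, 256, 256   # assumed sizes
img_feat_dim, max_len = 512, 20                      # assumed DCNN feature size and caption length

# Image branch: pre-extracted DCNN features projected into the embedding space
img_input = layers.Input(shape=(img_feat_dim,), name='image_features')
img_step = layers.RepeatVector(1)(layers.Dense(embed_dim, activation='relu')(img_input))

# Text branch: previous caption words as dense word embeddings (We)
word_input = layers.Input(shape=(max_len,), name='caption_tokens')
word_emb = layers.Embedding(vocab_size, embed_dim)(word_input)

# Decoder: image vector as the first step, then the words; softmax per time-step
decoder_in = layers.Concatenate(axis=1)([img_step, word_emb])
decoder_out = layers.LSTM(lstm_units, return_sequences=True)(decoder_in)
next_word_probs = layers.TimeDistributed(
    layers.Dense(vocab_size, activation='softmax'))(decoder_out)

caption_model = Model([img_input, word_input], next_word_probs)
# Sparse categorical cross-entropy corresponds to the negative log-likelihood above
caption_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```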
