Decoder

Now, we will learn how the decoder generates the target sentence by using the thought vector, $z$, generated by the encoder. A decoder is an RNN with LSTM or GRU cells. The goal of our decoder is to generate the target sentence for the given input (source) sentence.

We know that we usually start off an RNN by initializing its hidden state with random values, but for the decoder's RNN, we initialize the hidden state with the thought vector, $z$, generated by the encoder instead. The decoder network is shown in the following diagram:

But what should be the input to the decoder? We simply pass <sos> as an input to the decoder, which indicates the start of the sentence. So, once the decoder receives <sos>, it tries to predict the actual starting word of the target sentence. Let's represent the decoder hidden state at time step $t$ by $s_t$.

At the first time step, $t=1$, we feed the first input, which is <sos>, to the decoder, and along with it, we pass the thought vector, $z$, as the initial hidden state, as follows:

$$s_1 = \text{LSTM}(z, \langle sos \rangle)$$
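The following is a minimal sketch of this first step, assuming a PyTorch implementation; the vocabulary size, embedding size, hidden size, and the index of <sos> are illustrative placeholders rather than values from the text. It embeds the <sos> token and feeds it to an LSTM cell whose hidden state is initialized with the thought vector, $z$, instead of zeros:

```python
import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size = 10000, 256, 512   # assumed sizes
SOS = 1                                                  # assumed index of <sos>

embedding = nn.Embedding(vocab_size, embed_size)
decoder_cell = nn.LSTMCell(embed_size, hidden_size)

# z: thought vector produced by the encoder (a random tensor stands in here)
z = torch.randn(1, hidden_size)

# initialize the decoder's hidden state with z instead of zeros
h0, c0 = z, torch.zeros(1, hidden_size)

# first time step: feed the embedded <sos> token along with (h0, c0)
x1 = embedding(torch.tensor([SOS]))      # shape: (1, embed_size)
s1, c1 = decoder_cell(x1, (h0, c0))      # s1 is the decoder hidden state at t=1
```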

Okay, what are we really doing here? We need to predict the output sequence, which is the French equivalent of our input English sentence. There are a lot of French words in our vocabulary. How does the decoder decide which word to output? That is, how does it decide the first word of our output sequence?

We feed the decoder hidden state, $s_1$, to a dense (linear) layer, which returns a score for every word in our vocabulary to be the first output word. That is, the score for the output word at time step $t=1$ is computed as follows:

$$o_1 = W s_1$$

Instead of having raw scores, we convert them into probabilities. Since we learned that the softmax function squashes values between 0 and 1, we use the softmax function to convert the score, $o_1$, into a probability, $p_1$:

$$p_1 = \text{softmax}(o_1)$$

Thus, we have probabilities for all the French words in our vocabulary to be the first output word. We select the word that has the highest probability as the first output word using the argmax function:

$$\hat{y}_1 = \text{argmax}(p_1)$$
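As a small illustration of this selection step (again assuming PyTorch, with a hypothetical projection layer and illustrative sizes), the hidden state is projected to one score per vocabulary word, the scores are squashed into probabilities with softmax, and argmax picks the most probable word:

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 10000, 512      # assumed sizes
out_layer = nn.Linear(hidden_size, vocab_size)

s1 = torch.randn(1, hidden_size)          # decoder hidden state from the first step

o1 = out_layer(s1)                        # raw scores, one per word in the vocabulary
p1 = torch.softmax(o1, dim=-1)            # probabilities that sum to 1
y1 = torch.argmax(p1, dim=-1).item()      # index of the most probable first word
```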

So, we have predicted that the first output word, $\hat{y}_1$, is Que, as shown in the preceding diagram.

At the next time step, $t=2$, we feed the output word predicted at the previous time step, $\hat{y}_1$, as input to the decoder. Along with it, we also pass the previous hidden state, $s_1$:

$$s_2 = \text{LSTM}(s_1, \hat{y}_1)$$

Then, we compute the score for every word in our vocabulary to be the next output word, that is, the output word at time step $t=2$:

$$o_2 = W s_2$$

Then, we convert the scores into probabilities using the softmax function:

$$p_2 = \text{softmax}(o_2)$$

Next, we select the word that has the highest probability as the output word, $\hat{y}_2$, at time step $t=2$:

$$\hat{y}_2 = \text{argmax}(p_2)$$

Thus, we initialize the decoder's hidden state with the thought vector, $z$, and, at every time step, $t$, we feed the predicted output word from the previous time step, $\hat{y}_{t-1}$, and the previous hidden state, $s_{t-1}$, as inputs to the decoder at the current time step, and predict the current output, $\hat{y}_t$.

But when does the decoder stop? Because our output sequence has to stop somewhere, we cannot keep on feeding the predicted output word from the previous time step as an input to the next time step forever. When the decoder predicts <eos> as the output word, this implies the end of the sentence. The decoder then knows that the input source sentence has been converted into a complete target sentence and stops predicting the next word.
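Putting these steps together, a greedy decoding loop might look like the following sketch (PyTorch assumed; the token indices, layer sizes, and maximum-length cap are illustrative placeholders, not values from the text). At every step, the previous prediction and the previous state are fed back in, and the loop stops when <eos> is produced:

```python
import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size = 10000, 256, 512   # assumed sizes
SOS, EOS, MAX_LEN = 1, 2, 20                             # assumed indices and cap

embedding = nn.Embedding(vocab_size, embed_size)
decoder_cell = nn.LSTMCell(embed_size, hidden_size)
out_layer = nn.Linear(hidden_size, vocab_size)

z = torch.randn(1, hidden_size)            # thought vector from the encoder (dummy)
h, c = z, torch.zeros(1, hidden_size)      # initialize the hidden state with z
prev_word = SOS                            # start decoding from <sos>

output_words = []                          # indices of the predicted target words
for t in range(MAX_LEN):
    x = embedding(torch.tensor([prev_word]))
    h, c = decoder_cell(x, (h, c))                     # s_t from s_{t-1} and y_{t-1}
    probs = torch.softmax(out_layer(h), dim=-1)        # probability of every word
    prev_word = torch.argmax(probs, dim=-1).item()     # greedy pick of y_t
    if prev_word == EOS:                               # <eos> marks the end
        break
    output_words.append(prev_word)
```

The cap on the number of steps is only a safety net for the sketch; the actual stopping signal is the predicted <eos> token, as described above.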

This is how the seq2seq model converts the source sentence into the target sentence.
