Backpropagation in LSTM

We compute the loss at each time step to determine how well our LSTM model is predicting the output. If we use cross-entropy as the loss function, then the loss, $L_t$, at time step $t$ is given by the following equation:

$$L_t = -y_t \log(\hat{y}_t)$$

Here, $y_t$ is the actual output and $\hat{y}_t$ is the predicted output at time step $t$.

Our final loss is the sum of the losses at all time steps, and can be given as follows:

$$L = \sum_{t} L_t$$

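As a minimal sketch, assuming the targets are one-hot vectors and the predictions are softmax probability vectors, the per-step cross-entropy loss and the total loss could be computed as follows (the array values and the helper name step_loss are purely illustrative, not from the text):

```python
import numpy as np

def step_loss(y, y_hat, eps=1e-12):
    # Cross-entropy at a single time step: L_t = -sum(y_t * log(y_hat_t))
    return -np.sum(y * np.log(y_hat + eps))

# Hypothetical one-hot targets and softmax predictions for two time steps
targets = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
predictions = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]

# Final loss is the sum of the per-step losses: L = sum_t L_t
L = sum(step_loss(y, y_hat) for y, y_hat in zip(targets, predictions))
print(L)  # approximately 0.58
```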
We minimize the loss using gradient descent: we compute the derivative of the loss with respect to every weight used in the network and find the optimal weights that minimize the loss. The weights used in the LSTM network are as follows (a small initialization sketch follows this list):

  • We have four input-to-hidden layer weights, $U_i$, $U_f$, $U_o$, and $U_c$, which are the input-to-hidden layer weights of the input gate, forget gate, output gate, and candidate state, respectively
  • We have four hidden-to-hidden layer weights, $W_i$, $W_f$, $W_o$, and $W_c$, which are the hidden-to-hidden layer weights of the input gate, forget gate, output gate, and candidate state, respectively
  • We have one hidden-to-output layer weight, $V$

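For concreteness, here is a minimal sketch of how these nine weight matrices might be held in one place; the function name init_lstm_weights and the dimension names n_x (input size), n_h (hidden size), and n_y (output size) are illustrative assumptions, not notation from the text:

```python
import numpy as np

def init_lstm_weights(n_x, n_h, n_y, scale=0.01, seed=0):
    """Return the nine LSTM weight matrices as a dictionary."""
    rng = np.random.default_rng(seed)
    weights = {}
    for gate in ('i', 'f', 'o', 'c'):
        # Input-to-hidden weights of the input, forget, output gates and candidate state
        weights[f'U_{gate}'] = rng.normal(0.0, scale, size=(n_h, n_x))
        # Hidden-to-hidden weights for the same four gates
        weights[f'W_{gate}'] = rng.normal(0.0, scale, size=(n_h, n_h))
    # Hidden-to-output weight
    weights['V'] = rng.normal(0.0, scale, size=(n_y, n_h))
    return weights
```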
We find the optimal values for all of these weights through gradient descent and update each weight according to the weight update rule, which is given by the following equation:

$$W = W - \alpha \frac{\partial L}{\partial W}$$

Here, $\alpha$ is the learning rate.

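Assuming the gradients of the loss with respect to each weight matrix are already available (the next section derives them), a minimal sketch of this update rule applied to all of the weights could look as follows; grads is a hypothetical dictionary keyed the same way as the weights dictionary above:

```python
def gradient_descent_step(weights, grads, alpha=0.01):
    # Weight update rule: W = W - alpha * dL/dW, applied to every weight matrix
    for name in weights:
        weights[name] -= alpha * grads[name]
    return weights
```

Calling this once per training iteration moves every weight a small step, scaled by the learning rate $\alpha$, in the direction that decreases the loss.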
In the next section, we'll look at how to compute gradients of loss with respect to all of the weights used in the LSTM cell step by step.

You can skip the upcoming section if you are not interested in deriving gradients for all of the weights. However, it will strengthen your understanding of the LSTM cell.