After predicting the output, $\hat{y}$, we are in the final layer of the network. Since we are backpropagating, that is, going from the output layer to the input layer, our first weight will be $V$, which is the hidden-to-output layer weight.
We have learned throughout that the final loss is the sum of the losses over all the time steps, that is, $L = \sum_{j=1}^{T} L_j$. In a similar manner, our final gradient is the sum of the gradients at all time steps.
If we have $T$ time steps, then we can write the gradient of loss with respect to $V$ as follows:

$$\frac{\partial L}{\partial V} = \sum_{j=1}^{T} \frac{\partial L_j}{\partial V}$$
Since the final equation of the LSTM, that is, $\hat{y}_t = \text{softmax}(Vh_t)$, is the same as that of the RNN, calculating the gradient of loss with respect to $V$ is exactly the same as what we computed in the RNN. Thus, we can directly write the following:

$$\frac{\partial L}{\partial V} = \sum_{j=1}^{T} (\hat{y}_j - y_j) \otimes h_j$$
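The summation above can be sketched in NumPy as follows. This is a minimal illustration, not the book's code: the shapes (hidden size `H`, output size `K`, `T` time steps), the random hidden states, and the one-hot targets are all assumptions made for the example; it simply accumulates $(\hat{y}_j - y_j) \otimes h_j$ over the time steps.

```python
import numpy as np

# Hypothetical sizes for illustration: T time steps, hidden size H, output size K.
T, H, K = 5, 4, 3
rng = np.random.default_rng(0)

V = rng.standard_normal((K, H))          # hidden-to-output weight matrix V
h = rng.standard_normal((T, H))          # hidden states h_1 ... h_T (stand-ins)
y = np.eye(K)[rng.integers(0, K, T)]     # one-hot targets y_1 ... y_T

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# dL/dV = sum over j of (y_hat_j - y_j) outer h_j
dV = np.zeros_like(V)
for j in range(T):
    y_hat = softmax(V @ h[j])            # y_hat_j = softmax(V h_j)
    dV += np.outer(y_hat - y[j], h[j])   # gradient contribution at time step j

print(dV.shape)  # same shape as V: (3, 4)
```

Because the output layer is identical in both models, this same accumulation loop applies whether the hidden states came from an RNN cell or an LSTM cell; only the computation that produced `h` differs.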