Now, we will see how to calculate the gradients of loss with respect to the hidden-to-hidden layer weights, $U$, for all the gates and the content state.

Let's calculate the gradients of loss with respect to $U_r$.
Recall the equation of the reset gate, which is given as follows:

$$r_t = \sigma(U_r h_{t-1} + W_r x_t)$$
Using the chain rule, we can write the following:

$$\frac{\partial L}{\partial U_r} = \frac{\partial L}{\partial r_t} \frac{\partial r_t}{\partial U_r}$$

Let's calculate each of the terms in the preceding equation. The first term, $\frac{\partial L}{\partial r_t}$, we already calculated in equation (11). The second term is calculated as follows:

$$\frac{\partial r_t}{\partial U_r} = \sigma'(U_r h_{t-1} + W_r x_t) \, h_{t-1} = r_t \odot (1 - r_t) \, h_{t-1}$$

Here, $\sigma'$ denotes the derivative of the sigmoid function. Thus, our final equation for calculating the gradient of loss with respect to $U_r$ becomes the following:

$$\frac{\partial L}{\partial U_r} = \frac{\partial L}{\partial r_t} \, r_t \odot (1 - r_t) \, h_{t-1}$$
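As a quick numerical sketch of this result (not the book's code; the variable names `h_prev`, `x_t`, and `dL_dr` are illustrative, and the upstream gradient from equation (11) is replaced by a random stand-in), we can compute the gradient with respect to $U_r$ for a single time step using NumPy:

```python
import numpy as np

# Minimal sketch: gradient of the loss w.r.t. the reset gate's
# hidden-to-hidden weight U_r at one time step. h_prev, x_t, and the
# upstream gradient dL_dr (equation (11) in the text) are random
# stand-ins for illustration only.
rng = np.random.default_rng(0)
hidden, inputs = 4, 3
U_r = rng.standard_normal((hidden, hidden))
W_r = rng.standard_normal((hidden, inputs))
h_prev = rng.standard_normal(hidden)
x_t = rng.standard_normal(inputs)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

r_t = sigmoid(U_r @ h_prev + W_r @ x_t)   # forward pass: reset gate

dL_dr = rng.standard_normal(hidden)       # stand-in for equation (11)
# dr_t/dU_r involves sigma'(.) h_{t-1}; written out entrywise, the full
# gradient is the outer product of (dL/dr_t * r_t * (1 - r_t)) with h_{t-1}
dL_dU_r = np.outer(dL_dr * r_t * (1 - r_t), h_prev)
print(dL_dU_r.shape)  # → (4, 4), same shape as U_r
```

A finite-difference check of each entry of `U_r` confirms the outer-product form matches the analytical gradient.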
Now, let's move on to finding the gradients of loss with respect to $U_z$.

Recall the equation of the update gate, which is given as follows:

$$z_t = \sigma(U_z h_{t-1} + W_z x_t)$$

Using the chain rule, we can write the following:

$$\frac{\partial L}{\partial U_z} = \frac{\partial L}{\partial z_t} \frac{\partial z_t}{\partial U_z}$$

We have already computed the first term, $\frac{\partial L}{\partial z_t}$, in equation (12). The second term is computed as follows:

$$\frac{\partial z_t}{\partial U_z} = \sigma'(U_z h_{t-1} + W_z x_t) \, h_{t-1} = z_t \odot (1 - z_t) \, h_{t-1}$$

Thus, our final equation for calculating the gradient of loss with respect to $U_z$ becomes the following:

$$\frac{\partial L}{\partial U_z} = \frac{\partial L}{\partial z_t} \, z_t \odot (1 - z_t) \, h_{t-1}$$
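The update-gate gradient follows the same pattern as the reset gate, only the weights change. A hedged sketch (names such as `U_z` and `dL_dz` are illustrative; `dL_dz` stands in for the quantity from equation (12)):

```python
import numpy as np

# Sketch: gradient of the loss w.r.t. the update gate's hidden-to-hidden
# weight U_z. All inputs are random stand-ins; dL_dz plays the role of
# the upstream gradient from equation (12) in the text.
rng = np.random.default_rng(1)
hidden, inputs = 4, 3
U_z = rng.standard_normal((hidden, hidden))
W_z = rng.standard_normal((hidden, inputs))
h_prev = rng.standard_normal(hidden)
x_t = rng.standard_normal(inputs)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

z_t = sigmoid(U_z @ h_prev + W_z @ x_t)   # forward pass: update gate

dL_dz = rng.standard_normal(hidden)       # stand-in for equation (12)
# Final gradient: outer product of (dL/dz_t * z_t * (1 - z_t)) with h_{t-1}
dL_dU_z = np.outer(dL_dz * z_t * (1 - z_t), h_prev)
```

The only difference from the reset-gate code is which weight matrix and upstream gradient are used; the sigmoid-derivative-times-$h_{t-1}$ structure is identical.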
Now, we will find the gradients of loss with respect to $U_c$.

Recall the content state equation:

$$c_t = \tanh(U_c (r_t \odot h_{t-1}) + W_c x_t)$$

Using the chain rule, we can write the following:

$$\frac{\partial L}{\partial U_c} = \frac{\partial L}{\partial c_t} \frac{\partial c_t}{\partial U_c}$$

Refer to equation (10) for the first term, $\frac{\partial L}{\partial c_t}$. The second term is given as follows:

$$\frac{\partial c_t}{\partial U_c} = \tanh'(U_c (r_t \odot h_{t-1}) + W_c x_t) \, (r_t \odot h_{t-1}) = (1 - c_t^2) \, (r_t \odot h_{t-1})$$

Thus, our final equation for calculating the gradient of loss with respect to $U_c$ becomes the following:

$$\frac{\partial L}{\partial U_c} = \frac{\partial L}{\partial c_t} \, (1 - c_t^2) \, (r_t \odot h_{t-1})$$
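For the content state, the tanh derivative replaces the sigmoid derivative, and the gate-modulated hidden state $r_t \odot h_{t-1}$ replaces $h_{t-1}$. A sketch under the same assumptions as before (`r_t` would come from the forward pass and `dL_dc` from equation (10); here both are random stand-ins):

```python
import numpy as np

# Sketch: gradient of the loss w.r.t. the content state's
# hidden-to-hidden weight U_c, where
#   c_t = tanh(U_c (r_t ⊙ h_{t-1}) + W_c x_t).
# r_t and dL_dc (equation (10) in the text) are random stand-ins.
rng = np.random.default_rng(2)
hidden, inputs = 4, 3
U_c = rng.standard_normal((hidden, hidden))
W_c = rng.standard_normal((hidden, inputs))
h_prev = rng.standard_normal(hidden)
x_t = rng.standard_normal(inputs)
r_t = rng.uniform(0.0, 1.0, hidden)       # reset gate values in (0, 1)

c_t = np.tanh(U_c @ (r_t * h_prev) + W_c @ x_t)   # content state

dL_dc = rng.standard_normal(hidden)       # stand-in for equation (10)
# dc_t/dU_c = (1 - c_t^2) (r_t ⊙ h_{t-1}); as a matrix, the outer
# product of (dL/dc_t * (1 - c_t^2)) with (r_t * h_prev)
dL_dU_c = np.outer(dL_dc * (1 - c_t**2), r_t * h_prev)
```

Note that `r_t * h_prev` appears both inside the forward pass and in the gradient, reflecting that the reset gate scales which parts of the previous hidden state reach the content state.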