Let's calculate the gradients of the loss with respect to the input-to-hidden layer weights, $U$, for all the gates and the candidate state. Computing the gradients of the loss with respect to $U$ is exactly the same as computing the gradients with respect to the hidden-to-hidden weights, $W$, except that the last term will be $x_t$ instead of $h_{t-1}$. Let's examine what we mean by that.
Let's find the gradient of the loss with respect to $U_i$.
The input gate equation is as follows:

$$i_t = \sigma(U_i x_t + W_i h_{t-1})$$
Thus, using the chain rule, we can write the following:

$$\frac{\partial L}{\partial U_i} = \frac{\partial L}{\partial i_t} \frac{\partial i_t}{\partial U_i}$$
Let's calculate each of the terms in the preceding equation. We already know the first term, $\frac{\partial L}{\partial i_t}$, from equation (2). The second term can be computed as follows:

$$\frac{\partial i_t}{\partial U_i} = i_t \odot (1 - i_t) x_t$$
Thus, our final equation for calculating the gradient of the loss with respect to $U_i$ becomes the following:

$$\frac{\partial L}{\partial U_i} = \frac{\partial L}{\partial i_t} \odot i_t \odot (1 - i_t) x_t$$
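This gradient can be sanity-checked numerically. The following is a minimal sketch (the dimensions, the stand-in loss $L = \sum i_t^2$, and all variable names are illustrative assumptions, not the book's code) that compares the analytic gradient against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: input size 3, hidden size 4 (illustrative choices)
x_t = rng.standard_normal(3)        # input at time step t
h_prev = rng.standard_normal(4)     # previous hidden state h_{t-1}
U_i = rng.standard_normal((4, 3))   # input-to-hidden weights of the input gate
W_i = rng.standard_normal((4, 4))   # hidden-to-hidden weights of the input gate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(U):
    # Stand-in loss: L = sum(i_t ** 2), so dL/di_t = 2 * i_t
    i_t = sigmoid(U @ x_t + W_i @ h_prev)
    return np.sum(i_t ** 2)

# Analytic gradient: (dL/di_t * i_t * (1 - i_t)) as the row factor, x_t as the column factor
i_t = sigmoid(U_i @ x_t + W_i @ h_prev)
dL_di = 2 * i_t
dL_dUi = np.outer(dL_di * i_t * (1 - i_t), x_t)

# Central finite-difference estimate of the same gradient
num = np.zeros_like(U_i)
eps = 1e-6
for r in range(U_i.shape[0]):
    for c in range(U_i.shape[1]):
        U_plus, U_minus = U_i.copy(), U_i.copy()
        U_plus[r, c] += eps
        U_minus[r, c] -= eps
        num[r, c] = (loss(U_plus) - loss(U_minus)) / (2 * eps)

print(np.max(np.abs(dL_dUi - num)))  # maximum discrepancy; close to zero
```

Note that $x_t$ enters as an outer product: each row of the gradient matrix is the elementwise factor for one hidden unit scaled by the input vector.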
As you can see, the preceding equation is exactly the same as the one for $\frac{\partial L}{\partial W_i}$, except that the last term is $x_t$ instead of $h_{t-1}$. This applies to all the other weights, so we can directly write the equations as follows:
- Gradients of the loss with respect to $U_f$:

  $$\frac{\partial L}{\partial U_f} = \frac{\partial L}{\partial f_t} \odot f_t \odot (1 - f_t) x_t$$
- Gradients of the loss with respect to $U_o$:

  $$\frac{\partial L}{\partial U_o} = \frac{\partial L}{\partial o_t} \odot o_t \odot (1 - o_t) x_t$$
- Gradients of the loss with respect to $U_g$:

  $$\frac{\partial L}{\partial U_g} = \frac{\partial L}{\partial g_t} \odot (1 - g_t^2) x_t$$
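Since all four gradients share one pattern (upstream term, times the gate's local derivative, times $x_t$ as an outer product), they can be sketched as a single helper. This is a hypothetical function, and the upstream terms $\partial L/\partial i_t$, $\partial L/\partial f_t$, $\partial L/\partial o_t$, $\partial L/\partial g_t$ are assumed to have been computed earlier in the backward pass:

```python
import numpy as np

def input_weight_grads(x_t, gates, upstream):
    """Gradients of the loss w.r.t. U_i, U_f, U_o, U_g.

    gates    : dict of forward-pass activations i_t, f_t, o_t, g_t
    upstream : dict of dL/di_t, dL/df_t, dL/do_t, dL/dg_t (assumed given)
    """
    i_t, f_t, o_t, g_t = gates["i"], gates["f"], gates["o"], gates["g"]
    return {
        # Sigmoid gates: local derivative is gate * (1 - gate)
        "U_i": np.outer(upstream["i"] * i_t * (1 - i_t), x_t),
        "U_f": np.outer(upstream["f"] * f_t * (1 - f_t), x_t),
        "U_o": np.outer(upstream["o"] * o_t * (1 - o_t), x_t),
        # Tanh candidate state: local derivative is 1 - g_t ** 2
        "U_g": np.outer(upstream["g"] * (1 - g_t ** 2), x_t),
    }
```

For the hidden-to-hidden weights $W$, the same function would apply with $h_{t-1}$ passed in place of $x_t$, which is exactly the "last term" difference noted above.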
After computing the gradients with respect to all of these weights, we update them using the weight update rule and minimize the loss.
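As a sketch, the update rule here is plain gradient descent, $\theta = \theta - \eta \, \partial L / \partial \theta$; the learning rate and the dictionary layout below are illustrative assumptions:

```python
import numpy as np

def sgd_step(weights, grads, lr=0.1):
    # theta = theta - lr * dL/dtheta for every weight matrix
    for name, g in grads.items():
        weights[name] = weights[name] - lr * g
    return weights

# Tiny worked example: every weight is 1.0, every gradient is 0.5
weights = {"U_i": np.ones((2, 2))}
grads = {"U_i": np.full((2, 2), 0.5)}
sgd_step(weights, grads)
print(weights["U_i"])  # each entry is 1.0 - 0.1 * 0.5 = 0.95
```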