Gradients with respect to hidden to hidden layer weights, W

Now, we will compute the gradients of the loss with respect to the hidden to hidden layer weights, $W$. Similar to $V$, the final gradient is the sum of the gradients at all the time steps:

$$\frac{\partial L}{\partial W}=\sum_{j}\frac{\partial L_j}{\partial W}$$

So, for our network unrolled over three time steps, we can write:

$$\frac{\partial L}{\partial W}=\frac{\partial L_0}{\partial W}+\frac{\partial L_1}{\partial W}+\frac{\partial L_2}{\partial W}$$
First, let's compute the gradient of the loss $L_0$ with respect to $W$; that is, $\frac{\partial L_0}{\partial W}$.

We cannot compute the derivative of $L_0$ with respect to $W$ directly from $L_0=-y_0\log(\hat{y}_0)$, as there is no $W$ term in it. So, we use the chain rule to compute the gradient of the loss with respect to $W$. Let's recollect the forward propagation equations:

$$h_0=\tanh(Ux_0+Wh_{init})$$
$$\hat{y}_0=\text{softmax}(Vh_0)$$

Here, $h_{init}$ is the initial hidden state, which is a constant.

First, we calculate the partial derivative of the loss with respect to $\hat{y}_0$; then, from $\hat{y}_0$, we calculate the partial derivative with respect to $h_0$; then, from $h_0$, we can calculate the derivative with respect to $W$, as follows:

$$\frac{\partial L_0}{\partial W}=\frac{\partial L_0}{\partial \hat{y}_0}\frac{\partial \hat{y}_0}{\partial h_0}\frac{\partial h_0}{\partial W}$$
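To make this chain rule concrete, here is a minimal NumPy sketch of the single-step gradient. All names, dimensions, and values (`U`, `V`, `W`, `x0`, `y0`, `h_init`, and so on) are made-up illustrations rather than the book's running example, and the initial hidden state is taken as a fixed non-zero vector so that the gradient is non-trivial:

```python
import numpy as np

np.random.seed(0)
n_x, n_h, n_y = 4, 3, 4                   # hypothetical dimensions

U = np.random.randn(n_h, n_x) * 0.1       # input-to-hidden weights
W = np.random.randn(n_h, n_h) * 0.1       # hidden-to-hidden weights
V = np.random.randn(n_y, n_h) * 0.1       # hidden-to-output weights

x0 = np.random.randn(n_x)                 # input at time step 0
y0 = np.eye(n_y)[1]                       # one-hot target at time step 0
h_init = np.random.randn(n_h)             # fixed (constant) initial hidden state

# Forward propagation for time step 0
h0 = np.tanh(U @ x0 + W @ h_init)
z0 = V @ h0
y_hat0 = np.exp(z0) / np.sum(np.exp(z0))  # softmax
L0 = -np.sum(y0 * np.log(y_hat0))         # cross-entropy loss

# Chain rule: dL0/dW = dL0/dy_hat0 * dy_hat0/dh0 * dh0/dW
dL0_dh0 = V.T @ (y_hat0 - y0)             # softmax + cross-entropy reduce to (y_hat - y)
da0 = (1 - h0 ** 2) * dL0_dh0             # derivative of tanh at the pre-activation
dL0_dW = np.outer(da0, h_init)            # dh0/dW brings in the previous (initial) state

print(dL0_dW.shape)                       # (3, 3): same shape as W
```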
Now, let's compute the gradient of the loss $L_1$ with respect to $W$; that is, $\frac{\partial L_1}{\partial W}$. Thus, again, we will apply the chain rule and get the following:

$$\frac{\partial L_1}{\partial W}=\frac{\partial L_1}{\partial \hat{y}_1}\frac{\partial \hat{y}_1}{\partial h_1}\frac{\partial h_1}{\partial W}$$
If you look at the preceding equation, how can we calculate the term $\frac{\partial h_1}{\partial W}$? Let's recall the equation of $h_1$:

$$h_1=\tanh(Ux_1+Wh_0)$$
As you can see in the preceding equation, computing $h_1$ depends on $h_0$ and $W$, but $h_0$ is not a constant; it is a function of $W$ again. So, we need to calculate the derivative with respect to that, as well.

The equation then becomes the following:

$$\frac{\partial L_1}{\partial W}=\frac{\partial L_1}{\partial \hat{y}_1}\frac{\partial \hat{y}_1}{\partial h_1}\frac{\partial h_1}{\partial W}+\frac{\partial L_1}{\partial \hat{y}_1}\frac{\partial \hat{y}_1}{\partial h_1}\frac{\partial h_1}{\partial h_0}\frac{\partial h_0}{\partial W}$$
The following figure shows computing $\frac{\partial L_1}{\partial W}$; we can notice how $h_1$ is dependent on $h_0$:

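Before moving on to $L_2$, here is a small NumPy sketch of the two terms in the preceding equation. As before, all names, dimensions, and values (`U`, `V`, `W`, `x0`, `x1`, `y1`, `h_init`) are illustrative assumptions rather than the book's example; the point is simply that $\frac{\partial L_1}{\partial W}$ is the sum of the direct path through $h_1$ and the path that goes through $h_0$:

```python
import numpy as np

np.random.seed(1)
n_x, n_h, n_y = 4, 3, 4                   # hypothetical dimensions

U = np.random.randn(n_h, n_x) * 0.1
W = np.random.randn(n_h, n_h) * 0.1
V = np.random.randn(n_y, n_h) * 0.1
x0, x1 = np.random.randn(n_x), np.random.randn(n_x)
y1 = np.eye(n_y)[2]                       # one-hot target at time step 1
h_init = np.random.randn(n_h)             # fixed initial hidden state

# Forward propagation for time steps 0 and 1
h0 = np.tanh(U @ x0 + W @ h_init)
h1 = np.tanh(U @ x1 + W @ h0)
z1 = V @ h1
y_hat1 = np.exp(z1) / np.sum(np.exp(z1))

# First term: dL1/dy_hat1 * dy_hat1/dh1 * dh1/dW (h0 treated as a fixed input to h1)
dL1_dh1 = V.T @ (y_hat1 - y1)
da1 = (1 - h1 ** 2) * dL1_dh1
term_direct = np.outer(da1, h0)

# Second term: ... * dh1/dh0 * dh0/dW (the path through the earlier hidden state)
dL1_dh0 = W.T @ da1
da0 = (1 - h0 ** 2) * dL1_dh0
term_via_h0 = np.outer(da0, h_init)

dL1_dW = term_direct + term_via_h0        # both paths contribute to the gradient
print(np.linalg.norm(term_direct), np.linalg.norm(term_via_h0))   # both non-zero
```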
Now, let's compute the gradient of the loss $L_2$ with respect to $W$; that is, $\frac{\partial L_2}{\partial W}$. Thus, again, we will apply the chain rule and get the following:

$$\frac{\partial L_2}{\partial W}=\frac{\partial L_2}{\partial \hat{y}_2}\frac{\partial \hat{y}_2}{\partial h_2}\frac{\partial h_2}{\partial W}$$
In the preceding equation, we can't compute $\frac{\partial h_2}{\partial W}$ directly. Recall the equation of $h_2$:

$$h_2=\tanh(Ux_2+Wh_1)$$
As you can observe, computing $h_2$ depends on the function $h_1$, whereas $h_1$ is again a function that depends on the function $h_0$. As shown in the following figure, to compute the derivative with respect to $W$, we need to traverse all the way back to $h_0$, as each function is dependent on the previous one:

This can be pictorially represented as follows. In equation form, expanding the chain rule through $h_1$ and $h_0$ gives:

$$\frac{\partial L_2}{\partial W}=\frac{\partial L_2}{\partial \hat{y}_2}\frac{\partial \hat{y}_2}{\partial h_2}\frac{\partial h_2}{\partial W}+\frac{\partial L_2}{\partial \hat{y}_2}\frac{\partial \hat{y}_2}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial W}+\frac{\partial L_2}{\partial \hat{y}_2}\frac{\partial \hat{y}_2}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial h_0}\frac{\partial h_0}{\partial W}$$
This applies to the loss at any time step; say, $L_j$. So, we can say that to compute the gradient of any loss $L_j$ with respect to $W$, we need to traverse all the way back to $h_0$, as shown in the following figure:

This is because, in RNNs, the hidden state at time $t$ is dependent on the hidden state at time $t-1$, which implies that the current hidden state is always dependent on the previous hidden state.

So, the gradient of any loss $L_j$ with respect to $W$ can be computed as shown in the following figure:

Thus, we can write that the gradient of the loss $L_j$ with respect to $W$ becomes:

$$\frac{\partial L_j}{\partial W}=\sum_{k=0}^{j}\frac{\partial L_j}{\partial \hat{y}_j}\frac{\partial \hat{y}_j}{\partial h_j}\frac{\partial h_j}{\partial h_k}\frac{\partial h_k}{\partial W}\tag{11}$$
The sum in the previous equation implies the sum over all the hidden states $h_k$. In the preceding equation, the term $\frac{\partial h_j}{\partial h_k}$ can be computed using the chain rule. So, we can say:

$$\frac{\partial h_j}{\partial h_k}=\prod_{m=k+1}^{j}\frac{\partial h_m}{\partial h_{m-1}}\tag{12}$$
Assume that $j=3$ and $k=0$; then, the preceding equation becomes:

$$\frac{\partial h_3}{\partial h_0}=\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial h_0}$$
Substituting equation (12) into equation (11) will give us the following:

$$\frac{\partial L_j}{\partial W}=\sum_{k=0}^{j}\frac{\partial L_j}{\partial \hat{y}_j}\frac{\partial \hat{y}_j}{\partial h_j}\left(\prod_{m=k+1}^{j}\frac{\partial h_m}{\partial h_{m-1}}\right)\frac{\partial h_k}{\partial W}\tag{13}$$
We know that the final loss is the sum of the losses across all the time steps, $L=\sum_{j}L_j$, and therefore:

$$\frac{\partial L}{\partial W}=\sum_{j}\frac{\partial L_j}{\partial W}$$
Substituting equation (13) into the preceding equation, we get the following:

$$\frac{\partial L}{\partial W}=\sum_{j}\sum_{k=0}^{j}\frac{\partial L_j}{\partial \hat{y}_j}\frac{\partial \hat{y}_j}{\partial h_j}\left(\prod_{m=k+1}^{j}\frac{\partial h_m}{\partial h_{m-1}}\right)\frac{\partial h_k}{\partial W}$$
We have two summations in the preceding equation, where:

  • $\sum_{j}$ implies the sum of the loss across all the time steps
  • $\sum_{k=0}^{j}$ is the summation over the hidden states

So, our final equation for computing the gradient of the loss with respect to $W$ is given as:

$$\frac{\partial L}{\partial W}=\sum_{j}\sum_{k=0}^{j}\frac{\partial L_j}{\partial \hat{y}_j}\frac{\partial \hat{y}_j}{\partial h_j}\frac{\partial h_j}{\partial h_k}\frac{\partial h_k}{\partial W}\tag{15}$$
Now, we will look at how to compute each of the terms in the preceding equation, one by one. From equation (4) and equation (9), we can say that:

$$\frac{\partial L_j}{\partial \hat{y}_j}\frac{\partial \hat{y}_j}{\partial h_j}=(\hat{y}_j-y_j)\,V$$
Let's look at the next term, $\frac{\partial h_j}{\partial h_k}$. From equation (12), it is the product of the derivatives of consecutive hidden states, $\frac{\partial h_m}{\partial h_{m-1}}$, for $m=k+1,\dots,j$.

We know that the hidden state $h_m$ is computed as:

$$h_m=\tanh(Ux_m+Wh_{m-1})$$

The derivative of $\tanh$ is $1-\tanh^{2}$, so we can write:

$$\frac{\partial h_m}{\partial h_{m-1}}=\text{diag}(1-h_m^{2})\,W$$
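As a quick sanity check on this Jacobian, the following sketch (with made-up dimensions and values, not the book's example) compares $\text{diag}(1-h_m^{2})\,W$ against finite differences of the hidden-state update:

```python
import numpy as np

np.random.seed(3)
n_x, n_h = 4, 3                                # hypothetical dimensions
U = np.random.randn(n_h, n_x) * 0.1
W = np.random.randn(n_h, n_h) * 0.1
x_m = np.random.randn(n_x)                     # input at step m
h_prev = np.random.randn(n_h)                  # hidden state at step m-1

h_m = np.tanh(U @ x_m + W @ h_prev)
jacobian = np.diag(1 - h_m ** 2) @ W           # dh_m / dh_{m-1}

# Finite-difference approximation, column by column
numerical = np.zeros((n_h, n_h))
eps = 1e-6
for i in range(n_h):
    d = np.zeros(n_h)
    d[i] = eps
    plus = np.tanh(U @ x_m + W @ (h_prev + d))
    minus = np.tanh(U @ x_m + W @ (h_prev - d))
    numerical[:, i] = (plus - minus) / (2 * eps)

print(np.allclose(jacobian, numerical, atol=1e-6))   # should print True
```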
Let's look at the final term, $\frac{\partial h_k}{\partial W}$. We know that the hidden state $h_k$ is computed as $h_k=\tanh(Ux_k+Wh_{k-1})$. Thus, the derivative of $h_k$ with respect to $W$ becomes:

$$\frac{\partial h_k}{\partial W}=(1-h_k^{2})\,h_{k-1}$$

Here, the $1-h_k^{2}$ factor comes from the derivative of $\tanh$, and $h_{k-1}$ comes from the $Wh_{k-1}$ term inside it.
Substituting all of the calculated terms into equation (15), we get our final equation for the gradient of the loss $L$ with respect to $W$, as follows:

$$\frac{\partial L}{\partial W}=\sum_{j}\sum_{k=0}^{j}(\hat{y}_j-y_j)\,V\left(\prod_{m=k+1}^{j}\text{diag}(1-h_m^{2})\,W\right)(1-h_k^{2})\,h_{k-1}$$
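As a closing illustration, here is a minimal NumPy sketch that implements this double summation literally (looping over every $j$ and every $k\le j$) and verifies the result against finite differences of the total loss. All dimensions, sequence length, and variable names are made-up assumptions for illustration, not the book's implementation; practical code would compute the same quantity with a single backward pass:

```python
import numpy as np

np.random.seed(2)
n_x, n_h, n_y, T = 4, 3, 4, 5                  # hypothetical sizes; time steps 0..T-1

U = np.random.randn(n_h, n_x) * 0.1            # input-to-hidden weights
W = np.random.randn(n_h, n_h) * 0.1            # hidden-to-hidden weights
V = np.random.randn(n_y, n_h) * 0.1            # hidden-to-output weights
xs = [np.random.randn(n_x) for _ in range(T)]  # inputs x_0 ... x_{T-1}
ys = [np.eye(n_y)[t % n_y] for t in range(T)]  # one-hot targets
h_init = np.random.randn(n_h)                  # fixed initial hidden state

# Forward propagation, keeping every hidden state and prediction
hs, y_hats = [h_init], []                      # hs[k + 1] holds h_k
for t in range(T):
    h = np.tanh(U @ xs[t] + W @ hs[-1])
    z = V @ h
    hs.append(h)
    y_hats.append(np.exp(z) / np.sum(np.exp(z)))

# dL/dW = sum_j sum_{k<=j} dL_j/dy_hat_j dy_hat_j/dh_j dh_j/dh_k dh_k/dW
dW = np.zeros_like(W)
for j in range(T):
    grad_h = V.T @ (y_hats[j] - ys[j])         # dL_j/dh_j
    for k in range(j, -1, -1):                 # walk back from h_j to h_0
        da_k = (1 - hs[k + 1] ** 2) * grad_h   # through the tanh at step k
        dW += np.outer(da_k, hs[k])            # dh_k/dW brings in h_{k-1}
        grad_h = W.T @ da_k                    # dh_k/dh_{k-1}: one more step back

# Verify against finite differences of the total loss
def total_loss(W):
    h, L = h_init, 0.0
    for t in range(T):
        h = np.tanh(U @ xs[t] + W @ h)
        y_hat = np.exp(V @ h) / np.sum(np.exp(V @ h))
        L -= np.sum(ys[t] * np.log(y_hat))
    return L

numerical, eps = np.zeros_like(W), 1e-6
for a in range(n_h):
    for b in range(n_h):
        Wp, Wm = W.copy(), W.copy()
        Wp[a, b] += eps
        Wm[a, b] -= eps
        numerical[a, b] = (total_loss(Wp) - total_loss(Wm)) / (2 * eps)

print(np.allclose(dW, numerical, atol=1e-5))   # should print True
```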