The total loss, L, is the sum of the losses at all time steps, and can be given as follows:

L = Σ_t L_t
To minimize the loss using gradient descent, we compute the derivative of the loss with respect to all of the weights used in the GRU cell, as follows:
- We have three input-to-hidden layer weights, U_z, U_r, and U, which are the input-to-hidden layer weights of the update gate, reset gate, and content state, respectively
- We have three hidden-to-hidden layer weights, W_z, W_r, and W, which are the hidden-to-hidden layer weights of the update gate, reset gate, and content state, respectively
- We have one hidden-to-output layer weight, V
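To make the roles of these seven weight matrices concrete, the following is a minimal numpy sketch of a single GRU cell forward step. The dimensions, the random initialization, and the exact gating convention (new hidden state as a z-weighted blend of the previous hidden state and the content state) are illustrative assumptions, not the only possible choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes, chosen only for illustration
input_size, hidden_size = 3, 4
output_size = 2
rng = np.random.default_rng(0)

# Three input-to-hidden weights: update gate, reset gate, content state
U_z = rng.normal(scale=0.1, size=(hidden_size, input_size))
U_r = rng.normal(scale=0.1, size=(hidden_size, input_size))
U   = rng.normal(scale=0.1, size=(hidden_size, input_size))

# Three hidden-to-hidden weights: update gate, reset gate, content state
W_z = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_r = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W   = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

# One hidden-to-output weight
V = rng.normal(scale=0.1, size=(output_size, hidden_size))

def gru_step(x, h_prev):
    z = sigmoid(U_z @ x + W_z @ h_prev)    # update gate
    r = sigmoid(U_r @ x + W_r @ h_prev)    # reset gate
    c = np.tanh(U @ x + W @ (r * h_prev))  # content state
    h = (1 - z) * h_prev + z * c           # blend old state and content
    return h

x = rng.normal(size=(input_size,))
h = gru_step(x, np.zeros(hidden_size))  # hidden state at this time step
y_hat = V @ h                           # output at this time step
```

At each time step, the loss L_t is computed from y_hat, and the total loss summed over the sequence is what gradient descent differentiates with respect to each of the seven matrices above.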
We find the optimal values for all these weights through gradient descent and update the weights according to the weight update rule.
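The weight update rule itself is the same for every one of these matrices: each weight moves a small step against its gradient. A minimal sketch, with made-up gradient values standing in for the derivatives that backpropagation through time would actually produce:

```python
import numpy as np

def sgd_update(weights, grads, lr=0.01):
    # Weight update rule: theta <- theta - lr * dL/dtheta,
    # applied to every weight matrix in the GRU cell.
    return {name: w - lr * grads[name] for name, w in weights.items()}

# Toy example: two of the GRU weights with fabricated gradients
weights = {"U_z": np.ones((2, 2)), "V": np.full((1, 2), 0.5)}
grads   = {"U_z": np.full((2, 2), 0.1), "V": np.zeros((1, 2))}

new_weights = sgd_update(weights, grads, lr=0.1)
# U_z entries shrink by 0.1 * 0.1 = 0.01; V is unchanged (zero gradient)
```

In practice the same dictionary would hold all seven matrices (U_z, U_r, U, W_z, W_r, W, V), with gradients computed by backpropagation through time rather than supplied by hand.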