After performing forward propagation and predicting the output, we compute the loss. We use mean squared error as our loss function: we compute the MSE at each time step and then average those per-step losses into a single scalar loss:
# Per-time-step MSE between the reshaped target and the prediction
losses = []
for i in range(len(y_hat)):
    losses.append(tf.losses.mean_squared_error(
        tf.reshape(target[i], (-1, 1)), y_hat[i]))

# Average the per-step losses into a single scalar loss
loss = tf.reduce_mean(losses)
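To make the loss concrete, here is a minimal NumPy sketch (not TensorFlow) of the same computation; the array shapes are hypothetical, chosen only for illustration:

```python
import numpy as np

def sequence_mse(targets, predictions):
    """Mean over time steps of the per-step mean squared error."""
    step_losses = [np.mean((t - p) ** 2) for t, p in zip(targets, predictions)]
    return np.mean(step_losses)

# Two time steps: the first is predicted exactly (MSE 0.0),
# the second is off by 2 in one component (MSE 2.0).
targets = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
preds = [np.array([1.0, 2.0]), np.array([3.0, 2.0])]
loss = sequence_mse(targets, preds)  # mean of [0.0, 2.0] -> 1.0
```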
To avoid the exploding gradient problem, we perform gradient clipping:
# Compute gradients of the loss with respect to all trainable variables,
# then rescale them so their combined (global) L2 norm does not exceed 4.0
gradients = tf.gradients(loss, tf.trainable_variables())
clipped, _ = tf.clip_by_global_norm(gradients, 4.0)
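The semantics of global-norm clipping can be sketched in plain NumPy: if the L2 norm computed over all gradients together exceeds the threshold, every gradient is scaled down by the same factor, preserving their relative directions. This is an illustrative stand-in, not the TensorFlow implementation:

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # Global L2 norm across all gradient tensors combined
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > clip_norm:
        # Scale every gradient uniformly so the new global norm equals clip_norm
        grads = [g * (clip_norm / global_norm) for g in grads]
    return grads, global_norm

grads = [np.array([3.0, 4.0])]            # global norm is 5.0
clipped, norm = clip_by_global_norm(grads, 4.0)
# clipped[0] is [2.4, 3.2] (scaled by 4.0 / 5.0); norm is 5.0
```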
We use the Adam optimizer and apply the clipped gradients to minimize our loss. Note that we must pass `clipped`, not the original `gradients`, to `apply_gradients`; otherwise the clipping step has no effect:

# Apply the clipped gradients with Adam
optimizer = tf.train.AdamOptimizer(learning_rate).apply_gradients(
    zip(clipped, tf.trainable_variables()))
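The whole training step, loss to gradients to clipping to parameter update, can be sketched end to end with plain gradient descent standing in for Adam (an illustrative simplification, since Adam also tracks running moment estimates):

```python
import numpy as np

def train_step(params, grads, clip_norm, learning_rate):
    """One update: clip gradients by global norm, then descend."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, clip_norm / global_norm) if global_norm > 0 else 1.0
    # The *clipped* gradients drive the update
    return [p - learning_rate * g * scale for p, g in zip(params, grads)]

params = [np.array([1.0, 1.0])]
grads = [np.array([30.0, 40.0])]          # global norm 50.0 > clip_norm 4.0
new_params = train_step(params, grads, clip_norm=4.0, learning_rate=0.1)
# clipped gradient is [2.4, 3.2]; update gives [0.76, 0.68]
```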