Now, we will look at a small variant of the Adam algorithm called Adamax. Let's recall the equation of the second-order moment in Adam:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
As you may have noticed from the preceding equation, we scale the gradients inversely proportional to the $\ell_2$ norm of the current and past gradients (the $\ell_2$ norm basically means squaring the values):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) |g_t|^2$$
Instead of having just $\ell_2$, can we generalize it to the $\ell_p$ norm? In general, when we have a large value of $p$ for the $\ell_p$ norm, our update would become unstable. However, when we set the value of $p$ to $\infty$, that is, when $p \to \infty$, the equation becomes simple and stable. Instead of just parameterizing the gradients, $|g_t|^p$, alone, we also parameterize the decay rate, $\beta_2^p$. Thus, we can write the following:

$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p$$
When we take the limit, as $p$ tends to infinity, we get the following final equation:

$$v_t = \lim_{p \to \infty} \left( (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} |g_i|^p \right)^{1/p} = \max\left(\beta_2^{t-1}|g_1|,\ \beta_2^{t-2}|g_2|,\ \ldots,\ |g_t|\right)$$
We can rewrite the preceding equation as a simple recursive equation, as follows:

$$v_t = \max(\beta_2 \cdot v_{t-1},\ |g_t|)$$
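We can check this limit numerically. The following sketch evaluates the $p$-norm recursion $\left(\beta_2^p v_{t-1}^p + (1-\beta_2^p)|g_t|^p\right)^{1/p}$ for increasingly large values of $p$ and compares it against the infinity-norm recursion $\max(\beta_2 v_{t-1}, |g_t|)$ (the values of `beta2`, `v_prev`, and `g` here are arbitrary illustrative choices, not anything from the text):

```python
# Numerical sanity check: as p grows, the p-norm recursion
# (beta2^p * v_prev^p + (1 - beta2^p) * |g|^p)^(1/p)
# approaches the infinity-norm recursion max(beta2 * v_prev, |g|).
beta2 = 0.999
v_prev, g = 0.5, 0.3          # arbitrary previous moment and current gradient

target = max(beta2 * v_prev, abs(g))
for p in [1, 2, 10, 100, 1000]:
    v = (beta2**p * v_prev**p + (1 - beta2**p) * abs(g)**p) ** (1.0 / p)
    print(f"p = {p:5d}  v = {v:.6f}  |v - target| = {abs(v - target):.2e}")
```

The gap shrinks toward zero as `p` increases, which is why the $p \to \infty$ case can be implemented with a plain `max` instead of powers and roots.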
Computing $m_t$ is similar to what we saw in the Adaptive moment estimation section, so we can write the following directly:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
By doing this, we can compute the bias-corrected estimate of $m_t$:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
Therefore, the final update equation becomes the following:

$$\theta_t = \theta_{t-1} - \frac{\eta}{v_t + \epsilon} \hat{m}_t$$
To better understand the Adamax algorithm, let's code it, step by step.
First, we define the Adamax function, as follows:
def Adamax(data, theta, lr = 1e-2, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-6, num_iterations = 1000):
Then, we initialize the first moment, mt, and the second moment, vt, with zeros:
mt = np.zeros(theta.shape[0])
vt = np.zeros(theta.shape[0])
For every iteration, we perform the following steps:
for t in range(num_iterations):
Now, we can compute the gradients with respect to theta, as follows:
gradients = compute_gradients(data, theta)
Then, we compute the first moment, mt, as $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$:
mt = beta1 * mt + (1. - beta1) * gradients
Next, we compute the second moment, vt, as $v_t = \max(\beta_2 \cdot v_{t-1},\ |g_t|)$:
vt = np.maximum(beta2 * vt, np.abs(gradients))
Now, we can compute the bias-corrected estimate of mt; that is, $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$:
mt_hat = mt / (1. - beta1 ** (t+1))
Update the model parameter, theta, so that it's $\theta_t = \theta_{t-1} - \frac{\eta}{v_t + \epsilon} \hat{m}_t$:
theta = theta - ((lr / (vt + epsilon)) * mt_hat)
return theta
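Putting the preceding steps together, we get a complete, runnable sketch. The `compute_gradients` function is defined earlier in the chapter; since it is not shown here, the version below is a hypothetical stand-in that returns the gradient of the mean squared error for a simple linear model, and the synthetic dataset is likewise invented for illustration:

```python
import numpy as np

# Hypothetical stand-in for the chapter's compute_gradients function:
# gradient of mean squared error for the linear model
# prediction = theta[0] * x + theta[1] over a dataset of (x, y) rows.
def compute_gradients(data, theta):
    x, y = data[:, 0], data[:, 1]
    pred = theta[0] * x + theta[1]
    grad_w = 2 * np.mean((pred - y) * x)   # d(MSE)/d(theta[0])
    grad_b = 2 * np.mean(pred - y)         # d(MSE)/d(theta[1])
    return np.array([grad_w, grad_b])

def Adamax(data, theta, lr=1e-2, beta1=0.9, beta2=0.999,
           epsilon=1e-6, num_iterations=1000):
    # Initialize the first moment mt and the second moment vt with zeros.
    mt = np.zeros(theta.shape[0])
    vt = np.zeros(theta.shape[0])
    for t in range(num_iterations):
        gradients = compute_gradients(data, theta)
        # First moment: EMA of gradients.
        mt = beta1 * mt + (1. - beta1) * gradients
        # Second moment: infinity-norm recursion (elementwise max).
        vt = np.maximum(beta2 * vt, np.abs(gradients))
        # Bias-corrected first moment.
        mt_hat = mt / (1. - beta1 ** (t + 1))
        # Parameter update.
        theta = theta - ((lr / (vt + epsilon)) * mt_hat)
    return theta

# Fit y = 2x + 1 on a small synthetic dataset.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 1
data = np.stack([x, y], axis=1)

theta = Adamax(data, np.zeros(2), lr=0.05, num_iterations=2000)
print(theta)  # should approach [2., 1.]
```

Note that, because `vt` is an elementwise maximum rather than a squared average, no bias correction is needed for the second moment in Adamax.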