Adamax – Adam based on infinity-norm

Now, we will look at a small variant of the Adam algorithm called Adamax. Let's recall the equation of the second moment in Adam:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

As you may have noticed from the preceding equation, we scale the gradients inversely proportional to the $\ell_2$ norm of the current and past gradients (the $\ell_2$ norm here basically means the square of the gradient values):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) |g_t|^2$$

Instead of having just the $\ell_2$ norm, can we generalize it to the $\ell_p$ norm? In general, when we use a large value of $p$ for the $\ell_p$ norm, the update becomes numerically unstable. However, when we set the value of $p$ to infinity, that is, when $p \to \infty$, the equation becomes simple and stable. Instead of raising just the gradient, $|g_t|$, to the power of $p$, we also parameterize the decay rate as $\beta_2^p$. Thus, we can write the following:

$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p$$

When we take the limit as $p$ tends to infinity, we get the following final equation:

$$v_t = \max\left( \beta_2^{t-1}|g_1|,\; \beta_2^{t-2}|g_2|,\; \ldots,\; \beta_2\,|g_{t-1}|,\; |g_t| \right)$$

You can check the paper listed in the Further reading section at the end of this chapter to see how exactly this is derived.

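For intuition, here is a brief sketch of the limiting step, assuming $v_0 = 0$ so that the $\ell_p$ recursion unrolls into the sum inside the limit below. Since $(1 - \beta_2^p)^{1/p} \to 1$ and the $p$-norm of a finite sequence tends to its largest element as $p \to \infty$, we have the following:

$$\lim_{p \to \infty} \left( (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)}\, |g_i|^p \right)^{1/p} = \max\left( \beta_2^{t-1}|g_1|,\; \beta_2^{t-2}|g_2|,\; \ldots,\; |g_t| \right)$$
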
We can rewrite the preceding equation as a simple recursive equation, as follows:

$$v_t = \max\left( \beta_2 \cdot v_{t-1},\; |g_t| \right)$$

Computing the first moment, $m_t$, is similar to what we saw in the Adaptive moment estimation section, so we can write the following directly:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

Next, we can compute the bias-corrected estimate of $m_t$:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

Therefore, the final update equation becomes the following:

$$\theta_t = \theta_{t-1} - \frac{\eta}{v_t + \epsilon}\, \hat{m}_t$$

To better understand the Adamax algorithm, let's code it, step by step.

First, we define the Adamax function, as follows:

def Adamax(data, theta, lr = 1e-2, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-6, num_iterations = 1000):

Then, we initialize the first moment, mt, and the second moment, vt, with zeros:

    mt = np.zeros(theta.shape[0])
    vt = np.zeros(theta.shape[0])

For every iteration, we perform the following steps:

    for t in range(num_iterations):

Now, we can compute the gradients with respect to theta, as follows:

        gradients = compute_gradients(data, theta) 

Then, we compute the first moment, mt, as $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$:

        mt = beta1 * mt + (1. - beta1) * gradients

Next, we compute the second moment, vt, as $v_t = \max\left( \beta_2 \cdot v_{t-1},\; |g_t| \right)$:

        vt = np.maximum(beta2 * vt, np.abs(gradients))

Now, we can compute the bias-corrected estimate of mt; that is, $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$:

        mt_hat = mt / (1. - beta1 ** (t+1))

Update the model parameter, theta, so that it becomes $\theta_t = \theta_{t-1} - \frac{\eta}{v_t + \epsilon}\, \hat{m}_t$:

        theta = theta - ((lr / (vt + epsilon)) * mt_hat)

    return theta
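
Finally, we can try the function end to end. The following is a minimal sketch that fits a toy linear model; note that the data array and the compute_gradients function used throughout this chapter come from the earlier gradient descent sections, so the stand-in versions defined here (a two-parameter model y = theta[0] * x + theta[1] with a mean squared error loss) are assumptions made purely for illustration:

import numpy as np

# Illustrative stand-in for the chapter's compute_gradients function:
# gradient of the mean squared error for y_hat = theta[0] * x + theta[1]
def compute_gradients(data, theta):
    x, y = data[:, 0], data[:, 1]
    y_hat = theta[0] * x + theta[1]
    grad_slope = np.mean(2 * (y_hat - y) * x)
    grad_intercept = np.mean(2 * (y_hat - y))
    return np.array([grad_slope, grad_intercept])

# Toy dataset: points scattered around the line y = 3x + 1
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 100)
y = 3 * x + 1 + 0.1 * rng.randn(100)
data = np.column_stack([x, y])

theta = np.zeros(2)
theta = Adamax(data, theta, num_iterations=3000)
print(theta)   # should end up close to [3, 1]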