Adaptive moment estimation with AMSGrad

One problem with the Adam algorithm is that it sometimes fails to attain optimal convergence, or it converges to a suboptimal solution instead of the globally optimal one. This is due to the use of exponential moving averages of gradients. Remember how we used the exponential moving averages of gradients in Adam to avoid the problem of learning rate decay?

However, the problem is that, since we take an exponential moving average of gradients, we miss out on information about gradients that occur infrequently.

To resolve this issue, the authors of AMSGrad made a small change to the Adam algorithm. Recall the second-order moment estimate we saw in Adam, as follows:

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

In AMSGrad, we use a slightly modified version of $v_t$. Instead of using $v_t$ directly, we take the maximum value of $v_t$ up to the previous step, as follows:

$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$

This way, we retain the informative gradients instead of having them phased out by the exponential moving average.
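
To see the effect of taking the maximum, consider the following small sketch (the gradient sequence here is made up purely for illustration). It compares Adam's second moment, vt, with AMSGrad's vt_hat when a large gradient occurs only once: vt decays back towards small values, while vt_hat never decreases, so the rare, informative gradient is not forgotten:

import numpy as np

beta2 = 0.9
gradients = [0.1, 10.0, 0.1, 0.1, 0.1]        # the large gradient occurs only once

vt, vt_hat = 0.0, 0.0
for g in gradients:
    vt = beta2 * vt + (1. - beta2) * g ** 2   # Adam: decays back towards small values
    vt_hat = np.maximum(vt_hat, vt)           # AMSGrad: never decreases
    print(vt, vt_hat)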

So, our final update equation becomes the following:

$\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

Now, let's understand how to code AMSGrad in Python.

First, we define the AMSGrad function, as follows:

def AMSGrad(data, theta, lr = 1e-2, beta1 = 0.9, beta2 = 0.9, epsilon = 1e-6, num_iterations = 1000):

Then, we initialize the first moment, mt, the second moment, vt, and the modified version of vt, that is, vt_hat, with zeros, as follows:

    mt = np.zeros(theta.shape[0])
    vt = np.zeros(theta.shape[0])
    vt_hat = np.zeros(theta.shape[0])

For every iteration, we perform the following steps:

    for t in range(num_iterations):

Now, we can compute the gradients with respect to theta:

        gradients = compute_gradients(data, theta) 

Then, we compute the first moment, mt, as $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$:

        mt = beta1 * mt + (1. - beta1) * gradients

Next, we update the second moment, vt, as $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$:

        vt = beta2 * vt + (1. - beta2) * gradients ** 2

In AMSGrad, we use a slightly modified version of $v_t$. Instead of using $v_t$ directly, we take the maximum value of $v_t$ up to the previous step. Thus, $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$ is implemented as follows:

        vt_hat = np.maximum(vt_hat,vt)

Here, we compute the bias-corrected estimate of mt, that is, $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$:

        mt_hat = mt / (1. - beta1 ** (t+1))

Now, we can update the model parameter, theta, as $\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$:

        theta = theta - (lr / (np.sqrt(vt_hat) + epsilon)) * mt_hat

After completing all the iterations, we return the updated value of theta:

    return theta
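
As a quick sanity check, here is a minimal usage sketch. Note that compute_gradients is a hypothetical stand-in defined here only so that the example is self-contained; it returns the gradient of a simple quadratic loss, which is minimized when theta equals data:

import numpy as np

def compute_gradients(data, theta):
    # Hypothetical gradient function for illustration only: gradient of the
    # quadratic loss L(theta) = sum((theta - data) ** 2)
    return 2. * (theta - data)

data = np.array([3.0, -1.5])       # values the loss is centered on
theta = np.zeros(data.shape[0])    # initial parameter values

theta = AMSGrad(data, theta, lr = 1e-2, num_iterations = 10000)
print(theta)                       # converges towards [3.0, -1.5]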