Adaptive moment estimation

Adaptive moment estimation, known as Adam for short, is one of the most widely used algorithms for optimizing a neural network. While reading about RMSProp, we learned that we compute the running average of squared gradients to avoid the diminishing learning rate problem:

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$

The final update equation of RMSProp is given as follows:

$\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$

Similar to this, in Adam, we also compute the running average of the squared gradients. However, along with computing the running average of the squared gradients, we also compute the running average of the gradients.

The running average of gradients is given as follows:

$E[g]_t = \beta_1 E[g]_{t-1} + (1 - \beta_1) g_t \qquad (16)$

The running average of squared gradients is given as follows:

$E[g^2]_t = \beta_2 E[g^2]_{t-1} + (1 - \beta_2) g_t^2 \qquad (17)$

Since a lot of the literature and libraries represent the decay rate in Adam as $\beta$ instead of $\gamma$, we'll also use $\beta$ to represent the decay rate in Adam. Thus, $\beta_1$ and $\beta_2$ in equations (16) and (17) denote the exponential decay rates for the running average of the gradients and the squared gradients, respectively.

So, our updated equation becomes the following:

$\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, E[g]_t$

The running average of the gradients and the running average of the squared gradients are basically the first and second moments of the gradients. That is, they are the mean and the uncentered variance of our gradients, respectively. So, for notational simplicity, let's denote $E[g]_t$ as $m_t$ and $E[g^2]_t$ as $v_t$.

Therefore, we can rewrite equations (16) and (17) as follows:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

We begin by setting the initial moment estimates to zero. That is, we initialize $m_0$ and $v_0$ with zeros. When the initial estimates are set to 0, they remain very small, even after many iterations. This means that they would be biased toward 0, especially when $\beta_1$ and $\beta_2$ are close to 1. To combat this, we compute the bias-corrected estimates of $m_t$ and $v_t$ by simply dividing them by $1 - \beta_1^t$ and $1 - \beta_2^t$, respectively, as follows:

$\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$

$\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$

Here, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates of $m_t$ and $v_t$, respectively.
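
To see why this correction matters, consider the first time step as a worked example. With $m_0 = 0$ and $\beta_1 = 0.9$, the first moment after one update is $m_1 = 0.9 \cdot 0 + 0.1 \, g_1 = 0.1 \, g_1$, which heavily underestimates the actual gradient. Dividing by $1 - \beta_1^1 = 0.1$ gives $\hat{m}_1 = g_1$, which removes the bias; as $t$ grows, $1 - \beta_1^t$ approaches 1 and the correction gradually fades away.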

So, our final update equation is given as follows:

$\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$

Now, let's understand how to implement Adam in Python.

First, let's define the Adam function, as follows:

def Adam(data, theta, lr = 1e-2, beta1 = 0.9, beta2 = 0.9, epsilon = 1e-6, num_iterations = 1000):

Then, we initialize the first moment, mt, and the second moment, vt, with zeros:

    mt = np.zeros(theta.shape[0])
    vt = np.zeros(theta.shape[0])

For every iteration, we perform the following steps:

    for t in range(num_iterations):

Next, we compute the gradients with respect to theta:

        gradients = compute_gradients(data, theta) 

Then, we update the first moment, mt, so that it's $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$:

        mt = beta1 * mt + (1. - beta1) * gradients

Next, we update the second moment, vt, so that it's $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$:

        vt = beta2 * vt + (1. - beta2) * gradients ** 2

Now, we compute the bias-corrected estimate of mt, that is, $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$:

        mt_hat = mt / (1. - beta1 ** (t+1))

Next, we compute the bias-corrected estimate of vt, that is, $\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$:

        vt_hat = vt / (1. - beta2 ** (t+1))

Finally, we update the model parameter, theta, so that it's $\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$:

        theta = theta - (lr / (np.sqrt(vt_hat) + epsilon)) * mt_hat

    return theta
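
To try the function end to end, we need a dataset and a compute_gradients function matching the signature used above. The following sketch is only an illustration, not part of the original example: it assumes a toy dataset of (x, y) points and a stand-in compute_gradients that returns the gradient of the mean squared error of a one-parameter linear model:

import numpy as np

# Stand-in for compute_gradients: gradient of the mean squared error
# of the one-parameter linear model y_hat = x * theta[0]
def compute_gradients(data, theta):
    x, y = data[:, 0], data[:, 1]
    y_hat = x * theta[0]
    return np.array([2 * np.mean((y_hat - y) * x)])

# Toy dataset: points lying on the line y = 2x
data = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
theta = np.zeros(1)

theta = Adam(data, theta, lr=1e-1, num_iterations=2000)
print(theta)   # should approach [2.]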