Nadam – adding NAG to ADAM

Nadam is another small extension of the Adam method. As the name suggests, here, we incorporate NAG into Adam. First, let's recall what we learned about Adam.

We calculated the first and second moments as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Here, $g_t$ is the gradient at step $t$, and $\beta_1$ and $\beta_2$ are the exponential decay rates of the two moments.

Then, we calculated the bias-corrected estimates of the first and second moments, as follows:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Our final update equation of Adam is expressed as follows:

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$
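For reference, a single Adam step with these equations can be sketched in NumPy as follows. This is only a minimal sketch: gradients stands for the gradient of the loss at theta (in this chapter, it comes from the compute_gradients helper), and t is the 1-indexed step number:

import numpy as np

def adam_step(theta, gradients, mt, vt, t, lr=1e-2, beta1=0.9, beta2=0.999, epsilon=1e-6):
    # One Adam update for 1-indexed step number t; gradients is the gradient
    # of the loss at theta.
    mt = beta1 * mt + (1. - beta1) * gradients          # first moment
    vt = beta2 * vt + (1. - beta2) * gradients ** 2     # second moment
    mt_hat = mt / (1. - beta1 ** t)                     # bias-corrected first moment
    vt_hat = vt / (1. - beta2 ** t)                     # bias-corrected second moment
    theta = theta - (lr / (np.sqrt(vt_hat) + epsilon)) * mt_hat
    return theta, mt, vt

Nadam keeps exactly this structure and only changes which momentum term enters the update, as we will see next.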

Now, we will see how Nadam modifies Adam to use Nesterov momentum. In Adam, we compute the first moment as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

We change this first moment so that it becomes Nesterov-accelerated momentum. That is, instead of using the previous momentum in the update, we use the current momentum and treat it as a lookahead; the momentum term that enters the update becomes the following:

$$\beta_1 m_t + (1 - \beta_1)\, g_t$$

We can't compute the bias-corrected estimates in the same way as we computed them in Adam because, here, the gradient, $g_t$, comes from the current step, while the lookahead momentum, $\beta_1 m_t$, only takes full effect at the subsequent step. Therefore, we change the bias-corrected estimate step so that both terms are corrected, as follows:

$$\hat{g}_t = \frac{g_t}{1 - \beta_1^{t}}, \qquad \hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}$$

(Strictly speaking, the lookahead term could be corrected with $1 - \beta_1^{t+1}$; for simplicity, we use the same denominator for both, which matches the implementation that follows.)

Thus, we can rewrite our first-moment equation, in its bias-corrected form, as follows:

$$\tilde{m}_t = \beta_1 \hat{m}_t + (1 - \beta_1)\,\hat{g}_t$$

Therefore, our final update equation becomes the following:

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\tilde{m}_t$$

The second moment, $v_t$, and its bias-corrected estimate, $\hat{v}_t$, are computed exactly as in Adam.
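As a quick sanity check of these formulas, consider the very first step, $t = 1$, with $\beta_1 = 0.9$ and $m_0 = 0$. Then $m_1 = 0.1\,g_1$, so $\hat{m}_1 = 0.1\,g_1 / (1 - 0.9) = g_1$ and $\hat{g}_1 = g_1 / (1 - 0.9) = 10\,g_1$, which gives $\tilde{m}_1 = 0.9 \times g_1 + 0.1 \times 10\,g_1 = 1.9\,g_1$. Plain Adam would use $\hat{m}_1 = g_1$ at this point, so the extra $0.9\,g_1$ is exactly the lookahead contributed by the freshly accumulated momentum.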

Now let's see how we can implement the Nadam algorithm in Python.

First, we define the nadam function:

def nadam(data, theta, lr = 1e-2, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-6, num_iterations = 500):

Then, we initialize the first moment, mt, and the second moment, vt, with zeros:

    mt = np.zeros(theta.shape[0])
    vt = np.zeros(theta.shape[0])

Next, we set beta_prod to 1:

    beta_prod = 1

For every iteration, we perform the following steps. Note that the loop variable, t, starts at 0 in the code, so the step number t in the preceding equations corresponds to t + 1 here:

    for t in range(num_iterations):

Then, we compute the gradients with respect to theta:

        gradients = compute_gradients(data, theta)

Afterward, we compute the first moment, mt, so that it's $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$:

        mt = beta1 * mt + (1. - beta1) * gradients

Now, we can update the second moment, vt, so that it's $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$:

        vt = beta2 * vt + (1. - beta2) * gradients ** 2

Now, we compute beta_prod; that is, the running product of beta1, $\beta_1^{t}$:

        beta_prod = beta_prod * beta1   # equals beta1 ** (t + 1), i.e. beta1 to the power of the step number

Next, we compute the bias-corrected estimate of mt so that it's $\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}$:

        mt_hat = mt / (1. - beta_prod)

Then, we compute the bias-corrected estimate of the gradient, gt, so that it's $\hat{g}_t = \frac{g_t}{1 - \beta_1^{t}}$:

        g_hat = gradients / (1. - beta_prod)

From here, we compute the bias-corrected estimate of vt so that it's $\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}$:

        vt_hat = vt / (1. - beta2 ** (t + 1))   # t + 1 because the loop counter starts at 0

Now, we compute mt_tilde so that it's $\tilde{m}_t = \beta_1 \hat{m}_t + (1 - \beta_1)\,\hat{g}_t$:

        mt_tilde = beta1 * mt_hat + (1. - beta1) * g_hat

Finally, we update the model parameter, theta, by using $\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\tilde{m}_t$:

        theta = theta - (lr / (np.sqrt(vt_hat) + epsilon)) * mt_tilde

    return theta
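Putting the preceding snippets together, here is the complete function in one place, along with a small usage sketch. Note that compute_gradients and the data array below are hypothetical stand-ins (a simple linear-regression loss) used only to make the sketch self-contained and runnable; the book's actual regression setup is in the notebook linked below.

import numpy as np

def compute_gradients(data, theta):
    # Hypothetical stand-in for the chapter's helper: gradient of the mean
    # squared error of the linear model y = theta[0] * x + theta[1] on
    # data = [[x1, y1], [x2, y2], ...].
    x, y = data[:, 0], data[:, 1]
    y_pred = theta[0] * x + theta[1]
    grad_slope = 2 * np.mean((y_pred - y) * x)
    grad_intercept = 2 * np.mean(y_pred - y)
    return np.array([grad_slope, grad_intercept])

def nadam(data, theta, lr = 1e-2, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-6, num_iterations = 500):
    mt = np.zeros(theta.shape[0])
    vt = np.zeros(theta.shape[0])
    beta_prod = 1
    for t in range(num_iterations):
        gradients = compute_gradients(data, theta)
        mt = beta1 * mt + (1. - beta1) * gradients
        vt = beta2 * vt + (1. - beta2) * gradients ** 2
        beta_prod = beta_prod * beta1                      # beta1 ** (t + 1)
        mt_hat = mt / (1. - beta_prod)                     # bias-corrected first moment
        g_hat = gradients / (1. - beta_prod)               # bias-corrected gradient
        vt_hat = vt / (1. - beta2 ** (t + 1))              # bias-corrected second moment
        mt_tilde = beta1 * mt_hat + (1. - beta1) * g_hat   # Nesterov lookahead
        theta = theta - (lr / (np.sqrt(vt_hat) + epsilon)) * mt_tilde
    return theta

# Usage sketch: fit y = 3x + 2 from noisy samples.
x = np.linspace(-1, 1, 100)
y = 3 * x + 2 + 0.1 * np.random.randn(100)
data = np.column_stack([x, y])
theta = nadam(data, np.zeros(2), num_iterations=5000)
print(theta)   # roughly [3., 2.]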

By doing this, we have learned about various popular variants of the gradient descent algorithm that are used for training neural networks. The complete code to perform regression with all of these variants is available as a Jupyter Notebook at http://bit.ly/2XoW0vH.
