Gradient descent with momentum

We have a problem with SGD and mini-batch gradient descent due to the oscillations in the parameter updates. Take a look at the following plot, which shows how mini-batch gradient descent attains convergence. As you can see, there are oscillations in the gradient steps, shown by the dotted line: the algorithm makes a gradient step in one direction, then takes a different direction, and so on, until it reaches convergence:

This oscillation occurs because we update the parameters after iterating through every n data points, so the direction of each update has some variance, which leads to oscillations in every gradient step. Because of these oscillations, it is hard to reach convergence, and the process of attaining it slows down.

To alleviate this, we'll introduce a new technique called momentum. If we can understand what the right direction is for the gradient steps to attain convergence faster, then we can make our gradient steps navigate in that direction and reduce the oscillation in the irrelevant directions; that is, we can reduce taking directions that do not lead us to convergence.

So, how can we do this? We basically take a fraction of the parameter update from the previous gradient step and add it to the current gradient step. In physics, momentum keeps an object moving after a force is applied. Here, the momentum keeps our gradient moving toward the direction that leads to convergence.

If you take a look at the following equation, you can see that we are taking the parameter update from the previous step, $v_{t-1}$, and adding it to the current gradient step, $\eta \nabla_{\theta} J(\theta)$. How much information we take from the previous gradient step depends on the momentum factor, $\gamma$, and the learning rate, which is denoted by $\eta$:

$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta)$

In the preceding equation, $v_t$ is called velocity, and it accelerates gradients in the direction that leads to convergence. It also reduces oscillations in irrelevant directions by adding a fraction of the parameter update from the previous step to the current step.
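To see why this helps, it can be useful to unroll the recursion for a few steps. This expansion is not given in the text, but it follows directly from the preceding equation if we write the gradient at step $t$ as $\nabla_{\theta} J(\theta_t)$ and start from $v_0 = 0$:

$v_t = \eta \nabla_{\theta} J(\theta_t) + \gamma \eta \nabla_{\theta} J(\theta_{t-1}) + \gamma^2 \eta \nabla_{\theta} J(\theta_{t-2}) + \cdots$

Gradient components that keep pointing in the same direction accumulate, while components that flip sign from one step to the next largely cancel out, which is what dampens the oscillations.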

Thus, the parameter update equation with momentum is expressed as follows:

$\theta = \theta - v_t$
By doing this, mini-batch gradient descent with momentum reduces oscillations in the gradient steps and attains convergence faster.
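To make this effect concrete, here is a minimal sketch (not from the text) that compares plain gradient descent with momentum on a toy, ill-conditioned quadratic loss; the toy_gradients helper and all of the constants are illustrative assumptions:

import numpy as np

# Toy quadratic loss J(theta) = 0.5 * (25 * theta[0]**2 + theta[1]**2);
# its gradient is the element-wise product below. Purely illustrative.
def toy_gradients(theta):
    return np.array([25.0, 1.0]) * theta

theta_plain = np.array([1.0, 1.0])   # plain gradient descent
theta_mom = np.array([1.0, 1.0])     # gradient descent with momentum
vt = np.zeros(2)
lr, gamma = 1e-2, 0.9

for t in range(200):
    # plain update: theta = theta - lr * gradient
    theta_plain = theta_plain - lr * toy_gradients(theta_plain)

    # momentum update: v_t = gamma * v_{t-1} + lr * gradient, then theta = theta - v_t
    vt = gamma * vt + lr * toy_gradients(theta_mom)
    theta_mom = theta_mom - vt

print('distance to the minimum without momentum:', np.linalg.norm(theta_plain))
print('distance to the minimum with momentum   :', np.linalg.norm(theta_mom))

With these settings, the version with momentum ends up much closer to the minimum after the same number of steps.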

Now, let's look at the implementation of momentum.

First, we define the momentum function, as follows:

def momentum(data, theta, lr = 1e-2, gamma = 0.9, num_iterations = 1000):

Then, we initialize vt with zeros:

    vt = np.zeros(theta.shape[0])

Next, we loop for the given number of iterations:

    for t in range(num_iterations):

Now, we compute gradients with respect to theta:

        gradients = compute_gradients(data, theta)

Next, we update vt as $v_t = \gamma v_t + \eta \nabla_{\theta} J(\theta)$:

        vt = gamma * vt + lr * gradients

Now, we update the model parameter, theta, as $\theta = \theta - v_t$:

        theta = theta - vt

Finally, after all of the iterations, we return the updated parameter, theta:

    return theta
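Putting all of these snippets together, the following is a self-contained sketch. Note that compute_gradients is defined earlier in the book; the linear-regression version below (mean squared error on two-column data of x and y values) is only an assumed stand-in so that the example runs on its own:

import numpy as np

# Assumed stand-in for the book's compute_gradients: gradient of the mean
# squared error of a simple line y_hat = theta[0] + theta[1] * x.
def compute_gradients(data, theta):
    x, y = data[:, 0], data[:, 1]
    y_hat = theta[0] + theta[1] * x
    grad_b = -2 * np.mean(y - y_hat)          # gradient w.r.t. the intercept
    grad_m = -2 * np.mean((y - y_hat) * x)    # gradient w.r.t. the slope
    return np.array([grad_b, grad_m])

def momentum(data, theta, lr = 1e-2, gamma = 0.9, num_iterations = 1000):
    vt = np.zeros(theta.shape[0])             # initialize the velocity with zeros
    for t in range(num_iterations):
        gradients = compute_gradients(data, theta)
        vt = gamma * vt + lr * gradients      # v_t = gamma * v_{t-1} + eta * gradient
        theta = theta - vt                    # theta = theta - v_t
    return theta

# Example usage on synthetic data generated from y = 2x + 1 plus noise
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 500)
y = 2 * x + 1 + 0.1 * rng.randn(500)
data = np.column_stack([x, y])
theta = momentum(data, np.zeros(2))
print(theta)   # should end up close to [1., 2.] for this synthetic data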