Gradient Descent and Its Variants

Gradient descent is one of the most popular and widely used optimization algorithms. It is a first-order optimization algorithm, which means that it uses only the first-order derivative of the function being minimized. As we saw in Chapter 1, Introduction to Deep Learning, we applied gradient descent by calculating the first-order derivative of the loss function with respect to the weights of the network in order to minimize the loss.
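As a minimal sketch of this idea (the function name and learning rate below are illustrative assumptions, not code from Chapter 1), the basic update rule subtracts the first-order derivative, scaled by a learning rate, from the current parameter value:

    # A minimal sketch of the vanilla gradient descent update rule.
    # The function name and default learning rate are illustrative assumptions.
    def gradient_descent_step(theta, gradient, learning_rate=0.1):
        # Step against the first-order derivative to reduce the loss
        return theta - learning_rate * gradient

    # Example: minimizing f(theta) = theta**2, whose derivative is 2 * theta
    theta = 5.0
    for _ in range(100):
        theta = gradient_descent_step(theta, 2 * theta)
    print(theta)  # close to 0, the minimum of f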

Gradient descent is not only applicable to neural networks; it is used wherever we need to find the minimum of a function. In this chapter, we will go deeper into gradient descent, starting with the basics, and learn several variants of the algorithm that are used for training neural networks. First, we will understand Stochastic Gradient Descent (SGD) and mini-batch gradient descent. Then, we'll explore how momentum is used to speed up gradient descent and reach convergence faster. Later in this chapter, we will learn how to perform gradient descent in an adaptive manner using algorithms such as Adagrad, Adadelta, RMSProp, Adam, Adamax, AMSGrad, and Nadam. Throughout, we will take a simple linear regression equation and see how we can find the minimum of its cost function using the various types of gradient descent algorithms, as previewed in the sketch that follows.
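Here is a small, hedged preview of that running example, assuming synthetic data and a mean squared error cost; the variable names (m, b, learning_rate) are illustrative and not necessarily the ones used later in the chapter:

    import numpy as np

    # Synthetic data for y = 2x + 1 with a little noise (an illustrative assumption)
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=100)
    y = 2 * x + 1 + 0.1 * rng.standard_normal(100)

    m, b = 0.0, 0.0        # parameters of the model y_hat = m * x + b
    learning_rate = 0.1

    for _ in range(1000):
        y_hat = m * x + b
        error = y_hat - y
        # First-order derivatives of the mean squared error cost
        dm = 2 * np.mean(error * x)
        db = 2 * np.mean(error)
        # Vanilla (batch) gradient descent update
        m -= learning_rate * dm
        b -= learning_rate * db

    print(m, b)  # should be close to the true values 2 and 1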

In this chapter, we will learn about the following topics:

  • Demystifying gradient descent
  • Gradient descent versus stochastic gradient descent
  • Momentum and Nesterov accelerated gradient
  • Adaptive methods of gradient descent