Summary

We started off this chapter by learning what convex and non-convex functions are. Then, we explored how to find the minimum of a function using gradient descent, which minimizes a loss function by repeatedly updating the model's parameters in the direction of the negative gradient. Later, we looked at SGD, where we update the parameters after computing the gradient on each individual data point, and then we learned about mini-batch SGD, where we update the parameters after iterating through a batch of data points.
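As a quick recap, the following is a minimal NumPy sketch of the mini-batch SGD update described above; the function names, the linear-model squared-error loss, and the default hyperparameters are illustrative choices rather than the chapter's own code:

import numpy as np

def gradient(theta, X_batch, y_batch):
    # Gradient of the mean squared error for a linear model (illustrative loss)
    return 2 * X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            # Update rule: theta = theta - lr * gradient computed on the mini-batch
            theta -= lr * gradient(theta, X[batch], y[batch])
    return theta

Setting batch_size=1 recovers plain SGD, while setting it to the full dataset size recovers batch gradient descent.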

Next, we learned how momentum is used to reduce oscillations in the gradient steps and attain convergence faster. Following this, we covered Nesterov momentum, where, instead of calculating the gradient at the current position, we calculate it at the look-ahead position that the momentum step will take us to.
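A minimal sketch of the two update rules, assuming a grad_fn that returns the gradient at a given point (the names and the default values of lr and gamma are ours, not the chapter's):

# Momentum: accumulate an exponentially decaying velocity of past gradients
# and step along it; gamma is the momentum coefficient.
def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad_fn(theta)
    return theta - v, v

# Nesterov momentum: evaluate the gradient at the look-ahead point
# (theta - gamma * v) instead of at the current position.
def nesterov_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad_fn(theta - gamma * v)
    return theta - v, v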

We also learned about the Adagrad method, where we assign a low learning rate to parameters that receive frequent updates and a high learning rate to parameters that receive infrequent updates. Next, we learned about the Adadelta method, where we do away with the learning rate entirely and instead use exponentially decaying running averages of squared gradients and squared updates. We then learned about the Adam method, where we use bias-corrected first and second moment estimates of the gradients to update the parameters.
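The following one-step sketches summarize the three adaptive rules; the function names, state variables, and default constants (eps, rho, beta1, beta2) are typical values we have assumed rather than the chapter's listings:

import numpy as np

def adagrad_step(theta, grad, sq_sum, lr=0.01, eps=1e-8):
    # Accumulate squared gradients; frequently updated parameters get smaller steps
    sq_sum = sq_sum + grad ** 2
    return theta - lr * grad / (np.sqrt(sq_sum) + eps), sq_sum

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    # No explicit learning rate: the step size comes from running averages of
    # squared gradients and squared parameter updates
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    update = np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    return theta - update, avg_sq_grad, avg_sq_update

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Bias-corrected first (m) and second (v) moment estimates of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v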

Following this, we explored variants of Adam, such as Adamax, where we generalized the L2 norm used in Adam's second-moment estimate to the L-infinity norm, and AMSGrad, where we combated the problem of Adam reaching a suboptimal solution. At the end of this chapter, we learned about Nadam, where we incorporated Nesterov momentum into the Adam algorithm.
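For reference, minimal one-step sketches of Adamax and AMSGrad follow (Nadam is analogous to Adam, swapping a Nesterov-style look-ahead into the first-moment term); as before, the function names and default constants are assumptions, not the chapter's code:

import numpy as np

def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adamax: replace Adam's L2-based second moment with an infinity-norm running max
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))
    return theta - (lr / (1 - beta1 ** t)) * m / (u + eps), m, u

def amsgrad_step(theta, grad, m, v, v_max, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # AMSGrad: keep the maximum of past second-moment estimates so the effective
    # step size never grows, addressing Adam's suboptimal-convergence issue
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)
    return theta - lr * m / (np.sqrt(v_max) + eps), m, v, v_max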

In the next chapter, we will learn about one of the most widely used deep learning algorithms, the recurrent neural network (RNN), and how to use it to generate song lyrics.
