Adaptive learning rate – separate for each connection

In the preceding methods, the same learning rate is applied to every parameter update. With sparse data, however, we may want to update different parameters to different extents. Adaptive gradient descent algorithms, such as AdaGrad, AdaDelta, RMSprop, and Adam, provide an alternative to classical SGD by maintaining a separate learning rate for each parameter.
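To make the idea concrete, the following is a minimal sketch of a per-parameter update in the style of AdaGrad, using NumPy; the function name adagrad_update and the arguments cache, lr, and eps are illustrative choices, not part of any particular library.

```python
import numpy as np

def adagrad_update(params, grads, cache, lr=0.01, eps=1e-8):
    """One AdaGrad-style step (illustrative sketch).

    Each parameter accumulates its own sum of squared gradients in `cache`,
    so parameters with frequent, large gradients receive smaller effective
    learning rates, while rarely updated (sparse) parameters keep larger ones.
    """
    cache = cache + grads ** 2                        # per-parameter accumulation
    params = params - lr * grads / (np.sqrt(cache) + eps)
    return params, cache

# Toy usage: two parameters, one with much larger gradients than the other.
params = np.array([1.0, 1.0])
cache = np.zeros_like(params)
for _ in range(5):
    grads = np.array([10.0, 0.1])                     # dense vs. sparse-like gradient
    params, cache = adagrad_update(params, grads, cache)
print(params)  # the first parameter's effective step shrinks much faster
```

RMSprop, AdaDelta, and Adam refine this scheme by replacing the raw accumulation with exponentially decaying averages (and, in Adam's case, an additional momentum term), but the core idea of a per-parameter learning rate is the same.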
