Initialization heuristics

Let's consider a dense layer with m input units and n output units (a NumPy sketch of each heuristic follows the list):

  1. Sample each weight from the uniform distribution U(-1/√m, 1/√m).
  2. Glorot and Bengio suggested a normalized version of the uniform initialization, U(-√(6/(m+n)), √(6/(m+n))). It is designed to keep the variance of the activations and of the back-propagated gradients roughly the same in each layer, and is called Glorot uniform.
  3. Sample each weight from a normal distribution with a mean of 0 and a variance of 2/(m+n). This has the same variance as Glorot uniform and is called Glorot normal.
  4. For very large layers, the scaling factors above make individual weights extremely small. To fix that, an alternative is to give each unit only k non-zero incoming weights. This is called sparse initialization.
  5. Initialize the weights to be a random orthogonal matrix. Gram-Schmidt orthogonalization can be applied to an initial random weight matrix.
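
As a concrete illustration, the sketch below implements each heuristic with NumPy for an (m, n) weight matrix. The function names, the default k = 16 for sparse initialization, and the use of a QR decomposition in place of explicit Gram-Schmidt are illustrative choices, not part of the original text.

    import numpy as np

    rng = np.random.default_rng(0)

    def uniform_init(m, n):
        # Heuristic 1: U(-1/sqrt(m), 1/sqrt(m)), scaled by the number of inputs m.
        limit = 1.0 / np.sqrt(m)
        return rng.uniform(-limit, limit, size=(m, n))

    def glorot_uniform(m, n):
        # Heuristic 2: U(-sqrt(6/(m+n)), sqrt(6/(m+n))).
        limit = np.sqrt(6.0 / (m + n))
        return rng.uniform(-limit, limit, size=(m, n))

    def glorot_normal(m, n):
        # Heuristic 3: N(0, 2/(m+n)); same variance as Glorot uniform.
        std = np.sqrt(2.0 / (m + n))
        return rng.normal(0.0, std, size=(m, n))

    def sparse_init(m, n, k=16):
        # Heuristic 4: exactly k non-zero incoming weights per output unit
        # (k = 16 is an arbitrary illustrative choice).
        W = np.zeros((m, n))
        for j in range(n):
            idx = rng.choice(m, size=min(k, m), replace=False)
            W[idx, j] = rng.normal(0.0, 1.0, size=len(idx))
        return W

    def orthogonal_init(m, n):
        # Heuristic 5: a random (semi-)orthogonal matrix; QR factorization performs
        # the orthogonalization that Gram-Schmidt would do explicitly.
        A = rng.normal(size=(max(m, n), min(m, n)))
        Q, R = np.linalg.qr(A)
        Q = Q * np.sign(np.diag(R))  # fix the sign convention so results are deterministic
        return Q if Q.shape == (m, n) else Q.T

    W = glorot_uniform(256, 128)
    print(W.shape, W.std())  # standard deviation should be close to sqrt(2/(256+128)) ≈ 0.072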

Initialization schemes can also be treated as a hyperparameter of neural network training. If we have enough computing resources, we can evaluate several initialization schemes and choose the one that gives the best generalization performance and the fastest convergence.
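
If compute allows, this selection can be automated. The sketch below is one way to do it with Keras built-in initializers and a small synthetic dataset; the specific model, dataset, and candidate list are illustrative assumptions rather than recommendations from the text.

    import numpy as np
    import tensorflow as tf

    # Synthetic stand-in data; in practice use a real training/validation split.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(1000, 20)).astype("float32")
    y = (x.sum(axis=1) > 0).astype("int32")

    candidates = {
        "glorot_uniform": tf.keras.initializers.GlorotUniform(seed=0),
        "glorot_normal": tf.keras.initializers.GlorotNormal(seed=0),
        "orthogonal": tf.keras.initializers.Orthogonal(seed=0),
    }

    results = {}
    for name, init in candidates.items():
        # Same architecture and training budget for every candidate scheme.
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(20,)),
            tf.keras.layers.Dense(64, activation="relu", kernel_initializer=init),
            tf.keras.layers.Dense(1, activation="sigmoid", kernel_initializer=init),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        history = model.fit(x, y, validation_split=0.2, epochs=10, verbose=0)
        results[name] = max(history.history["val_accuracy"])

    best = max(results, key=results.get)
    print(results, "-> best:", best)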
