Weight Initialization for Deep Learning Layers
Choosing an initializer is an active area of research, and there is no definitive rule for picking one, but the following guidelines have emerged from experience.
A few points to keep in mind while initializing weights:
- Weights should be small (but not too small)
- Not all zero, and not all the same number
- Weights should have optimal variability
The parameters of a Normal distribution are its mean and standard deviation. For the Normal initializers below, the mean is centered at 0 and the standard deviation depends on the formula given for each technique.
The parameters of a Uniform distribution are its lower and upper limits. For the Uniform initializers below, the values range over [-limit, limit].
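As a quick sketch of these two parameterizations (using plain NumPy rather than Keras, with arbitrary placeholder values for the stddev and the limit):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Normal initializer: mean fixed at 0, stddev set by the chosen formula
stddev = 0.05  # placeholder value for illustration
normal_weights = rng.normal(loc=0.0, scale=stddev, size=(3, 4))

# Uniform initializer: values drawn from [-limit, limit]
limit = 0.05  # placeholder value for illustration
uniform_weights = rng.uniform(low=-limit, high=limit, size=(3, 4))

print(normal_weights.shape)  # (3, 4)
```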
Here are some weight initialization techniques.
1. GlorotNormal: Use it when the layer has a Sigmoid or Tanh activation unit. The formula for the standard deviation is stddev = sqrt(2 / (fan_in + fan_out)).
2. GlorotUniform: Use it when the layer has a Sigmoid or Tanh activation unit. The formula for the limit is limit = sqrt(6 / (fan_in + fan_out)).
3. HeNormal: Use it when the layer has a ReLU activation unit. The formula for the standard deviation is stddev = sqrt(2 / fan_in).
4. HeUniform: Use it when the layer has a ReLU activation unit. The formula for the limit is limit = sqrt(6 / fan_in).
5. LecunNormal: Commonly paired with the SELU activation unit. The formula for the standard deviation is stddev = sqrt(1 / fan_in).
6. LecunUniform: The formula for the limit is limit = sqrt(3 / fan_in).
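The formulas above are easy to evaluate directly. A small NumPy sketch, assuming an example layer with fan_in = 256 and fan_out = 128 (sizes chosen only for illustration):

```python
import numpy as np

# Example layer sizes, chosen only for illustration
fan_in, fan_out = 256, 128

glorot_normal_stddev = np.sqrt(2.0 / (fan_in + fan_out))  # GlorotNormal
glorot_uniform_limit = np.sqrt(6.0 / (fan_in + fan_out))  # GlorotUniform
he_normal_stddev = np.sqrt(2.0 / fan_in)                  # HeNormal
he_uniform_limit = np.sqrt(6.0 / fan_in)                  # HeUniform
lecun_normal_stddev = np.sqrt(1.0 / fan_in)               # LecunNormal
lecun_uniform_limit = np.sqrt(3.0 / fan_in)               # LecunUniform

print(round(glorot_normal_stddev, 4))  # 0.0722
```

Note that the uniform limit is always sqrt(3) times the corresponding normal stddev, which keeps the variance of the two distributions equal.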
7. RandomNormal: Generally, in classical ML models such as Logistic Regression, the weights are initialized randomly. The values can come from a random normal or a random uniform distribution.
# Standalone usage:
import tensorflow as tf
initializer = tf.keras.initializers.RandomNormal(mean=0., stddev=1.)
values = initializer(shape=(2, 2))
8. RandomUniform: The values are drawn uniformly from [minval, maxval].
initializer = tf.keras.initializers.RandomUniform(
    minval=-0.05, maxval=0.05, seed=None
)
values = initializer(shape=(2, 2))
9. Zeros: Initializes all weights to 0.
10. Ones: Initializes all weights to 1.
11. Constant: Initializes all weights to a given constant value.
12. Identity: Initializes the weights to the identity matrix (only usable for 2-D matrices).
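Any of these initializers can be passed to a Keras layer, either by its string name or as an initializer object. A minimal sketch (the layer sizes here are arbitrary):

```python
import tensorflow as tf

# Initializers can be given by string name ("he_normal") or as objects
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(10, activation="sigmoid",
                          kernel_initializer=tf.keras.initializers.GlorotNormal()),
])

print(model.layers[0].kernel.shape)  # (100, 64)
```

Matching the initializer to the activation (HeNormal with ReLU, GlorotNormal with Sigmoid) follows the guidelines listed above.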
Mistakes to avoid while initializing weights
Case 1: Zero or same-number initialization
If we initialize all the weights to zero, W^T x will be zero. Every unit then passes 0 forward, so there is no learning, and every weight receives the same gradient update.
Even if we initialize the weights to the same non-zero number, such as 1 or any other constant, the model will suffer from the problem of symmetry:
- No learning: all the neurons compute the same thing
- The same gradient update for every neuron
We want the model to learn different aspects of the data, so asymmetry is necessary. A lesson from ensemble models applies here: the more different the base models are, the better the output.
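A tiny NumPy sketch of the symmetry problem (a hand-rolled two-layer network with manual backprop; all sizes and values are arbitrary): when every weight starts at the same value, every hidden unit gets an identical gradient column, so they can never diverge.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(size=(5, 3))   # 5 samples, 3 features
y = rng.normal(size=(5, 1))   # regression targets

W1 = np.ones((3, 4))          # every hidden weight identical
W2 = np.ones((4, 1))

h = np.tanh(x @ W1)           # hidden activations: all columns identical
pred = h @ W2
err = pred - y

# Gradients via backprop (mean squared error)
dW2 = h.T @ err / len(x)
dh = err @ W2.T
dW1 = x.T @ (dh * (1 - h**2)) / len(x)

# Every hidden unit receives the same gradient column -- symmetry is never broken
print(np.allclose(dW1, dW1[:, :1]))  # True
```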
Case 2: Large negative numbers in the initialization
If the weights are large negative numbers, the pre-activations come out negative even for normalized data, and applying ReLU then causes a dead-activation issue because ReLU maps every negative input to zero.
In the case of Sigmoid or Tanh, large-magnitude inputs fall in the saturated region of the curve; the outputs are squashed into a narrow range and the gradients vanish.
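A short NumPy sketch of the dead-ReLU effect (shapes and values chosen only for illustration): with all-negative weights and non-negative inputs, every pre-activation is negative, so ReLU outputs zero everywhere and no gradient can flow.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = np.abs(rng.normal(size=(10, 3)))  # non-negative (e.g. min-max normalized) inputs

W = -np.abs(rng.normal(size=(3, 4)))  # all-negative weights (the bad initialization)

z = x @ W                             # every pre-activation is <= 0
relu_out = np.maximum(z, 0.0)         # ReLU zeroes out all of them

print(relu_out.max())  # 0.0 -- every unit is dead
```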