Weight Initialization for Deep Learning Layers
Choosing an initializer is an active area of research, and there is no definitive rule for picking one, but the following guidelines have emerged from experience.
A few points to keep in mind while initializing weights:
- Weights should be small (but not too small)
- Not all zero, and not all the same number
- Weights should have optimal variability
The parameters of a Normal distribution are its mean and standard deviation. For the Normal initializers below, the mean is centered at 0 and the standard deviation depends on the formula given for each technique.
The parameters of a Uniform distribution are its lower and upper limits. For the Uniform initializers below, the values range over [-limit, limit].
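As a quick sketch of these two parameterizations (using plain NumPy rather than Keras, with arbitrary placeholder values for the stddev and the limit):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Normal initializer: mean fixed at 0, stddev set by the chosen formula
stddev = 0.05  # placeholder value for illustration
normal_weights = rng.normal(loc=0.0, scale=stddev, size=(3, 4))

# Uniform initializer: values drawn from [-limit, limit]
limit = 0.05  # placeholder value for illustration
uniform_weights = rng.uniform(low=-limit, high=limit, size=(3, 4))

print(normal_weights.shape)  # (3, 4)
```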
Here are some weight initialization techniques.
1. GlorotNormal: Use it when the layer has a Sigmoid or Tanh activation unit. The formula for the standard deviation is stddev = sqrt(2 / (fan_in + fan_out)).
2. GlorotUniform: Use it when the layer has a Sigmoid or Tanh activation unit. The formula for the limit is limit = sqrt(6 / (fan_in + fan_out)).
3. HeNormal: Use it when the layer has a ReLU activation unit. The formula for the standard deviation is stddev = sqrt(2 / fan_in).
4. HeUniform: Use it when the layer has a ReLU activation unit. The formula for the limit is limit = sqrt(6 / fan_in).
5. LecunNormal: Commonly paired with the SELU activation unit. The formula for the standard deviation is stddev = sqrt(1 / fan_in).
6. LecunUniform: The formula for the limit is limit = sqrt(3 / fan_in).
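The formulas above are easy to evaluate directly. A small NumPy sketch, assuming an example layer with fan_in = 256 and fan_out = 128 (sizes chosen only for illustration):

```python
import numpy as np

# Example layer sizes, chosen only for illustration
fan_in, fan_out = 256, 128

glorot_normal_stddev = np.sqrt(2.0 / (fan_in + fan_out))  # GlorotNormal
glorot_uniform_limit = np.sqrt(6.0 / (fan_in + fan_out))  # GlorotUniform
he_normal_stddev = np.sqrt(2.0 / fan_in)                  # HeNormal
he_uniform_limit = np.sqrt(6.0 / fan_in)                  # HeUniform
lecun_normal_stddev = np.sqrt(1.0 / fan_in)               # LecunNormal
lecun_uniform_limit = np.sqrt(3.0 / fan_in)               # LecunUniform

print(round(glorot_normal_stddev, 4))  # 0.0722
```

Note that the uniform limit is always sqrt(3) times the corresponding normal stddev, which keeps the variance of the two distributions equal.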
7. RandomNormal: Generally, in classical ML models such as Logistic Regression, the weights are initialized randomly. The values can come from a random normal or a random uniform distribution.
# Standalone usage:
import tensorflow as tf
initializer = tf.keras.initializers.RandomNormal(mean=0., stddev=1.)
values = initializer(shape=(2, 2))
8. RandomUniform: The values are drawn uniformly from [minval, maxval].
initializer = tf.keras.initializers.RandomUniform(
    minval=-0.05, maxval=0.05, seed=None
)
values = initializer(shape=(2, 2))
9. Zeros: Initializes all weights to 0.
10. Ones: Initializes all weights to 1.
11. Constant: Initializes all weights to a given constant value.
12. Identity: Initializes the weights to the identity matrix (only usable for 2-D matrices).
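Any of these initializers can be passed to a Keras layer, either by its string name or as an initializer object. A minimal sketch (the layer sizes here are arbitrary):

```python
import tensorflow as tf

# Initializers can be given by string name ("he_normal") or as objects
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(10, activation="sigmoid",
                          kernel_initializer=tf.keras.initializers.GlorotNormal()),
])

print(model.layers[0].kernel.shape)  # (100, 64)
```

Matching the initializer to the activation (HeNormal with ReLU, GlorotNormal with Sigmoid) follows the guidelines listed above.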
Mistakes to avoid while initializing weights
Case 1: Zero or same-number initialization
If we initialize all the weights to zero, W^T x will be zero. Every unit then passes 0 forward, so there is no learning, and every weight receives the same gradient update.
Even if we initialize the weights to the same non-zero number, such as 1 or any other constant, the model will suffer from the problem of symmetry:
- No learning: all the neurons compute the same thing
- The same gradient update for every neuron
We want the model to learn different aspects of the data, so asymmetry is necessary. A lesson from ensemble models applies here: the more different the base models are, the better the output.
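A tiny NumPy sketch of the symmetry problem (a hand-rolled two-layer network with manual backprop; all sizes and values are arbitrary): when every weight starts at the same value, every hidden unit gets an identical gradient column, so they can never diverge.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(size=(5, 3))   # 5 samples, 3 features
y = rng.normal(size=(5, 1))   # regression targets

W1 = np.ones((3, 4))          # every hidden weight identical
W2 = np.ones((4, 1))

h = np.tanh(x @ W1)           # hidden activations: all columns identical
pred = h @ W2
err = pred - y

# Gradients via backprop (mean squared error)
dW2 = h.T @ err / len(x)
dh = err @ W2.T
dW1 = x.T @ (dh * (1 - h**2)) / len(x)

# Every hidden unit receives the same gradient column -- symmetry is never broken
print(np.allclose(dW1, dW1[:, :1]))  # True
```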
Case 2: Large negative numbers in the initialization
If the weights are large negative numbers, the pre-activations come out negative even for normalized data, and applying ReLU then causes a dead-activation issue because ReLU maps every negative input to zero.
In the case of Sigmoid or Tanh, large-magnitude inputs fall in the saturated region of the curve; the outputs are squashed into a narrow range and the gradients vanish.
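A short NumPy sketch of the dead-ReLU effect (shapes and values chosen only for illustration): with all-negative weights and non-negative inputs, every pre-activation is negative, so ReLU outputs zero everywhere and no gradient can flow.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = np.abs(rng.normal(size=(10, 3)))  # non-negative (e.g. min-max normalized) inputs

W = -np.abs(rng.normal(size=(3, 4)))  # all-negative weights (the bad initialization)

z = x @ W                             # every pre-activation is <= 0
relu_out = np.maximum(z, 0.0)         # ReLU zeroes out all of them

print(relu_out.max())  # 0.0 -- every unit is dead
```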