Weight Initialization of Deep Learning Layer

Chinmayi Sahu
Dec 30, 2020
*It's Layer, NOT Unit

Choosing an initializer is an active area of research, and there is no definitive rule for picking one. The following are guidelines that come from experience.

A few points to keep in mind while initializing weights:

  1. Weights should be small (but not too small)
  2. Not all zero or all the same number
  3. Weights should have optimal variability

We know the parameters of a Normal distribution are the mean and the standard deviation. For the Normal initializers, the mean is centered at 0 and the standard deviation depends on the formula given for each technique.

The parameters of a Uniform distribution are the lower and upper limits. For the Uniform initializers, values are drawn from the range [-limit, limit].
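Before going through the individual techniques, here is a minimal sketch of how an initializer is attached to a Keras layer through the kernel_initializer argument (the layer sizes and activations below are made up for illustration):

import tensorflow as tf

# Each Dense layer gets its own weight initializer
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="tanh", kernel_initializer="glorot_normal"),
    tf.keras.layers.Dense(32, activation="relu", kernel_initializer="he_uniform"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])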

Here are some weight initialization techniques.

1. GlorotNormal: Use it when the layer has a Sigmoid or Tanh activation unit. The formula for the standard deviation is:

stddev = sqrt(2 / (fan_in + fan_out))

where fan_in and fan_out are the number of input and output units of the layer. The standard deviation of the generated values (np.std(values)) comes out almost the same as this stddev.
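A quick sketch (shapes chosen arbitrarily) to check the formula against the sampled values:

import numpy as np
import tensorflow as tf

fan_in, fan_out = 300, 100
initializer = tf.keras.initializers.GlorotNormal(seed=42)
values = initializer(shape=(fan_in, fan_out)).numpy()

# Target stddev from the formula vs. empirical stddev of the samples
print(np.sqrt(2.0 / (fan_in + fan_out)))  # ~0.0707
print(np.std(values))                     # close to the target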

2. GlorotUniform: Use it when the layer has a Sigmoid or Tanh activation unit. The formula for the limit is:

limit = sqrt(6 / (fan_in + fan_out))
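A similar sketch (again with arbitrary shapes) to check that the sampled values stay inside [-limit, limit]:

import numpy as np
import tensorflow as tf

fan_in, fan_out = 300, 100
limit = np.sqrt(6.0 / (fan_in + fan_out))   # ~0.1225
values = tf.keras.initializers.GlorotUniform(seed=42)(shape=(fan_in, fan_out)).numpy()

# Minimum and maximum of the samples fall within [-limit, limit]
print(limit, values.min(), values.max())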

3. HeNormal: Use it when the layer has a ReLU activation unit. The formula for the standard deviation is:

stddev = sqrt(2 / fan_in)

4. HeUniform: Use it when the layer has a ReLU activation unit. The formula for the limit is:

limit = sqrt(6 / fan_in)
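A rough check of both He formulas (fan sizes are arbitrary):

import numpy as np
import tensorflow as tf

fan_in, fan_out = 512, 256

# HeNormal: target stddev = sqrt(2 / fan_in)
he_normal = tf.keras.initializers.HeNormal(seed=0)(shape=(fan_in, fan_out)).numpy()
print(np.sqrt(2.0 / fan_in), np.std(he_normal))

# HeUniform: samples stay inside [-limit, limit] with limit = sqrt(6 / fan_in)
he_uniform = tf.keras.initializers.HeUniform(seed=0)(shape=(fan_in, fan_out)).numpy()
print(np.sqrt(6.0 / fan_in), he_uniform.min(), he_uniform.max())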

5. LecunNormal: The formula for the standard deviation is:

stddev = sqrt(1 / fan_in)

6. LecunUniform: The formula for the limit is:

limit = sqrt(3 / fan_in)
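The same kind of check works for the Lecun variants (fan sizes are arbitrary):

import numpy as np
import tensorflow as tf

fan_in, fan_out = 512, 256

# LecunNormal: target stddev = sqrt(1 / fan_in)
lecun_normal = tf.keras.initializers.LecunNormal(seed=0)(shape=(fan_in, fan_out)).numpy()
print(np.sqrt(1.0 / fan_in), np.std(lecun_normal))

# LecunUniform: samples stay inside [-limit, limit] with limit = sqrt(3 / fan_in)
lecun_uniform = tf.keras.initializers.LecunUniform(seed=0)(shape=(fan_in, fan_out)).numpy()
print(np.sqrt(3.0 / fan_in), lecun_uniform.min(), lecun_uniform.max())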

7. RandomNormal: Generally, in classical ML models such as Logistic Regression, the weights are initialized randomly. The values can be drawn from a random normal or a random uniform distribution.

# Standalone usage:
import tensorflow as tf

initializer = tf.keras.initializers.RandomNormal(mean=0., stddev=1.)
values = initializer(shape=(2, 2))

8. RandomUniform

# Standalone usage:
initializer = tf.keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None)
values = initializer(shape=(2, 2))

9. Zeros

10. Ones

11. Constant

12. Identity
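A short sketch of these fixed-value initializers in standalone use (Identity only works for 2D shapes):

import tensorflow as tf

print(tf.keras.initializers.Zeros()(shape=(2, 3)).numpy())              # all 0s
print(tf.keras.initializers.Ones()(shape=(2, 3)).numpy())               # all 1s
print(tf.keras.initializers.Constant(value=0.5)(shape=(2, 3)).numpy())  # all 0.5
print(tf.keras.initializers.Identity()(shape=(3, 3)).numpy())           # identity matrix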

Mistakes to avoid while initializing weights

Case 1: Zero or same-number initialization

If we initialize all the weights to zero, W transpose X will be zero. This means we send a 0 input to every unit of the next layer, so there is no learning, and every weight receives the same gradient update.

Even if we initialize the weights to the same non-zero number, such as 1 or any other constant, the model will suffer from the problem of symmetry, i.e.

  1. No learning: all the neurons compute the same thing
  2. The same gradient update for every neuron

We want the model to learn different aspects of the data, so it is necessary to break this symmetry. A lesson from Ensemble models applies here: “The more different the base models are, the better the output.”
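A small sketch of the symmetry problem, using a made-up toy network whose layers are initialized with a constant (all ones): every hidden unit computes the same output and receives the identical gradient column, so the units never differentiate.

import tensorflow as tf

tf.random.set_seed(0)
x = tf.random.normal((8, 4))   # 8 samples, 4 features (dummy data)
y = tf.random.normal((8, 1))   # dummy targets

hidden = tf.keras.layers.Dense(3, activation="tanh",
                               kernel_initializer="ones",
                               bias_initializer="zeros")
out = tf.keras.layers.Dense(1, kernel_initializer="ones")

with tf.GradientTape() as tape:
    loss = tf.reduce_mean((out(hidden(x)) - y) ** 2)

# All three columns of the hidden-layer gradient are identical,
# so the hidden units stay identical after every update
print(tape.gradient(loss, hidden.kernel).numpy())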

Case 2: Large negative numbers in the initialization

With such weights, applying ReLU on normalized data leads to a dead-activation issue, because the negative pre-activations are all mapped to zero.

In the case of Sigmoid or tanh, it results in vanishing-gradient issues, because the large-magnitude pre-activations are squashed into the saturated ends of the output range, where the gradients are close to zero.
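A rough sketch of both failure modes, using a deliberately bad initializer with a large negative mean (the numbers are made up):

import tensorflow as tf

tf.random.set_seed(0)
x = tf.random.uniform((4, 10))   # "normalized" inputs in [0, 1)
bad_init = tf.keras.initializers.RandomNormal(mean=-5.0, stddev=0.1)

relu_layer = tf.keras.layers.Dense(6, activation="relu", kernel_initializer=bad_init)
sigmoid_layer = tf.keras.layers.Dense(6, activation="sigmoid", kernel_initializer=bad_init)

print(relu_layer(x).numpy())     # all zeros: dead activations, no gradient flows
print(sigmoid_layer(x).numpy())  # squashed close to 0: gradients vanish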

Reference

Applied AI Course
