I have the following code.
import keras

x = keras.layers.Input(batch_shape=(None, 4096))
hidden = keras.layers.Dense(512, activation='relu')(x)
hidden = keras.layers.BatchNormalization()(hidden)

Why does the BatchNormalization layer report 2048 parameters?
The batch normalization layer in Keras implements the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy, 2015).
As you can read there, to make batch normalization work during training, the layer needs to keep track of the distribution of each normalized dimension. Since you are in mode=0 by default, it computes 4 parameters per feature of the previous layer. These parameters make sure the information is properly propagated and backpropagated.
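Concretely, at inference time each of the 512 features is transformed using those four per-feature parameters. Here is a minimal numpy sketch of the transform described in the paper (illustrative names, not Keras's internal code):

import numpy as np

def batch_norm_inference(x, gamma, beta, moving_mean, moving_variance, eps=1e-3):
    # Normalize each feature with the tracked running statistics...
    x_hat = (x - moving_mean) / np.sqrt(moving_variance + eps)
    # ...then apply the learned per-feature scale (gamma) and shift (beta).
    return gamma * x_hat + beta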
So 4 * 512 = 2048, which should answer your question.
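You can confirm the count directly on the model from your question (a minimal sketch using the standard count_params() layer method):

import keras

x = keras.layers.Input(batch_shape=(None, 4096))
hidden = keras.layers.Dense(512, activation='relu')(x)
hidden = keras.layers.BatchNormalization()(hidden)
model = keras.models.Model(inputs=x, outputs=hidden)

bn = model.layers[-1]          # the BatchNormalization layer
print(bn.count_params())       # 2048 = 4 * 512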
These 2048 parameters are in fact [gamma weights, beta weights, moving_mean (non-trainable), moving_variance (non-trainable)], each having 512 elements (the size of the preceding Dense layer's output, which is what the batch normalization layer receives as input).
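Continuing the sketch above, bn.get_weights() returns numpy arrays in that same order (at least in recent Keras versions), so you can check the shapes yourself:

gamma, beta, moving_mean, moving_variance = bn.get_weights()
print(gamma.shape, beta.shape, moving_mean.shape, moving_variance.shape)
# -> (512,) (512,) (512,) (512,)

# gamma and beta are trainable; the moving statistics are not:
print(len(bn.trainable_weights), len(bn.non_trainable_weights))  # 2 2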