Batch normalization when batch size=1

无人共我 2021-01-21 19:16

What will happen when I use batch normalization but set batch_size = 1?

Because I am using 3D medical images as the training dataset, the batch size can only be 1.

1 answer
    小蘑菇 (original poster)
    2021-01-21 20:00

    "variance will be 0"

    No, it won't. BatchNormalization computes statistics with respect to a single axis only (usually the channels axis; axis=-1, the last one, by default). Every other axis is collapsed, i.e. reduced over when averaging, so even a single sample contributes many values per channel. Details below.

    More importantly, unless you can explicitly justify it, I advise against using BatchNormalization with batch_size=1. There are strong theoretical reasons against it, and multiple publications have shown BN performance degrading for batch_size under 32, and severely for <=8. In a nutshell, batch statistics "averaged" over a single sample vary greatly from sample to sample (high variance), and BN mechanisms don't work as intended.
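    To make the "high variance" point concrete, here is a minimal sketch in plain Python (standard library only). The toy data model (a per-sample intensity offset plus voxel noise), the constant VOXELS, and the helper names are assumptions made for illustration, not anything taken from Keras. It compares how much the pooled mean of one channel fluctuates from step to step for batch sizes 1 and 32.

```python
import random
from statistics import fmean, pstdev

random.seed(0)

VOXELS = 64  # hypothetical number of voxels per sample, for a single channel


def make_sample():
    # Assumed toy model: each sample has its own intensity offset plus voxel noise.
    offset = random.gauss(0.0, 1.0)
    return [offset + random.gauss(0.0, 1.0) for _ in range(VOXELS)]


def batch_mean(batch_size):
    # BN-style mean for one channel: pool every voxel of every sample in the batch.
    values = [v for _ in range(batch_size) for v in make_sample()]
    return fmean(values)


def spread_of_batch_means(batch_size, trials=200):
    # Standard deviation of the batch mean across many simulated training steps.
    return pstdev([batch_mean(batch_size) for _ in range(trials)])


spread_1 = spread_of_batch_means(1)
spread_32 = spread_of_batch_means(32)
print(f"batch_size=1:  spread of batch mean = {spread_1:.3f}")
print(f"batch_size=32: spread of batch mean = {spread_32:.3f}")
```

    Under this toy model the batch mean at batch_size=1 follows each sample's own offset, while at batch_size=32 the offsets largely average out, so the statistic used for normalization is much steadier.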

    Small mini-batch alternatives: Batch Renormalization, Layer Normalization, Weight Normalization


    Implementation details, from the source code:

    reduction_axes = list(range(len(input_shape)))
    del reduction_axes[self.axis]
    

    Eventually, tf.nn.moments is called with axes=reduction_axes, which averages over those axes (a reduce_mean) to compute the mean and variance. Then, in the TensorFlow backend, mean and variance are passed to tf.nn.batch_normalization to return train- or inference-normalized inputs.

    In other words, if your input is (batch_size, height, width, depth, channels), or (1, height, width, depth, channels), then BN computes its statistics over the batch (size 1), height, width, and depth dimensions, yielding one mean and one variance per channel.
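    The reduction can be sketched without TensorFlow. The snippet below is plain Python (standard library only) and is only an illustration: the tiny input shape, the random data, and the dictionary used to hold the tensor are assumptions. Only the two reduction_axes lines mirror the quoted source. It shows that with batch_size=1 each channel still pools 3*3*3 = 27 values, so its variance is not zero.

```python
import random
from itertools import product
from statistics import fmean, pvariance

random.seed(0)

# Hypothetical tiny input: (batch, height, width, depth, channels)
input_shape = (1, 3, 3, 3, 2)
axis = -1  # channels axis, the Keras default

# Same logic as the quoted snippet: reduce over every axis except `axis`.
reduction_axes = list(range(len(input_shape)))
del reduction_axes[axis]
print(reduction_axes)  # [0, 1, 2, 3]

# Store the tensor as {index tuple: value}, filled with random data.
x = {
    idx: random.gauss(0.0, 1.0)
    for idx in product(*(range(n) for n in input_shape))
}

channel_axis = axis % len(input_shape)
num_channels = input_shape[channel_axis]

# Group values by channel index; this collapses batch, height, width and depth.
per_channel = [
    [v for idx, v in x.items() if idx[channel_axis] == c]
    for c in range(num_channels)
]

# Population (biased) mean and variance per channel.
means = [fmean(vals) for vals in per_channel]
variances = [pvariance(vals) for vals in per_channel]

print(len(per_channel[0]))  # 27 values pooled per channel
print(variances)            # both entries are strictly positive
```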

    Can variance ever be zero? Yes, if every single data point in a given channel slice (along every dimension) is the same. But this should be near-impossible for real data.


    Other answers: the first one is misleading:

    "a small rational number is added (1e-19) to the variance"

    This doesn't happen when computing the variance; the term is added to the variance when normalizing. Even so, it is rarely necessary, since the variance is usually far from zero. Also, Keras actually defaults the epsilon term to 1e-3; it serves a regularizing role beyond merely avoiding division by zero.
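    As a rough illustration of where epsilon enters, here is a minimal sketch of the normalization step for one channel in plain Python. The function name normalize and the example values are hypothetical; gamma and beta stand in for BN's learned scale and offset, and epsilon=1e-3 follows the Keras default mentioned above.

```python
import math
from statistics import fmean, pvariance


def normalize(values, epsilon=1e-3, gamma=1.0, beta=0.0):
    # The variance itself is computed WITHOUT epsilon...
    mean = fmean(values)
    var = pvariance(values)
    # ...epsilon is only added inside the square root when normalizing.
    scale = gamma / math.sqrt(var + epsilon)
    return [(v - mean) * scale + beta for v in values], var


# Degenerate case: every value in the channel is identical, so variance is 0.
out_const, var_const = normalize([5.0] * 8)
print(var_const)  # 0.0
print(out_const)  # eight 0.0 values; epsilon=0 would divide by zero here

# Ordinary case: variance is far from zero, epsilon barely matters.
out_varied, var_varied = normalize([0.0, 1.0, 2.0, 3.0])
print(var_varied)  # 1.25
```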


    Update: I failed to address an important piece of intuition behind suspecting the variance to be 0. Indeed, the variance of the batch statistics is zero, since there is only one statistic; but the "statistic" itself is the mean and variance taken over each channel's spatial dimensions. In other words, the variance of the mean and variance is zero, but the mean and variance themselves aren't.
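    The distinction can be shown with a few made-up numbers. This is a small sketch in plain Python; the values in sample are arbitrary and stand for one channel of one sample, flattened.

```python
from statistics import fmean, pvariance

# One hypothetical sample, one channel, spatial values flattened.
sample = [0.2, 1.5, -0.7, 3.1, 0.9, -1.2]

# The statistics BN actually uses: mean and variance over the sample's values.
channel_mean = fmean(sample)
channel_var = pvariance(sample)

# With batch_size=1 there is only one per-sample mean in the batch,
# so the spread of that statistic across the batch is trivially zero...
per_sample_means = [channel_mean]
spread_across_batch = pvariance(per_sample_means)

# ...while the variance used for normalization is clearly nonzero.
print(spread_across_batch)  # zero
print(channel_var > 0)      # True
```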
