When I increase/decrease batch size of the mini-batch used in SGD, should I change learning rate? If so, then how?
For reference, this came up in a discussion with someone and we could not settle on an answer.
Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance of the gradient estimate constant. See page 5 of A. Krizhevsky, One weird trick for parallelizing convolutional neural networks: https://arxiv.org/abs/1404.5997
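As a rough illustration, the square-root rule could be wrapped in a small helper like the sketch below (the function and argument names are my own, not from any library):

```python
import math

def sqrt_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate by sqrt(k) when the batch size grows by a factor k.

    Illustrative sketch of the square-root rule discussed above; names and
    defaults are hypothetical, not taken from any framework.
    """
    k = new_batch_size / base_batch_size
    return base_lr * math.sqrt(k)

# Example: base lr 0.1 at batch size 256, moving to batch size 1024 (k = 4).
print(sqrt_scaled_lr(0.1, 256, 1024))  # 0.1 * sqrt(4) = 0.2
```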
However, recent experiments with large mini-batches suggest a simpler linear scaling rule, i.e., multiply your learning rate by k when using a mini-batch size of kN. See P. Goyal et al., Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: https://arxiv.org/abs/1706.02677
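Goyal et al. pair the linear rule with a gradual warmup over the first few epochs, ramping from the base learning rate up to the scaled one. A sketch of both ideas together (again, names and defaults are just illustrative):

```python
def linear_scaled_lr(base_lr, base_batch_size, new_batch_size,
                     epoch=None, warmup_epochs=5):
    """Linear scaling rule: multiply the base lr by k = new_batch_size / base_batch_size.

    If an epoch index is given and falls inside the warmup window, ramp
    linearly from base_lr toward the scaled lr, roughly as in Goyal et al.
    This is a hypothetical helper, not code from the paper.
    """
    k = new_batch_size / base_batch_size
    target_lr = base_lr * k
    if epoch is not None and epoch < warmup_epochs:
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

# Example: base lr 0.1 at batch size 256, now training with batch size 2048 (k = 8).
print(linear_scaled_lr(0.1, 256, 2048))           # 0.8 after warmup
print(linear_scaled_lr(0.1, 256, 2048, epoch=0))  # smaller lr during the first epoch
```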
I would say that with Adam, Adagrad, and other adaptive optimizers, the learning rate may remain the same if the batch size does not change substantially.