How should the learning rate change as the batch size changes?

孤城傲影  2021-01-29 23:49

When I increase or decrease the mini-batch size used in SGD, should I change the learning rate as well? If so, how?

For reference, I was discussing with someone, and it w

2 Answers
  野趣味 (OP)  2021-01-30 00:07

    Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance of the gradient estimate constant. See page 5 of A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks": https://arxiv.org/abs/1404.5997
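
    As a quick illustration of the sqrt(k) rule (a minimal sketch with made-up numbers, not taken from the paper): suppose the baseline learning rate is 0.1 at batch size 256 and the batch size grows to 1024.

        import math

        # Hypothetical baseline: lr = 0.1 at batch size 256, scaled up to batch size 1024.
        base_lr, base_bs, new_bs = 0.1, 256, 1024
        k = new_bs / base_bs                     # batch size multiplier, k = 4
        sqrt_scaled_lr = base_lr * math.sqrt(k)  # sqrt(k) rule: 0.1 * 2 = 0.2
        print(sqrt_scaled_lr)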

    However, recent experiments with large mini-batches suggest a simpler linear scaling rule: multiply the learning rate by k when using a mini-batch size of kN. See P. Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour": https://arxiv.org/abs/1706.02677
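
    Under the linear scaling rule, the same hypothetical change (again just a sketch, the numbers are made up) gives a larger step size:

        # Same hypothetical baseline: lr = 0.1 at batch size 256, scaled up to batch size 1024.
        base_lr, base_bs, new_bs = 0.1, 256, 1024
        k = new_bs / base_bs            # batch size multiplier, k = 4
        linear_scaled_lr = base_lr * k  # linear scaling rule: 0.1 * 4 = 0.4
        print(linear_scaled_lr)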

    I would say that with Adam, Adagrad, and other adaptive optimizers, the learning rate can remain the same if the batch size does not change substantially.
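
    A minimal PyTorch sketch of that heuristic (assuming PyTorch is installed; the model and data below are placeholders): the Adam learning rate is simply left at its previous value while only the DataLoader's batch_size changes.

        import torch
        from torch.utils.data import DataLoader, TensorDataset

        # Placeholder model and synthetic data, purely for illustration.
        model = torch.nn.Linear(10, 2)
        dataset = TensorDataset(torch.randn(512, 10), torch.randint(0, 2, (512,)))

        # Whether batch_size is 64 or, say, 256, the Adam learning rate is often
        # kept unchanged as long as the change in batch size is modest.
        loader = DataLoader(dataset, batch_size=64)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)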
