How should the learning rate change as the batch size change?

前端未结

关注

 2  1128

孤城傲影

When I increase/decrease batch size of the mini-batch used in SGD, should I change learning rate? If so, then how?

For reference, I was discussing with someone, and it w

相关标签:

2条回答

不思量自难忘°

2021-01-30 00:01

Apart from the papers mentioned in Dmytro's answer, you can refer to the article of: Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., & Storkey, A. (2018, October). Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio. The authors give the mathematical and empirical foundation to the idea that the ratio of learning rate to batch size influences the generalization capacity of DNN. They show that this ratio plays a major role in the width of the minima found by SGD. The higher ratio the wider is minima and better generalization.

0 讨论(0)
发布评论:

提交评论
- 加载中...
野趣味

2021-01-30 00:07

Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance in the gradient expectation constant. See page 5 at A. Krizhevsky. One weird trick for parallelizing convolutional neural networks: https://arxiv.org/abs/1404.5997

However, recent experiments with large mini-batches suggest for a simpler linear scaling rule, i.e multiply your learning rate by k when using mini-batch size of kN. See P.Goyal et al.: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour https://arxiv.org/abs/1706.02677

I would say that with using Adam, Adagrad and other adaptive optimizers, learning rate may remain the same if batch size does not change substantially.

0 讨论(0)
发布评论:

提交评论
- 加载中...