When I increase/decrease batch size of the mini-batch used in SGD, should I change learning rate? If so, then how?
For reference, this came up in a discussion with someone and we could not settle on an answer.
Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance of the gradient estimate constant. See page 5 of A. Krizhevsky, One weird trick for parallelizing convolutional neural networks: https://arxiv.org/abs/1404.5997
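As a rough illustration, the square-root rule could be wrapped in a small helper like the sketch below (the function and argument names are my own, not from any library):

```python
import math

def sqrt_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate by sqrt(k) when the batch size grows by a factor k.

    Illustrative sketch of the square-root rule discussed above; names and
    defaults are hypothetical, not taken from any framework.
    """
    k = new_batch_size / base_batch_size
    return base_lr * math.sqrt(k)

# Example: base lr 0.1 at batch size 256, moving to batch size 1024 (k = 4).
print(sqrt_scaled_lr(0.1, 256, 1024))  # 0.1 * sqrt(4) = 0.2
```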
However, recent experiments with large mini-batches suggest a simpler linear scaling rule, i.e., multiply your learning rate by k when using a mini-batch size of kN. See P. Goyal et al., Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: https://arxiv.org/abs/1706.02677
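Goyal et al. pair the linear rule with a gradual warmup over the first few epochs, ramping from the base learning rate up to the scaled one. A sketch of both ideas together (again, names and defaults are just illustrative):

```python
def linear_scaled_lr(base_lr, base_batch_size, new_batch_size,
                     epoch=None, warmup_epochs=5):
    """Linear scaling rule: multiply the base lr by k = new_batch_size / base_batch_size.

    If an epoch index is given and falls inside the warmup window, ramp
    linearly from base_lr toward the scaled lr, roughly as in Goyal et al.
    This is a hypothetical helper, not code from the paper.
    """
    k = new_batch_size / base_batch_size
    target_lr = base_lr * k
    if epoch is not None and epoch < warmup_epochs:
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

# Example: base lr 0.1 at batch size 256, now training with batch size 2048 (k = 8).
print(linear_scaled_lr(0.1, 256, 2048))           # 0.8 after warmup
print(linear_scaled_lr(0.1, 256, 2048, epoch=0))  # smaller lr during the first epoch
```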
I would say that with Adam, Adagrad, and other adaptive optimizers, the learning rate may remain the same if the batch size does not change substantially.