I've been seeing a very strange behavior when training a network, where after a couple of 100k iterations (8 to 10 hours) of learning fine, everything breaks and the training l…
Yes, this is a known problem with Adam.
The equations for Adam are
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
where m is an exponential moving average of the gradient (an estimate of its mean) and v is an exponential moving average of the squared gradients. The problem is that when you have been training for a long time and are close to the optimum, v can become very small. If the gradients then suddenly start increasing again, the update is divided by a very small number and explodes.
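For reference, here is the same update written as a runnable NumPy sketch (the names simply mirror the pseudocode above; this is an illustration, not any framework's actual implementation):

import numpy as np

def adam_step(variable, g, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam update, mirroring the pseudocode above.
    t = t + 1
    lr_t = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * g        # EMA of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # EMA of the squared gradient
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t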
By default beta1=0.9 and beta2=0.999, so m changes much more quickly than v. As a result, m can become large again while v is still small and cannot catch up.
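Here is a toy numeric illustration (made-up values: m starts at 0 and v at 1e-12, as after a long stretch of near-zero gradients, then a gradient of 1e-3 reappears for a few steps):

import numpy as np

beta1, beta2 = 0.9, 0.999
m, v = 0.0, 1e-12        # state after a long quiet stretch of training
g = 1e-3                 # gradients suddenly pick up again

for _ in range(5):       # a few steps with the "new" gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g

for epsilon in (1e-8, 1e-2):
    print(epsilon, m / (np.sqrt(v) + epsilon))
# epsilon=1e-8: ratio of about 5.8, i.e. a step several times the learning rate
# epsilon=1e-2: ratio of about 0.04, the blow-up is capped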
To remedy this problem you can increase epsilon, which is 1e-8 by default, so the denominator never gets close to zero. Depending on your network, a value of epsilon of 0.1, 0.01, or 0.001 might work well.
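How you pass epsilon in depends on your framework. As a sketch, assuming TensorFlow/Keras (in PyTorch the same value goes to torch.optim.Adam's eps argument):

import tensorflow as tf

# Raise epsilon well above the default to keep the denominator away from zero.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-2)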