Adam optimizer goes haywire after 200k batches, training loss grows

2021-01-30 05:15

I've been seeing very strange behavior when training a network: after a couple of hundred thousand iterations (8 to 10 hours) of learning fine, everything breaks and the training loss grows.

2 Answers
  • 2021-01-30 05:56

    Yes, this is a known problem with Adam.

    The equations for Adam are

    t <- t + 1
    lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
    
    m_t <- beta1 * m_{t-1} + (1 - beta1) * g
    v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
    variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
    

    where m is an exponential moving average of the gradient (its first moment) and v is an exponential moving average of the squared gradients. The problem is that after you have been training for a long time and are close to the optimum, v can become very small. If the gradients then suddenly start increasing again, the update divides a growing m by that very small v and explodes.

    By default beta1 = 0.9 and beta2 = 0.999, so m adapts much more quickly than v: m can become large again while v is still small and cannot catch up. The toy example below makes this concrete.
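
    To see the failure mode with concrete numbers, here is a toy sketch (plain Python; the gradient values are made up for illustration) of the update ratio when v has decayed to almost nothing and a large gradient suddenly arrives:

    beta1, beta2, eps = 0.9, 0.999, 1e-8
    m, v = 0.0, 1e-12                    # after a long quiet stretch near the optimum
    g = 0.1                              # a sudden large gradient
    m = beta1 * m + (1 - beta1) * g      # m jumps to 0.01 in a single step
    v = beta2 * v + (1 - beta2) * g * g  # v only reaches about 1e-5
    print(m / (v ** 0.5 + eps))          # ~3.16, versus ~1 once v has caught up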

    To remedy this problem you can increase epsilon, which is 10^-8 by default, so the denominator never gets close to zero. Depending on your network, an epsilon of 0.1, 0.01, or 0.001 might work well.
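
    As a concrete illustration of this fix, a minimal sketch assuming tf.keras (in PyTorch the corresponding argument to torch.optim.Adam is eps; learning_rate=1e-4 here is just a placeholder):

    import tensorflow as tf

    # Default epsilon is 1e-8; a larger value caps the update
    # m_t / (sqrt(v_t) + epsilon) even when v_t has decayed to near zero.
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, epsilon=1e-3)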

  • 2021-01-30 06:08

    Yes, this could be some sort of complicated numerical-instability case, but most likely your learning rate is simply too high: your loss decreases quickly until about 25k iterations and then oscillates around the same level. Try decreasing the rate by a factor of 0.1 and see what happens; you should be able to reach an even lower loss value (see the sketch below).
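
    A minimal sketch of that change, assuming tf.keras (1e-4 is a placeholder for whatever rate you were actually using):

    import tensorflow as tf

    # If the original run used learning_rate=1e-4 (hypothetical value),
    # retry the same training with a 10x smaller rate:
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)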

    Keep exploring! :)
