Adam optimizer goes haywire after 200k batches, training loss grows

前端 未结 2 1188
爱一瞬间的悲伤
爱一瞬间的悲伤 2021-01-30 05:15

I\'ve been seeing a very strange behavior when training a network, where after a couple of 100k iterations (8 to 10 hours) of learning fine, everything breaks and the training l

2条回答
  •  [愿得一人]
    2021-01-30 05:56

    Yes. This is a known problem of Adam.

    The equations for Adam are

    t <- t + 1
    lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
    
    m_t <- beta1 * m_{t-1} + (1 - beta1) * g
    v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
    variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
    

    where m is an exponential moving average of the mean gradient and v is an exponential moving average of the squares of the gradients. The problem is that when you have been training for a long time, and are close to the optimal, then v can become very small. If then all of a sudden the gradients starts increasing again it will be divided by a very small number and explode.

    By default beta1=0.9 and beta2=0.999. So m changes much more quickly than v. So m can start being big again while v is still small and cannot catch up.

    To remedy to this problem you can increase epsilon which is 10-8 by default. Thus stopping the problem of dividing almost by 0. Depending on your network, a value of epsilon in 0.1, 0.01, or 0.001 might be good.

提交回复
热议问题