I've been seeing a very strange behavior when training a network, where after a couple of 100k iterations (8 to 10 hours) of learning fine, everything breaks and the training l…
Yes, this is a known problem with Adam.
The equations for Adam are
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
where m is an exponential moving average of the gradient (an estimate of its mean) and v is an exponential moving average of the squared gradients. The problem is that when you have been training for a long time and are close to the optimum, v can become very small. If the gradients then suddenly start increasing again, the update is divided by a very small number and explodes.
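For reference, here is the same update written as a runnable NumPy sketch (the names simply mirror the pseudocode above; this is an illustration, not any framework's actual implementation):

import numpy as np

def adam_step(variable, g, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam update, mirroring the pseudocode above.
    t = t + 1
    lr_t = learning_rate * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    m = beta1 * m + (1 - beta1) * g        # EMA of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # EMA of the squared gradient
    variable = variable - lr_t * m / (np.sqrt(v) + epsilon)
    return variable, m, v, t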
By default beta1=0.9 and beta2=0.999, so m changes much more quickly than v. As a result, m can become large again while v is still small and cannot catch up.
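Here is a toy numeric illustration (made-up values: m starts at 0 and v at 1e-12, as after a long stretch of near-zero gradients, then a gradient of 1e-3 reappears for a few steps):

import numpy as np

beta1, beta2 = 0.9, 0.999
m, v = 0.0, 1e-12        # state after a long quiet stretch of training
g = 1e-3                 # gradients suddenly pick up again

for _ in range(5):       # a few steps with the "new" gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g

for epsilon in (1e-8, 1e-2):
    print(epsilon, m / (np.sqrt(v) + epsilon))
# epsilon=1e-8: ratio of about 5.8, i.e. a step several times the learning rate
# epsilon=1e-2: ratio of about 0.04, the blow-up is capped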
To remedy this problem you can increase epsilon, which is 1e-8 by default, so the denominator never gets close to zero. Depending on your network, a value of epsilon of 0.1, 0.01, or 0.001 might work well.
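How you pass epsilon in depends on your framework. As a sketch, assuming TensorFlow/Keras (in PyTorch the same value goes to torch.optim.Adam's eps argument):

import tensorflow as tf

# Raise epsilon well above the default to keep the denominator away from zero.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-2)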