Why doesn't the Adadelta optimizer decay the learning rate?


Question


I have initialised an Adadelta optimizer in Keras (using Tensorflow backend) and assigned it to a model:

my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95)
my_model.compile(optimizer=my_adadelta, loss="binary_crossentropy")

During training, I am using a callback to print the learning rate after every epoch:

from keras.callbacks import Callback
from keras import backend as K

class LRPrintCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        print(K.eval(lr))

However, this prints the same (initial) learning rate after every epoch. The same thing happens if I initialize the optimizer like this:

my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, decay=0.95)

Am I doing something wrong in the initialization? Or is the learning rate actually changing, and I am simply not printing the right quantity?


Answer 1:


As discussed in a relevant GitHub thread, decay does not affect the variable lr itself; that variable only stores the initial value of the learning rate. To print the decayed value, you need to compute it explicitly yourself and store it in a separate variable, lr_with_decay; you can do so with the following callback:

from keras.callbacks import Callback
from keras import backend as K

class MyCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr                   # initial learning rate (stored, never updated)
        decay = self.model.optimizer.decay             # time-based decay factor
        iterations = self.model.optimizer.iterations   # number of update steps taken so far
        # Recompute the decayed learning rate the same way the optimizer does internally
        lr_with_decay = lr / (1. + decay * K.cast(iterations, K.dtype(decay)))
        print(K.eval(lr_with_decay))
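
For completeness, here is a minimal, self-contained sketch of how the callback could be wired into training. The toy model, the random data, and the decay=1e-4 value are illustrative placeholders, not part of the original question or answer:

import numpy as np
import keras

# Toy data and model, purely for illustration (reuses the MyCallback class above)
x = np.random.rand(128, 10)
y = np.random.randint(0, 2, size=(128, 1))

model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid", input_shape=(10,))])
opt = keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95, decay=1e-4)
model.compile(optimizer=opt, loss="binary_crossentropy")

# With decay > 0, the value printed by MyCallback decreases slightly after each epoch
model.fit(x, y, epochs=3, verbose=0, callbacks=[MyCallback()])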

as explained in the linked discussions. In fact, the specific code snippet suggested there, i.e.

lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))

comes directly from the underlying Keras source code for Adadelta.

As is clear from inspection of the linked source code, the parameter that decays the learning rate here is decay, not rho. Although the documentation also uses the word 'decay' when describing rho, that is a different kind of decay and has nothing to do with the learning rate:

rho: float >= 0. Adadelta decay factor, corresponding to fraction of gradient to keep at each time step.
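
To make the distinction concrete, here is a small numerical sketch (my own illustration, not part of the answer above): decay rescales the learning rate as a function of the iteration count, whereas rho only controls the exponential running average of squared gradients (and squared updates) inside the Adadelta update rule.

# Illustration only: where `decay` and `rho` enter the computation
lr0, decay, rho = 0.01, 1e-4, 0.95

# `decay` gives a time-based learning-rate schedule, lr_t = lr0 / (1 + decay * t)
for t in (0, 1000, 10000):
    print(t, lr0 / (1. + decay * t))   # 0.01, ~0.00909, 0.005

# `rho` only smooths the accumulated squared gradients:
# E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
avg_sq_grad, grad = 0.0, 0.5
avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
print(avg_sq_grad)                     # 0.0125 -- independent of the learning rate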



Source: https://stackoverflow.com/questions/61657393/why-doesnt-the-adadelta-optimizer-decay-the-learning-rate
