Why doesn't the Adadelta optimizer decay the learning rate?


Question


I have initialised an Adadelta optimizer in Keras (using Tensorflow backend) and assigned it to a model:

my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95)
my_model.compile(optimizer=my_adadelta, loss="binary_crossentropy")

During training, I am using a callback to print the learning rate after every epoch:

from keras.callbacks import Callback
from keras import backend as K

class LRPrintCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr
        print(K.eval(lr))

However, this prints the same (initial) learning rate after every epoch. The same thing happens if I initialize the optimizer like this:

my_adadelta = keras.optimizers.Adadelta(learning_rate=0.01, decay=0.95)

Am I doing something wrong in the initialization? Or is the learning rate actually changing, and I am simply not printing the right quantity?


Answer 1:


As discussed in a relevant GitHub thread, decay does not affect the variable lr itself; that variable only stores the initial value of the learning rate. To print the decayed value, you need to compute it explicitly yourself and store it in a separate variable, lr_with_decay; you can do so with the following callback:

from keras.callbacks import Callback
from keras import backend as K

class MyCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.lr                   # initial learning rate (stored, never updated)
        decay = self.model.optimizer.decay             # time-based decay factor
        iterations = self.model.optimizer.iterations   # number of update steps taken so far
        # Recompute the decayed learning rate the same way the optimizer does internally
        lr_with_decay = lr / (1. + decay * K.cast(iterations, K.dtype(decay)))
        print(K.eval(lr_with_decay))
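
For completeness, here is a minimal, self-contained sketch of how the callback could be wired into training. The toy model, the random data, and the decay=1e-4 value are illustrative placeholders, not part of the original question or answer:

import numpy as np
import keras

# Toy data and model, purely for illustration (reuses the MyCallback class above)
x = np.random.rand(128, 10)
y = np.random.randint(0, 2, size=(128, 1))

model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid", input_shape=(10,))])
opt = keras.optimizers.Adadelta(learning_rate=0.01, rho=0.95, decay=1e-4)
model.compile(optimizer=opt, loss="binary_crossentropy")

# With decay > 0, the value printed by MyCallback decreases slightly after each epoch
model.fit(x, y, epochs=3, verbose=0, callbacks=[MyCallback()])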

as explained in the linked discussions. In fact, the specific code snippet suggested there, i.e.

lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))

comes directly from the underlying Keras source code for Adadelta.

As is clear from inspection of the linked source code, the parameter that decays the learning rate here is decay, not rho. Although the documentation also uses the word 'decay' when describing rho, that is a different kind of decay and has nothing to do with the learning rate:

rho: float >= 0. Adadelta decay factor, corresponding to fraction of gradient to keep at each time step.
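
To make the distinction concrete, here is a small numerical sketch (my own illustration, not part of the answer above): decay rescales the learning rate as a function of the iteration count, whereas rho only controls the exponential running average of squared gradients (and squared updates) inside the Adadelta update rule.

# Illustration only: where `decay` and `rho` enter the computation
lr0, decay, rho = 0.01, 1e-4, 0.95

# `decay` gives a time-based learning-rate schedule, lr_t = lr0 / (1 + decay * t)
for t in (0, 1000, 10000):
    print(t, lr0 / (1. + decay * t))   # 0.01, ~0.00909, 0.005

# `rho` only smooths the accumulated squared gradients:
# E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
avg_sq_grad, grad = 0.0, 0.5
avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
print(avg_sq_grad)                     # 0.0125 -- independent of the learning rate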



Source: https://stackoverflow.com/questions/61657393/why-doesnt-the-adadelta-optimizer-decay-the-learning-rate
