I would like to see how the learning rate changes during training (print it out or create a summary and visualize it in tensorboard).
Here is a code snippet from what I
I was asking myself the exact same question, and wondering why it wouldn't change. Looking at the original paper (page 2), one sees that the self._lr stepsize (denoted alpha in the paper) is required by the algorithm but never updated. We also see that there is an alpha_t that is updated for every step t, and it should correspond to the self._lr_t attribute. But in fact, as you observe, evaluating the self._lr_t tensor at any point during training always returns the initial value, that is, _lr.
So your question, as I understood it, is how to get the alpha_t for TensorFlow's AdamOptimizer as described in section 2 of the paper and in the corresponding TF v1.2 API page:

alpha_t = alpha * sqrt(1 - beta_2^t) / (1 - beta_1^t)
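Just to get a feel for the numbers, here is a tiny plain-Python sanity check of that formula, assuming TF's default Adam hyperparameters (alpha=0.001, beta_1=0.9, beta_2=0.999); the values are purely illustrative:

alpha, beta1, beta2 = 0.001, 0.9, 0.999
for t in (1, 10, 100, 1000):
    alpha_t = alpha * (1 - beta2**t)**0.5 / (1 - beta1**t)
    print(t, alpha_t)
# alpha_t starts around 3.16e-4 and tends towards alpha as both beta powers decay to zero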
As you observed, the _lr_t tensor doesn't change throughout the training, which may lead to the false conclusion that the optimizer doesn't adapt (this can be easily tested by switching to the vanilla GradientDescentOptimizer with the same alpha). And, in fact, other values do change: a quick look at the optimizer's __dict__ shows the following keys: ['_epsilon_t', '_lr', '_beta1_t', '_lr_t', '_beta1', '_beta1_power', '_beta2', '_updated_lr', '_name', '_use_locking', '_beta2_t', '_beta2_power', '_epsilon', '_slots'].
By inspecting them throughout training, I noticed that only _beta1_power, _beta2_power and the _slots get updated.
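In case you want to reproduce that inspection, this is roughly what I mean (a sketch only: sess and train_op stand for whatever session and training op your code already has):

# run a few training steps and look at the optimizer's internals after each one
for step in range(5):
    sess.run(train_op)
    lr_t = optimizer._lr_t.eval(session=sess)                              # stays constant
    b1p, b2p = sess.run([optimizer._beta1_power, optimizer._beta2_power])  # these do change
    print(step, lr_t, b1p, b2p)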
Further inspecting the optimizer's code, in line 211, we see the following update:
update_beta1 = self._beta1_power.assign(
    self._beta1_power * self._beta1_t,
    use_locking=self._use_locking)
This basically means that _beta1_power, which is initialized with _beta1, gets multiplied by _beta1_t after every iteration (and _beta1_t itself is also initialized with _beta1).
But here comes the confusing part: _beta1_t and _beta2_t never get updated, so effectively they hold the initial values (_beta1 and _beta2) through the whole training, contradicting the notation of the paper in a similar fashion as _lr and _lr_t do. I guess this is for a reason, but I personally don't know why; in any case these are protected/private attributes of the implementation (they start with an underscore) and don't belong to the public interface (they may even change among TF versions).
So after this small background we can see that _beta1_power and _beta2_power are the original beta values exponentiated to the current training step, that is, the equivalent of the variables referred to as beta_1^t and beta_2^t in the paper. Going back to the definition of alpha_t in section 2 of the paper, we see that, with this information, it should be pretty straightforward to implement:
optimizer = tf.train.AdamOptimizer()
# rest of the graph...

# ... somewhere in your session
# note that a0 comes from a scalar, whereas bb1 and bb2 come from tensors and thus have to be evaluated
a0, bb1, bb2 = optimizer._lr, optimizer._beta1_power.eval(), optimizer._beta2_power.eval()
at = a0 * (1 - bb2)**0.5 / (1 - bb1)
print(at)
The variable at holds the alpha_t for the current training step.
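And since you mentioned TensorBoard: one option is to build alpha_t as a tensor in the graph and log it with a scalar summary. Note that this is just a sketch relying on the same private attributes (which only exist after minimize()/apply_gradients() has been called), not an official API:

# build alpha_t from the optimizer's internals and log it to TensorBoard
alpha_t = optimizer._lr_t * tf.sqrt(1. - optimizer._beta2_power) / (1. - optimizer._beta1_power)
alpha_summary = tf.summary.scalar('alpha_t', alpha_t)
writer = tf.summary.FileWriter('/tmp/logs', tf.get_default_graph())
# ... inside the training loop:
# summ, _ = sess.run([alpha_summary, train_op])
# writer.add_summary(summ, global_step=step)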
I couldn't find a cleaner way of getting this value by just using the optimizer's interface, but please let me know if one exists! I guess there is none, which actually puts into question the usefulness of plotting alpha_t, since it does not depend on the data.
Also, to complete this information, section 2 of the paper gives the formula for the weight updates, which is much more telling, but also more plot-intensive. For a very nice and good-looking implementation of that, you may want to take a look at this nice answer from the post that you linked.
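That said, if you just want a rough starting point, a minimal sketch along the lines of section 2, using the optimizer's 'm' and 'v' slots (again through private attributes, so treat it as illustrative only), could look like this:

# log the per-parameter Adam update from section 2 of the paper for every trainable variable
for var in tf.trainable_variables():
    m = optimizer.get_slot(var, 'm')           # biased first-moment estimate
    v = optimizer.get_slot(var, 'v')           # biased second-moment estimate
    m_hat = m / (1. - optimizer._beta1_power)  # bias-corrected moments
    v_hat = v / (1. - optimizer._beta2_power)
    update = optimizer._lr_t * m_hat / (tf.sqrt(v_hat) + optimizer._epsilon_t)
    tf.summary.histogram('adam_update/' + var.op.name, update)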
Hope it helps! Cheers,
Andres