Learning rate doesn't change for AdamOptimizer in TensorFlow

灰色年华 2021-02-04 20:07

I would like to see how the learning rate changes during training (print it out or create a summary and visualize it in tensorboard).

Here is a code snippet from what I

1 answer
    一向 (original poster)
    2021-02-04 21:02

    I was asking myself the exact same question and wondering why it wouldn't change. Looking at the original paper (page 2), one sees that the stepsize self._lr (denoted alpha in the paper) is required by the algorithm but never updated. There is also an alpha_t that is updated at every step t and should correspond to the self._lr_t attribute. But in fact, as you observe, evaluating the self._lr_t tensor at any point during training always returns the initial value, that is, _lr.

    So your question, as I understood it, is how to get the alpha_t for TensorFlow's AdamOptimizer as described in section 2 of the paper and in the corresponding TF v1.2 API page:

    alpha_t = alpha * sqrt(1 - beta_2^t) / (1 - beta_1^t)
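To make the formula concrete, here is a minimal plain-Python sketch (no TensorFlow involved). The function name adam_alpha_t is made up for illustration, and the hyperparameter values are an assumption: they are the defaults suggested in the paper, not values taken from the question.

```python
import math


def adam_alpha_t(alpha, beta1, beta2, t):
    """Effective stepsize alpha_t at training step t (t starts at 1)."""
    return alpha * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)


# Assumed hyperparameters (defaults suggested in the paper).
alpha, beta1, beta2 = 0.001, 0.9, 0.999

for t in (1, 10, 100, 1000):
    print(t, adam_alpha_t(alpha, beta1, beta2, t))
```

Only alpha, the two betas and the step counter t enter the computation.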

    BACKGROUND

    As you observed, the _lr_t tensor doesn't change throughout training, which may lead to the false conclusion that the optimizer doesn't adapt (this can easily be tested by switching to the vanilla GradientDescentOptimizer with the same alpha). In fact, other values do change: a quick look at the optimizer's __dict__ shows the following keys: ['_epsilon_t', '_lr', '_beta1_t', '_lr_t', '_beta1', '_beta1_power', '_beta2', '_updated_lr', '_name', '_use_locking', '_beta2_t', '_beta2_power', '_epsilon', '_slots'].

    By inspecting them through training, I noticed that only _beta1_power, _beta2_power and the _slots get updated.

    Further inspecting the optimizer's code, in line 211, we see the following update:

    update_beta1 = self._beta1_power.assign(
            self._beta1_power * self._beta1_t,
            use_locking=self._use_locking)
    

    This basically means that _beta1_power, which is initialized with _beta1, is multiplied by _beta1_t after every iteration; _beta1_t is itself initialized with _beta1.
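The accumulation can be mimicked in a few lines of plain Python. This is only a sketch of the behaviour described above, not the TensorFlow code: it assumes the multiplication happens exactly once after each training step, and the variable names are chosen for illustration.

```python
beta1 = 0.9            # assumed value, the default suggested in the paper

beta1_power = beta1    # initialised with beta1, like _beta1_power
history = []

for step in range(1, 6):
    # Value available while update number `step` is computed.
    history.append(beta1_power)
    # Mirrors the assign op: multiply by the constant beta1 after the step.
    beta1_power = beta1_power * beta1

for step, value in enumerate(history, start=1):
    print(step, value, beta1 ** step)
```

Under that assumption, the value seen by update number t is beta1 raised to the power t, and a value read after t completed steps already carries one extra factor of beta1.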

    But here comes the confusing part: _beta1_t and _beta2_t never get updated, so they effectively hold the initial values (_beta1 and _beta2) throughout training, contradicting the notation of the paper in the same way that _lr and _lr_t do. I guess there is a reason for this, but I personally don't know it. In any case, these are protected/private attributes of the implementation (they start with an underscore) and don't belong to the public interface (they may even change between TF versions).

    So after this short background we can see that _beta1_power and _beta2_power are the original beta values raised to the power of the current training step, that is, the equivalent of the quantities written beta_1^t and beta_2^t in the paper. Going back to the definition of alpha_t in section 2 of the paper, we see that, with this information, it should be pretty straightforward to implement:

    SOLUTION

    import tensorflow as tf  # TF 1.x API

    optimizer = tf.train.AdamOptimizer()
    # rest of the graph...

    # ... somewhere in your session
    # note that a0 is a plain Python scalar, whereas bb1 and bb2 come from
    # tensors and thus have to be evaluated
    a0 = optimizer._lr
    bb1, bb2 = optimizer._beta1_power.eval(), optimizer._beta2_power.eval()
    at = a0 * (1 - bb2) ** 0.5 / (1 - bb1)
    print(at)
    

    The variable at holds the alpha_t for the current training step.

    DISCLAIMER

    I couldn't find a cleaner way of getting this value using only the optimizer's public interface, but please let me know if one exists! I guess there is none, which actually calls into question the usefulness of plotting alpha_t, since it does not depend on the data.

    Also, to complete this information, section 2 of the paper gives the formula for the weight updates, which is much more telling but also more plot-intensive. For a nice implementation of that, you may want to take a look at the answer in the post that you linked.
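For reference, the weight update can be sketched for a single scalar parameter in plain Python. This is not the implementation from the linked answer and not TensorFlow's code; it is a minimal illustration of the update as I read it from section 2 of the paper, with a made-up function name (adam_step), a toy objective and the paper's suggested default hyperparameters as assumptions.

```python
import math


def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the step number (from 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment estimate
    alpha_t = alpha * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    theta = theta - alpha_t * m / (math.sqrt(v) + eps)
    return theta, m, v


# Toy objective (an assumption): f(theta) = theta**2, so the gradient is 2*theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
    print(t, theta)
```

Unlike alpha_t, the ratio m / sqrt(v) depends on the gradients, so it is the part of the update that reflects the data.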

    Hope it helps! Cheers,
    Andres
