Loss clipping in TensorFlow (on DeepMind's DQN)

Asked by 渐次进展, 2021-01-02 06:39

I am trying my own implementation of the DQN paper by DeepMind in TensorFlow and am running into difficulty with the clipping of the loss function.

Here is an excerpt from the paper:

4 Answers
  •  醉梦人生
    2021-01-02 07:13

    I suspect they mean that you should clip the gradient to [-1,1], not clip the loss function. Thus, you compute the gradient as usual, but then clip each component of the gradient to be in the range [-1,1] (so if it is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of using the unmodified gradient.
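
    In TensorFlow, that first approach would look roughly like the following. This is only a minimal sketch: `model`, `states`, and `targets` are hypothetical placeholders for whatever your DQN setup actually provides.

        import tensorflow as tf

        # Hypothetical stand-ins for the real DQN network and training batch.
        model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
        states = tf.random.normal([32, 8])
        targets = tf.random.normal([32, 4])

        optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)

        with tf.GradientTape() as tape:
            q_values = model(states)
            loss = tf.reduce_mean(tf.square(targets - q_values))  # ordinary squared loss

        grads = tape.gradient(loss, model.trainable_variables)
        # Clip every component of every gradient tensor to [-1, 1] before the update.
        clipped = [tf.clip_by_value(g, -1.0, 1.0) for g in grads]
        optimizer.apply_gradients(zip(clipped, model.trainable_variables))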

    Equivalently: Define a function f as follows:

    f(x) = x^2          if x in [-0.5,0.5]
    f(x) = |x| - 0.25   if x < -0.5 or x > 0.5
    

    Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest using f(s) as the loss function. This is a kind of hybrid between squared loss and absolute-value loss: it behaves like s^2 when s is small, but when |s| gets larger, it behaves like the absolute value |s|.
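
    If it helps to see that piecewise definition as code, here is a tiny sketch of f in TensorFlow (the name f just mirrors the notation above):

        import tensorflow as tf

        def f(x):
            # x^2 on [-0.5, 0.5], |x| - 0.25 outside; the two pieces meet at |x| = 0.5
            # (since 0.5^2 = 0.25 = 0.5 - 0.25), so f is continuous.
            return tf.where(tf.abs(x) <= 0.5, tf.square(x), tf.abs(x) - 0.25)

        s = tf.constant([-2.0, -0.5, 0.1, 0.5, 3.0])
        print(f(s).numpy())  # [1.75, 0.25, 0.01, 0.25, 2.75]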

    Notice that f has the nice property that its derivative always lies in the range [-1,1]:

    f'(x) = 2x    if x in [-0.5,0.5]
    f'(x) = +1    if x > +0.5
    f'(x) = -1    if x < -0.5
    

    Thus, when you take the gradient of this f-based loss function, the result will be the same as computing the gradient of a squared-loss and then clipping it.

    In effect, they are replacing the squared loss with a Huber loss: the function f is just two times the Huber loss for delta = 0.5.
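
    Both claims (that the gradient of f is the clipped gradient of x^2, and that f is two times the Huber loss with delta = 0.5) are easy to check numerically. Again just a sketch: it assumes the functional `tf.keras.losses.huber` from TF 2.x, and the extra trailing axis is only there so the per-element loss values come back unreduced.

        import tensorflow as tf

        def f(x):
            return tf.where(tf.abs(x) <= 0.5, tf.square(x), tf.abs(x) - 0.25)

        x = tf.constant([-3.0, -0.4, 0.0, 0.3, 2.0])

        # The derivative of f equals the clipped derivative of x^2.
        with tf.GradientTape() as tape:
            tape.watch(x)
            y = f(x)
        df = tape.gradient(y, x)                        # f'(x) at each point
        clipped = tf.clip_by_value(2.0 * x, -1.0, 1.0)  # clip(2x, -1, 1)
        print(df.numpy(), clipped.numpy())              # the same values

        # f is two times the Huber loss with delta = 0.5.
        huber = tf.keras.losses.huber(tf.zeros_like(x)[:, None], x[:, None], delta=0.5)
        print(f(x).numpy(), (2.0 * huber).numpy())      # the same values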

    Now the point is that the following two alternatives are equivalent:

    • Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1,1] before doing the update step of the gradient descent.

    • Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function and use it directly (unchanged) in the gradient descent.

    The former is easy to implement. The latter has nice properties (it improves stability; it's better than absolute-value loss because it avoids oscillating around the minimum). Because the two are equivalent, we get an easy-to-implement scheme that combines the simplicity of squared loss with the stability and robustness of the Huber loss.
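
    In TensorFlow terms, the second alternative just means swapping in the loss and doing an ordinary, unclipped update. A rough sketch, using the same hypothetical `model`, `states`, and `targets` as in the earlier snippet:

        import tensorflow as tf

        model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
        states = tf.random.normal([32, 8])
        targets = tf.random.normal([32, 4])
        optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)

        def f(x):  # the Huber-style loss from above
            return tf.where(tf.abs(x) <= 0.5, tf.square(x), tf.abs(x) - 0.25)

        with tf.GradientTape() as tape:
            q_values = model(states)
            loss = tf.reduce_mean(f(targets - q_values))  # Huber-style loss instead of squared loss

        grads = tape.gradient(loss, model.trainable_variables)
        # Use the gradient unchanged; no clipping step is needed here.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))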
