I am working on my own TensorFlow implementation of the DQN paper by DeepMind and am running into difficulty with clipping the loss function.
Here is the relevant excerpt from the paper: "We also found it helpful to clip the error term from the update r + \gamma \max_{a'} Q(s',a'; \theta_i^-) - Q(s,a; \theta_i) to be between -1 and 1."
No. They are actually talking about error clipping, not loss clipping; as far as I know the two terms refer to the same thing, which causes confusion. They DO NOT mean that the loss below -1 is clipped to -1 and the loss above +1 is clipped to +1, because that leads to zero gradients outside the error range [-1, 1], as you realized. Instead, they suggest using a linear loss in place of the quadratic loss for error values < -1 and error values > 1.
Compute the error value r + \gamma \max_{a'} Q(s',a'; \theta_i^-) - Q(s,a; \theta_i). If this error value is within the range [-1, 1], square it; if it is < -1, multiply it by -1; if it is > 1, leave it as it is. If you use this as the loss function, the gradients outside the interval [-1, 1] won't vanish.
In order to have a smooth compound loss function, you could also replace the squared loss outside the error range [-1, 1] with a first-order Taylor approximation at the border values -1 and 1. In this case, if e is your error value, you would square it if e \in [-1, 1]; if e < -1, replace it by -2e - 1; and if e > 1, replace it by 2e - 1.
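For concreteness, here is a minimal TensorFlow sketch of this Taylor-smoothed piecewise loss (my own illustration, not code from the paper or this answer); error is assumed to be the already-computed TD error tensor:

import tensorflow as tf

def piecewise_error_loss(error):
    # e^2 on [-1, 1]; the tangent lines 2e - 1 (for e > 1) and -2e - 1
    # (for e < -1) outside, which both equal 2|e| - 1, so the gradient
    # never vanishes for large errors.
    return tf.where(tf.abs(error) <= 1.0,
                    tf.square(error),
                    2.0 * tf.abs(error) - 1.0)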
I suspect they mean that you should clip the gradient to [-1,1], not clip the loss function. Thus, you compute the gradient as usual, but then clip each component of the gradient to be in the range [-1,1] (so if it is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of using the unmodified gradient.
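In TF 1.x style, that per-component gradient clipping could look like the following sketch (loss and the optimizer choice are my assumptions, not from the question):

import tensorflow as tf

# `loss` is assumed to be your scalar DQN loss tensor.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
grads_and_vars = optimizer.compute_gradients(loss)
# Clip every component of every gradient to [-1, 1] before the update.
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v)
           for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)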
Equivalently: define a function f as follows:
f(x) = x^2 if x in [-0.5,0.5]
f(x) = |x| - 0.25 if x < -0.5 or x > 0.5
Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest to use f(s) as the loss function. This is some kind of hybrid between squared loss and absolute-value loss: it behaves like s^2 when s is small, but when s gets larger, it behaves like the absolute value (|s|).
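As a quick plain-Python illustration of f (my own sketch, not from the answer):

def f(x):
    # Quadratic near zero, shifted absolute value in the tails; the two
    # pieces agree at x = +/-0.5 (both give 0.25), so f is continuous.
    if -0.5 <= x <= 0.5:
        return x * x
    return abs(x) - 0.25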
Notice that f has the nice property that its derivative is always in the range [-1, 1]:
f'(x) = 2x if x in [-0.5,0.5]
f'(x) = +1 if x > 0.5
f'(x) = -1 if x < -0.5
Thus, when you take the gradient of this f-based loss function, the result will be the same as computing the gradient of a squared loss and then clipping it.
Thus, what they're doing is effectively replacing the squared loss with a Huber loss. The function f is just two times the Huber loss for delta = 0.5.
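To verify this: the Huber loss with parameter delta is L_delta(x) = 0.5 x^2 for |x| <= delta, and delta * (|x| - 0.5 delta) otherwise. With delta = 0.5, doubling it gives 2 * 0.5 x^2 = x^2 on [-0.5, 0.5], and 2 * 0.5 * (|x| - 0.25) = |x| - 0.25 outside, which is exactly f.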
Now the point is that the following two alternatives are equivalent:
Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1, 1] before doing the update step of the gradient descent.
Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function and use it directly (unchanged) in the gradient descent.
The former is easy to implement. The latter has nice properties (it improves stability; it's better than the absolute-value loss because it avoids oscillating around the minimum). Because the two are equivalent, we get an easy-to-implement scheme that combines the simplicity of the squared loss with the stability and robustness of the Huber loss.
I suggest implementing the Huber loss function. Below is a Python/TensorFlow implementation.
import tensorflow as tf

def huber_loss(y_true, y_pred, max_grad=1.):
    """Calculates the Huber loss.

    Parameters
    ----------
    y_true: np.ndarray, tf.Tensor
        Target value.
    y_pred: np.ndarray, tf.Tensor
        Predicted value.
    max_grad: float, optional
        Positive floating point value. Represents the maximum possible
        gradient magnitude.

    Returns
    -------
    tf.Tensor
        The Huber loss.
    """
    err = tf.abs(y_true - y_pred, name='abs')
    mg = tf.constant(max_grad, name='max_grad')
    # Quadratic branch for small errors, linear branch for large ones.
    lin = mg * (err - 0.5 * mg)
    quad = 0.5 * err * err
    return tf.where(err < mg, quad, lin)
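For example, it could be wired into the training graph like this (a sketch; q_targets and q_predicted are hypothetical placeholder names, not from the original answer):

# Hypothetical placeholders for the TD targets and the network's Q-values.
q_targets = tf.placeholder(tf.float32, shape=[None])
q_predicted = tf.placeholder(tf.float32, shape=[None])
loss = tf.reduce_mean(huber_loss(q_targets, q_predicted, max_grad=1.))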
First of all, the code for the paper is available online, which constitutes an invaluable reference.
If you take a look at the code you will see that, in nql:getQUpdate (NeuralQLearner.lua, line 180), they clip the error term of the Q-learning function:
-- delta = r + (1-terminal) * gamma * max_a Q(s2, a) - Q(s, a)
if self.clip_delta then
    delta[delta:ge(self.clip_delta)] = self.clip_delta
    delta[delta:le(-self.clip_delta)] = -self.clip_delta
end
In TensorFlow, assuming the last layer of your neural network is called self.output, self.actions is a one-hot encoding of all actions, self.q_targets_ is a placeholder with the targets, and self.q is your computed Q:
# The loss function
one = tf.constant(1.0)
delta = self.q - self.q_targets_
absolute_delta = tf.abs(delta)
delta = tf.where(
    absolute_delta < one,
    tf.square(delta),
    tf.ones_like(delta)  # squared error: (-1)^2 = 1
)
Or, using tf.clip_by_value (for an implementation closer to the original):
delta = tf.clip_by_value(
    self.q - self.q_targets_,
    -1.0,
    +1.0
)
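Note that the Lua code uses the clipped delta directly as the gradient with respect to the Q-values, whereas squaring a clipped delta would make the loss flat (zero gradient) outside [-1, 1]. If you want to reproduce the original behavior exactly, one option (my own sketch, not from the original code) is to route the clipped error into the gradient with tf.stop_gradient:

delta = self.q - self.q_targets_
clipped_delta = tf.clip_by_value(delta, -1.0, +1.0)
# The stop_gradient factor is treated as a constant, so the gradient of
# this pseudo-loss w.r.t. self.q is proportional to the clipped delta.
loss = tf.reduce_mean(tf.stop_gradient(clipped_delta) * delta)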