I am attempting my own implementation of the DQN paper by DeepMind in TensorFlow and am running into difficulty with the clipping of the loss function.
Here is an excerpt
I suspect they mean that you should clip the gradient to [-1,1], not the loss function itself. In other words, you compute the gradient as usual, then clip each component of the gradient to the range [-1,1] (if a component is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1), and then you use the result in the gradient descent update step instead of the unmodified gradient.
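In TensorFlow 2 terms, that component-wise clipping could look roughly like the following sketch (the toy model, optimizer, and data here are placeholders of my own, not anything from the paper):

```python
import tensorflow as tf

# Placeholder network, optimizer and batch, just to make the snippet runnable.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
states = tf.random.normal([32, 4])
targets = tf.random.normal([32, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(targets - model(states)))  # squared loss

grads = tape.gradient(loss, model.trainable_variables)
# Clip every component of every gradient tensor to [-1, 1] before the update.
clipped = [tf.clip_by_value(g, -1.0, 1.0) for g in grads]
optimizer.apply_gradients(zip(clipped, model.trainable_variables))
```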
Equivalently: define a function f as follows:
f(x) = x^2 if x in [-0.5, 0.5]
f(x) = |x| - 0.25 if x < -0.5 or x > 0.5
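Written out in code, f is simply (a direct NumPy translation of the definition above):

```python
import numpy as np

def f(x):
    """Squared loss near zero, shifted absolute loss further out."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 0.5, x ** 2, np.abs(x) - 0.25)

print(f([-2.0, -0.5, 0.1, 0.5, 3.0]))  # [1.75 0.25 0.01 0.25 2.75]
```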
Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest using f(s) as the loss function. This is a kind of hybrid between squared loss and absolute-value loss: it behaves like s^2 when s is small, but when s gets larger, it behaves like the absolute value |s|.
Notice that f has the nice property that its derivative is always in the range [-1,1]:
f'(x) = 2x if x in [-0.5, 0.5]
f'(x) = +1 if x > +0.5
f'(x) = -1 if x < -0.5
Thus, when you take the gradient of this f-based loss function, the result is the same as computing the gradient of a squared loss and then clipping it.
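A quick numerical check of this claim, differentiating with respect to s itself (TensorFlow is used here only as a convenient autodiff tool):

```python
import tensorflow as tf

s = tf.linspace(-2.0, 2.0, 9)

# Gradient of the f-based loss with respect to s.
with tf.GradientTape() as tape:
    tape.watch(s)
    f_s = tf.where(tf.abs(s) <= 0.5, tf.square(s), tf.abs(s) - 0.25)
grad_f = tape.gradient(f_s, s)

# Gradient of the squared loss with respect to s, then clipped to [-1, 1].
with tf.GradientTape() as tape:
    tape.watch(s)
    sq = tf.square(s)
grad_sq_clipped = tf.clip_by_value(tape.gradient(sq, s), -1.0, 1.0)

print(grad_f.numpy())           # [-1. -1. -1. -1.  0.  1.  1.  1.  1.]
print(grad_sq_clipped.numpy())  # same values
```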
Thus, what they're doing is effectively replacing the squared loss with a Huber loss. The function f is just two times the Huber loss for delta = 0.5.
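(To see this, recall that the Huber loss with parameter delta is x^2/2 for |x| <= delta and delta*(|x| - delta/2) for |x| > delta; with delta = 0.5, doubling it gives x^2 for |x| <= 0.5 and |x| - 0.25 otherwise, which is exactly the function f above.)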
Now the point is that the following two alternatives are equivalent:
Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1,1] before doing the update step of the gradient descent.
Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function and use it unchanged (no clipping) in the gradient descent update.
The former is easy to implement. The latter has nice properties (it improves stability; it is better than an absolute-value loss because it avoids oscillating around the minimum). Because the two are equivalent, we get an easy-to-implement scheme that combines the simplicity of the squared loss with the stability and robustness of the Huber loss.
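A minimal TensorFlow 2 sketch of the two alternatives as described above (the model, data, and train_step function are placeholders of my own, with y - model(x) standing in for the complicated expression s):

```python
import tensorflow as tf

# Placeholder network and batch, only to make the two variants concrete.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
x = tf.random.normal([32, 4])
y = tf.random.normal([32, 1])

def train_step(use_huber):
    with tf.GradientTape() as tape:
        s = y - model(x)                       # the "complicated expression" s
        if use_huber:
            # Alternative 2: Huber-style loss f(s); gradient used unchanged.
            loss = tf.reduce_mean(tf.where(tf.abs(s) <= 0.5,
                                           tf.square(s), tf.abs(s) - 0.25))
        else:
            # Alternative 1: squared loss; gradient clipped below.
            loss = tf.reduce_mean(tf.square(s))
    grads = tape.gradient(loss, model.trainable_variables)
    if not use_huber:
        grads = [tf.clip_by_value(g, -1.0, 1.0) for g in grads]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

train_step(use_huber=False)  # squared loss + gradient clipping
train_step(use_huber=True)   # Huber-style loss, no clipping
```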