I am attempting my own implementation of the DQN paper by DeepMind in TensorFlow and am running into difficulty with the clipping of the loss function.
Here is an excerpt
I suspect they mean that you should clip the gradient to [-1,1], not the loss function itself. In other words, you compute the gradient as usual, then clip each component of the gradient to the range [-1,1] (if a component is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1), and then you use the result in the gradient descent update step instead of the unmodified gradient.
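In TensorFlow 2 terms, that component-wise clipping could look roughly like the following sketch (the toy model, optimizer, and data here are placeholders of my own, not anything from the paper):

```python
import tensorflow as tf

# Placeholder network, optimizer and batch, just to make the snippet runnable.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
states = tf.random.normal([32, 4])
targets = tf.random.normal([32, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(targets - model(states)))  # squared loss

grads = tape.gradient(loss, model.trainable_variables)
# Clip every component of every gradient tensor to [-1, 1] before the update.
clipped = [tf.clip_by_value(g, -1.0, 1.0) for g in grads]
optimizer.apply_gradients(zip(clipped, model.trainable_variables))
```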
Equivalently: define a function f as follows:
f(x) = x^2 if x in [-0.5, 0.5]
f(x) = |x| - 0.25 if x < -0.5 or x > 0.5
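Written out in code, f is simply (a direct NumPy translation of the definition above):

```python
import numpy as np

def f(x):
    """Squared loss near zero, shifted absolute loss further out."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 0.5, x ** 2, np.abs(x) - 0.25)

print(f([-2.0, -0.5, 0.1, 0.5, 3.0]))  # [1.75 0.25 0.01 0.25 2.75]
```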
Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest using f(s) as the loss function. This is a kind of hybrid between squared loss and absolute-value loss: it behaves like s^2 when s is small, but when s gets larger, it behaves like the absolute value |s|.
Notice that f has the nice property that its derivative is always in the range [-1,1]:
f'(x) = 2x if x in [-0.5, 0.5]
f'(x) = +1 if x > +0.5
f'(x) = -1 if x < -0.5
Thus, when you take the gradient of this f-based loss function, the result is the same as computing the gradient of a squared loss and then clipping it.
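A quick numerical check of this claim, differentiating with respect to s itself (TensorFlow is used here only as a convenient autodiff tool):

```python
import tensorflow as tf

s = tf.linspace(-2.0, 2.0, 9)

# Gradient of the f-based loss with respect to s.
with tf.GradientTape() as tape:
    tape.watch(s)
    f_s = tf.where(tf.abs(s) <= 0.5, tf.square(s), tf.abs(s) - 0.25)
grad_f = tape.gradient(f_s, s)

# Gradient of the squared loss with respect to s, then clipped to [-1, 1].
with tf.GradientTape() as tape:
    tape.watch(s)
    sq = tf.square(s)
grad_sq_clipped = tf.clip_by_value(tape.gradient(sq, s), -1.0, 1.0)

print(grad_f.numpy())           # [-1. -1. -1. -1.  0.  1.  1.  1.  1.]
print(grad_sq_clipped.numpy())  # same values
```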
Thus, what they're doing is effectively replacing the squared loss with a Huber loss. The function f is just two times the Huber loss for delta = 0.5.
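(To see this, recall that the Huber loss with parameter delta is x^2/2 for |x| <= delta and delta*(|x| - delta/2) for |x| > delta; with delta = 0.5, doubling it gives x^2 for |x| <= 0.5 and |x| - 0.25 otherwise, which is exactly the function f above.)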
Now the point is that the following two alternatives are equivalent:
Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1,1] before doing the update step of the gradient descent.
Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function and use it unchanged (no clipping) in the gradient descent update.
The former is easy to implement. The latter has nice properties (it improves stability; it is better than an absolute-value loss because it avoids oscillating around the minimum). Because the two are equivalent, we get an easy-to-implement scheme that combines the simplicity of the squared loss with the stability and robustness of the Huber loss.
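A minimal TensorFlow 2 sketch of the two alternatives as described above (the model, data, and train_step function are placeholders of my own, with y - model(x) standing in for the complicated expression s):

```python
import tensorflow as tf

# Placeholder network and batch, only to make the two variants concrete.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
x = tf.random.normal([32, 4])
y = tf.random.normal([32, 1])

def train_step(use_huber):
    with tf.GradientTape() as tape:
        s = y - model(x)                       # the "complicated expression" s
        if use_huber:
            # Alternative 2: Huber-style loss f(s); gradient used unchanged.
            loss = tf.reduce_mean(tf.where(tf.abs(s) <= 0.5,
                                           tf.square(s), tf.abs(s) - 0.25))
        else:
            # Alternative 1: squared loss; gradient clipped below.
            loss = tf.reduce_mean(tf.square(s))
    grads = tape.gradient(loss, model.trainable_variables)
    if not use_huber:
        grads = [tf.clip_by_value(g, -1.0, 1.0) for g in grads]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

train_step(use_huber=False)  # squared loss + gradient clipping
train_step(use_huber=True)   # Huber-style loss, no clipping
```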