I am working on my own TensorFlow implementation of the DQN paper by DeepMind and am running into difficulty with clipping the loss function.
Here is the relevant excerpt from the paper: "We also found it helpful to clip the error term from the update r + \gamma \max_{a'} Q(s',a'; \theta_i^-) - Q(s,a; \theta_i) to be between -1 and 1."
No. They are actually talking about error clipping, not loss clipping; as far as I know the two terms refer to the same thing, which causes confusion. They DO NOT mean that the loss below -1 is clipped to -1 and the loss above +1 is clipped to +1, because that leads to zero gradients outside the error range [-1, 1], as you realized. Instead, they suggest using a linear loss in place of the quadratic loss for error values < -1 and error values > 1.
Compute the error value r + \gamma \max_{a'} Q(s',a'; \theta_i^-) - Q(s,a; \theta_i). If this error value is within the range [-1, 1], square it; if it is < -1, multiply it by -1; if it is > 1, leave it as it is. If you use this as the loss function, the gradients outside the interval [-1, 1] won't vanish.
In order to have a smooth compound loss function, you could also replace the squared loss outside the error range [-1, 1] with a first-order Taylor approximation at the border values -1 and 1. In this case, if e is your error value, you would square it if e \in [-1, 1]; if e < -1, replace it by -2e - 1; and if e > 1, replace it by 2e - 1.
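For concreteness, here is a minimal TensorFlow sketch of this Taylor-smoothed piecewise loss (my own illustration, not code from the paper or this answer); error is assumed to be the already-computed TD error tensor:

import tensorflow as tf

def piecewise_error_loss(error):
    # e^2 on [-1, 1]; the tangent lines 2e - 1 (for e > 1) and -2e - 1
    # (for e < -1) outside, which both equal 2|e| - 1, so the gradient
    # never vanishes for large errors.
    return tf.where(tf.abs(error) <= 1.0,
                    tf.square(error),
                    2.0 * tf.abs(error) - 1.0)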
I suspect they mean that you should clip the gradient to [-1,1], not clip the loss function. Thus, you compute the gradient as usual, but then clip each component of the gradient to be in the range [-1,1] (so if it is larger than +1, you replace it with +1; if it is smaller than -1, you replace it with -1); and then you use the result in the gradient descent update step instead of using the unmodified gradient.
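In TF 1.x style, that per-component gradient clipping could look like the following sketch (loss and the optimizer choice are my assumptions, not from the question):

import tensorflow as tf

# `loss` is assumed to be your scalar DQN loss tensor.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
grads_and_vars = optimizer.compute_gradients(loss)
# Clip every component of every gradient to [-1, 1] before the update.
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v)
           for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)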
Equivalently: define a function f as follows:
f(x) = x^2 if x in [-0.5,0.5]
f(x) = |x| - 0.25 if x < -0.5 or x > 0.5
Instead of using something of the form s^2 as the loss function (where s is some complicated expression), they suggest to use f(s) as the loss function. This is some kind of hybrid between squared loss and absolute-value loss: it behaves like s^2 when s is small, but when s gets larger, it behaves like the absolute value (|s|).
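As a quick plain-Python illustration of f (my own sketch, not from the answer):

def f(x):
    # Quadratic near zero, shifted absolute value in the tails; the two
    # pieces agree at x = +/-0.5 (both give 0.25), so f is continuous.
    if -0.5 <= x <= 0.5:
        return x * x
    return abs(x) - 0.25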
Notice that f has the nice property that its derivative is always in the range [-1, 1]:
f'(x) = 2x if x in [-0.5,0.5]
f'(x) = +1 if x > 0.5
f'(x) = -1 if x < -0.5
Thus, when you take the gradient of this f-based loss function, the result will be the same as computing the gradient of a squared loss and then clipping it.
Thus, what they're doing is effectively replacing the squared loss with a Huber loss. The function f is just two times the Huber loss for delta = 0.5.
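To verify this: the Huber loss with parameter delta is L_delta(x) = 0.5 x^2 for |x| <= delta, and delta * (|x| - 0.5 delta) otherwise. With delta = 0.5, doubling it gives 2 * 0.5 x^2 = x^2 on [-0.5, 0.5], and 2 * 0.5 * (|x| - 0.25) = |x| - 0.25 outside, which is exactly f.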
Now the point is that the following two alternatives are equivalent:
Use a squared loss function. Compute the gradient of this loss function, but clip the gradient to [-1, 1] before doing the update step of the gradient descent.
Use a Huber loss function instead of a squared loss function. Compute the gradient of this loss function and use it directly (unchanged) in the gradient descent.
The former is easy to implement. The latter has nice properties (it improves stability; it's better than the absolute-value loss because it avoids oscillating around the minimum). Because the two are equivalent, we get an easy-to-implement scheme that combines the simplicity of the squared loss with the stability and robustness of the Huber loss.
I suggest implementing the Huber loss function. Below is a Python/TensorFlow implementation.
import tensorflow as tf

def huber_loss(y_true, y_pred, max_grad=1.):
    """Calculates the Huber loss.

    Parameters
    ----------
    y_true: np.ndarray, tf.Tensor
        Target value.
    y_pred: np.ndarray, tf.Tensor
        Predicted value.
    max_grad: float, optional
        Positive floating point value. Represents the maximum possible
        gradient magnitude.

    Returns
    -------
    tf.Tensor
        The Huber loss.
    """
    err = tf.abs(y_true - y_pred, name='abs')
    mg = tf.constant(max_grad, name='max_grad')
    # Quadratic branch for small errors, linear branch for large ones.
    lin = mg * (err - 0.5 * mg)
    quad = 0.5 * err * err
    return tf.where(err < mg, quad, lin)
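For example, it could be wired into the training graph like this (a sketch; q_targets and q_predicted are hypothetical placeholder names, not from the original answer):

# Hypothetical placeholders for the TD targets and the network's Q-values.
q_targets = tf.placeholder(tf.float32, shape=[None])
q_predicted = tf.placeholder(tf.float32, shape=[None])
loss = tf.reduce_mean(huber_loss(q_targets, q_predicted, max_grad=1.))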
First of all, the code for the paper is available online, which constitutes an invaluable reference.
If you take a look at the code you will see that, in nql:getQUpdate (NeuralQLearner.lua, line 180), they clip the error term of the Q-learning function:
-- delta = r + (1-terminal) * gamma * max_a Q(s2, a) - Q(s, a)
if self.clip_delta then
    delta[delta:ge(self.clip_delta)] = self.clip_delta
    delta[delta:le(-self.clip_delta)] = -self.clip_delta
end
In TensorFlow, assuming the last layer of your neural network is called self.output, self.actions is a one-hot encoding of all actions, self.q_targets_ is a placeholder with the targets, and self.q is your computed Q:
# The loss function
one = tf.constant(1.0)
delta = self.q - self.q_targets_
absolute_delta = tf.abs(delta)
delta = tf.where(
    absolute_delta < one,
    tf.square(delta),
    tf.ones_like(delta)  # squared error: (-1)^2 = 1
)
Or, using tf.clip_by_value (for an implementation closer to the original):
delta = tf.clip_by_value(
    self.q - self.q_targets_,
    -1.0,
    +1.0
)
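Note that the Lua code uses the clipped delta directly as the gradient with respect to the Q-values, whereas squaring a clipped delta would make the loss flat (zero gradient) outside [-1, 1]. If you want to reproduce the original behavior exactly, one option (my own sketch, not from the original code) is to route the clipped error into the gradient with tf.stop_gradient:

delta = self.q - self.q_targets_
clipped_delta = tf.clip_by_value(delta, -1.0, +1.0)
# The stop_gradient factor is treated as a constant, so the gradient of
# this pseudo-loss w.r.t. self.q is proportional to the clipped delta.
loss = tf.reduce_mean(tf.stop_gradient(clipped_delta) * delta)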