I am writing my own implementation of the DQN paper by DeepMind in TensorFlow and am running into difficulty with the clipping of the loss function.
Here is an excerpt
No. They actually talk about error clipping, not loss clipping; as far as I know the two terms refer to the same thing, but the wording leads to confusion. They DO NOT mean that a loss below -1 is clipped to -1 and a loss above +1 is clipped to +1, because, as you realized, that leads to zero gradients outside the error range [-1;1]. Instead, they suggest using a linear loss rather than a quadratic loss for error values < -1 and error values > 1.
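To make the vanishing-gradient problem concrete, here is a minimal pure-Python sketch. The names `clipped_squared_loss` and `numeric_grad` are made up for illustration, and it assumes the naive approach is "square the error, then clamp the result to at most 1"; it is not code from the paper.

```python
def clipped_squared_loss(e):
    """Squared error whose *value* is clamped to at most 1 (the naive reading)."""
    return min(e * e, 1.0)


def numeric_grad(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Inside [-1;1] the slope is about 2*e; outside it the loss is flat, so the slope is 0.
for e in (0.5, 2.0, -3.0):
    print(e, numeric_grad(clipped_squared_loss, e))
```

For errors outside [-1;1] the estimated slope is zero, so a gradient step would not change the parameters for exactly those samples with the largest errors.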
Compute the error value $e = r + \gamma \max_{a'} Q(s',a'; \theta_i^-) - Q(s,a; \theta_i)$. If this error value is within the range [-1;1], square it; if it is < -1, multiply it by -1; if it is > 1, leave it as it is. If you use this as the loss function, the gradients outside the interval [-1;1] won't vanish.
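The rule just described can be sketched for a single scalar error as follows. The function names are hypothetical, and the gradient is taken with respect to the error value itself, not the network parameters.

```python
def piecewise_loss(e):
    """Squared error inside [-1;1], absolute error outside."""
    if abs(e) <= 1.0:
        return e * e
    return abs(e)  # -e for e < -1, e for e > 1


def piecewise_loss_grad(e):
    """Derivative of piecewise_loss with respect to the error e."""
    if abs(e) <= 1.0:
        return 2.0 * e
    return 1.0 if e > 0 else -1.0  # constant, non-zero slope outside [-1;1]


for e in (-3.0, -0.5, 0.5, 2.0):
    print(e, piecewise_loss(e), piecewise_loss_grad(e))
```

Note that the loss itself is continuous at the borders (both branches give 1 there), but its slope jumps from 2 to 1 in magnitude when |e| crosses 1.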
To get a "smooth-looking" compound loss function, you could also replace the squared loss outside the error range [-1;1] with its first-order Taylor approximation at the border values -1 and 1. In this case, if e is your error value, you would square it if $e \in [-1;1]$, replace it by -2e-1 if e < -1, and replace it by 2e-1 if e > 1.
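A minimal sketch of this smooth variant, again for a scalar error and with hypothetical function names:

```python
def smooth_loss(e):
    """e**2 inside [-1;1]; first-order Taylor expansion of e**2 at the borders outside."""
    if abs(e) <= 1.0:
        return e * e
    return 2.0 * abs(e) - 1.0  # -2e-1 for e < -1, 2e-1 for e > 1


def smooth_loss_grad(e):
    """Derivative of smooth_loss with respect to the error e."""
    if abs(e) <= 1.0:
        return 2.0 * e
    return 2.0 if e > 0 else -2.0


def clip(x, lo=-1.0, hi=1.0):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


# The gradient is exactly 2 * clip(e, -1, 1): the error is clipped, not the loss.
for e in (-3.0, -0.5, 0.25, 1.0, 4.0):
    print(e, smooth_loss(e), smooth_loss_grad(e), 2.0 * clip(e))
```

Here both the loss and its slope are continuous at |e| = 1, and the derivative equals twice the error clipped to [-1;1], which is one way to read the term "error clipping". This function is also twice the Huber loss with threshold 1.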