Question
`tf.raw_ops.TanhGrad` says that `grad = dy * (1 - y*y)`, where `y = tanh(x)`. But I think that since `dy/dx = 1 - y*y`, where `y = tanh(x)`, grad should be `dy / (1 - y*y)`. Where am I wrong?
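For reference, the identity the documentation relies on, `d(tanh(x))/dx = 1 - tanh(x)^2`, can be checked numerically. A minimal sketch in plain Python (the test point `x = 0.7` and step `h` are arbitrary choices, not from the thread):

```python
import math

# Numerically check that d(tanh(x))/dx equals 1 - tanh(x)^2.
x = 0.7   # arbitrary test point
h = 1e-6  # finite-difference step

# Central difference approximation of the derivative of tanh at x.
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)

# Closed-form derivative from the identity above.
analytic = 1 - math.tanh(x) ** 2
```

The two values agree to well within floating-point tolerance.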
Answer 1:
An expression like `dy/dx` is mathematical notation for the derivative; it is not an actual fraction. It is meaningless to move `dy` or `dx` around individually as you would with a numerator and a denominator.

Mathematically, it is known that `d(tanh(x))/dx = 1 - (tanh(x))^2`. TensorFlow computes gradients "backwards" (what is called backpropagation, or, more generally, reverse-mode automatic differentiation). That means that, in general, we reach the computation of the gradient of `tanh(x)` only after the step where we compute the gradient of an "outer" function `g(tanh(x))`. Here `g` represents all the operations that are applied to the output of `tanh` to reach the value whose gradient is being computed. By the chain rule, the derivative of this composition is `d(g(tanh(x)))/dx = d(g(tanh(x)))/d(tanh(x)) * d(tanh(x))/dx`. The first factor, `d(g(tanh(x)))/d(tanh(x))`, is the gradient accumulated backwards up to `tanh`, that is, the derivative of all those later operations; this is the value of `dy` in the documentation of the function. Therefore, you only need to compute `d(tanh(x))/dx` (which is `1 - y * y`, because `y = tanh(x)`) and multiply it by the given `dy`. The resulting value is then propagated further back to the operation that produced the input `x` to `tanh` in the first place, where it becomes the `dy` value in that operation's gradient computation, and so on until the gradient sources are reached.
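The backward step described above can be sketched in plain Python without TensorFlow. Here `g(u) = u**2` is a hypothetical outer function (any differentiable `g` would do); `dy` plays the role of the upstream gradient that `TanhGrad` receives, and the result is compared against a numerical derivative of the full composition `g(tanh(x))`:

```python
import math

# Hypothetical outer function g(u) = u**2 applied to y = tanh(x).
x = 0.5
y = math.tanh(x)

# Reverse pass: dy is the gradient of g with respect to its input y.
dy = 2 * y               # d(g)/dy for g(u) = u**2
grad = dy * (1 - y * y)  # what TanhGrad computes: dy * d(tanh(x))/dx

# Numerical derivative of the whole composition g(tanh(x)) for comparison.
h = 1e-6
numeric = (math.tanh(x + h) ** 2 - math.tanh(x - h) ** 2) / (2 * h)
```

`grad` matches `numeric`, illustrating that multiplying the upstream `dy` by the local derivative `1 - y*y` yields the gradient of the whole composition, as the chain rule requires.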
Source: https://stackoverflow.com/questions/62634073/why-gradient-of-tanh-in-tensorflow-is-grad-dy-1-yy