For a neural networks library I implemented some activation functions and loss functions together with their derivatives. They can be combined arbitrarily, and the derivative at the output layer is obtained by chaining the loss derivative with the activation derivative.
It should be like this (x is the input to the softmax layer and dy is the delta coming from the loss above it):
# y is the softmax output, i.e. compute(x)
dx = y * dy                                    # elementwise product y_i * dy_i
s = dx.sum(axis=dx.ndim - 1, keepdims=True)    # sum over the softmax axis: sum_j y_j * dy_j
dx -= y * s                                    # dx_i = y_i * dy_i - y_i * sum_j y_j * dy_j
return dx
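For context, here is a minimal, self-contained sketch of how that delta could sit next to the forward pass, assuming NumPy; the class and method names simply mirror the compute/delta convention used in this answer and are not taken from your code:

import numpy as np

class Softmax:
    def compute(self, x):
        # subtract the max along the softmax axis for numerical stability
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def delta(self, x, dy):
        # multiply dy by the Jacobian of compute evaluated at x
        y = self.compute(x)
        dx = y * dy
        s = dx.sum(axis=dx.ndim - 1, keepdims=True)
        dx -= y * s
        return dx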
But the way you compute the error should be:
yact = activation.compute(x)      # forward pass: softmax output
ycost = cost.compute(yact)        # loss value for that output
# backward pass: feed the loss delta through the softmax Jacobian
dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue))
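As a quick sanity check (assuming the cost on top is cross-entropy, which the question does not state), chaining the deltas this way collapses to the familiar yact - ytrue:

import numpy as np

x = np.array([0.5, -1.0, 2.0])
ytrue = np.array([0.0, 0.0, 1.0])           # one-hot target

e = np.exp(x - x.max())
yact = e / e.sum()                          # softmax forward pass

dcost = -ytrue / yact                       # cross-entropy delta w.r.t. the softmax output

dx = yact * dcost                           # softmax delta, same steps as above
dsoftmax = dx - yact * dx.sum()

print(np.allclose(dsoftmax, yact - ytrue))  # True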
Explanation: Because the delta function is part of the backpropagation algorithm, its responsibility is to multiply the incoming error vector (dy in my code, outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what this Jacobian looks like for softmax [1] and then multiply it from the left by the vector dy, after a bit of algebra you will find that you get exactly what my Python code computes.
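To make that correspondence concrete, here is a small check (an illustration, not part of the library): build the softmax Jacobian explicitly and compare the matrix-vector product with the vectorized delta above.

import numpy as np

x = np.array([0.5, -1.0, 2.0])
dy = np.array([0.1, -0.3, 0.2])          # arbitrary incoming delta

e = np.exp(x - x.max())
y = e / e.sum()                          # softmax output

# explicit Jacobian of softmax: J[i, j] = y[i] * (delta_ij - y[j])
J = np.diag(y) - np.outer(y, y)

dx_explicit = J @ dy                     # multiply dy by the Jacobian (J is symmetric)

dx = y * dy                              # vectorized version from the code above
dx -= y * dx.sum()

print(np.allclose(dx_explicit, dx))      # True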
[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network