I am using a Softmax activation function in the last layer of a neural network, but I am having trouble implementing it in a numerically safe way.
A naive implementation (exponentiating the logits and normalizing directly) overflows as soon as the inputs get large.
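For example, a direct translation of the formula into NumPy (the function name here is just for illustration) already breaks for moderately large logits:

    import numpy as np

    def softmax_naive(z):
        """Direct translation of o_j = exp(z_j) / sum_i exp(z_i)."""
        e = np.exp(z)
        return e / e.sum()

    # exp(1000) overflows to inf, so inf / inf gives nan instead of probabilities
    print(softmax_naive(np.array([1000.0, 1001.0])))  # -> [nan nan]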
I know it's already answered, but I'll post a step-by-step derivation here anyway.
Work in log space:
z_j = w_j · x + b_j
o_j = exp(z_j) / sum_i{ exp(z_i) }
log o_j = z_j - log sum_i{ exp(z_i) }
Let m = max_i{ z_i } and use the log-sum-exp trick:
log o_j = z_j - log{ sum_i{ exp(z_i + m - m) } }
        = z_j - log{ sum_i{ exp(m) exp(z_i - m) } }
        = z_j - log{ exp(m) sum_i{ exp(z_i - m) } }
        = z_j - m - log{ sum_i{ exp(z_i - m) } }
The term exp(z_i - m) can underflow when m is much greater than z_i, but that's fine: it just means z_i contributes negligibly to the softmax output after normalization. The final result is:
o_j = exp(z_j - m - log{ sum_i{ exp(z_i - m) } })
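In code, the final formula looks like this (a minimal NumPy sketch; the function name is just illustrative):

    import numpy as np

    def stable_softmax(z):
        """Softmax via the log-sum-exp trick derived above:
        o_j = exp(z_j - m - log(sum_i exp(z_i - m))), with m = max_i z_i."""
        m = np.max(z)
        # every exponent is <= 0, so exp() cannot overflow
        log_sum = np.log(np.sum(np.exp(z - m)))
        log_o = z - m - log_sum
        return np.exp(log_o)

    print(stable_softmax(np.array([1000.0, 1001.0])))
    # -> [0.26894142 0.73105858], same as softmax([0., 1.])

Shifting every z_i by the maximum leaves the softmax unchanged, which is why the stable version agrees with the mathematical definition.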
First go to log scale, i.e. calculate log(y) instead of y. The log of the numerator is trivial. To calculate the log of the denominator, you can use the following 'trick': http://lingpipe-blog.com/2009/06/25/log-sum-of-exponentials/