I'm learning TensorFlow and deep learning, and experimenting with various kinds of activation functions.
I created a multi-layer FFNN for the MNIST problem. Mostly based on the
You are using the ReLU activation function, which computes the activation as follows:
max(features, 0)
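For example, in TensorFlow `tf.nn.relu` is just that element-wise maximum. Here is a minimal sketch with made-up values:

```python
import tensorflow as tf

# tf.nn.relu is the element-wise max(features, 0); the input values are made up.
features = tf.constant([-2.0, -0.5, 0.0, 1.5, 3.0])
print(tf.nn.relu(features).numpy())       # [0.  0.  0.  1.5 3. ]
print(tf.maximum(features, 0.0).numpy())  # same result
```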
Since ReLU is unbounded above, its outputs can grow arbitrarily large, and this sometimes causes exploding gradients.
The gradient descent optimizer updates the weights via the following rule:
∆w_ij = −η · ∂E_i/∂w_ij
where η is the learning rate and ∂E_i/∂w_ij is the partial derivative of the loss with respect to the weight. As the activations grow larger and larger, the partial derivatives also grow, which causes the exploding gradient. Therefore, as you can see from the equation, you need to tune the learning rate (η) to keep the updates small.
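To make the update rule concrete, here is a rough sketch of a single gradient-descent step in TensorFlow 2; the tiny one-weight "model" and the values are invented for illustration:

```python
import tensorflow as tf

# Toy single-weight example (invented values) of w ← w − η · ∂E/∂w.
w = tf.Variable(0.5)
x, y_true = tf.constant(3.0), tf.constant(1.0)
eta = 0.001  # learning rate η

with tf.GradientTape() as tape:
    y_pred = tf.nn.relu(w * x)         # ReLU output is unbounded above
    loss = tf.square(y_pred - y_true)  # E

grad = tape.gradient(loss, w)          # ∂E/∂w
w.assign_sub(eta * grad)               # ∆w = −η · ∂E/∂w
```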
A common rule is to reduce the learning rate, usually by a factor of 10 each time.
In your case, set the learning rate to 0.001, which should improve the accuracy.
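With the Keras API that could look like the sketch below; the layer sizes and loss are assumptions, only the learning rate of 0.001 comes from this answer:

```python
import tensorflow as tf

# Assumed MNIST-style network; only the learning rate of 0.001 is taken from the answer.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),  # η = 0.001
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```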
Hope this helps.