I'm writing a neural-network classifier in TensorFlow/Python for the notMNIST dataset. I've implemented L2 regularization and dropout on the hidden layers. It works fine with a single hidden layer, but when I add another hidden layer the loss quickly blows up to NaN.
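Roughly, the setup looks like this (a minimal sketch using the tf.keras API; the layer sizes, regularization strength, and learning rate are placeholders, not my actual values):

```python
import tensorflow as tf

# Sketch of a notMNIST-style classifier with L2 regularization and
# dropout on the hidden layer (hyperparameters are illustrative only).
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),            # notMNIST images are 28x28
    tf.keras.layers.Dense(
        1024, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-3)),    # L2 penalty on the weights
    tf.keras.layers.Dropout(0.5),                               # dropout on the hidden layer
    tf.keras.layers.Dense(10),                                  # 10 letter classes (A-J)
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```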
I had the same problem, and reducing the batch size and the learning rate worked for me.
Turns out this was not so much a coding issue as a deep-learning issue. The extra layer made the gradients too unstable, which led to the loss quickly devolving into NaN. The best way to fix this is to use Xavier initialization; otherwise the variance of the initial weights tends to be too high, causing instability. Decreasing the learning rate may also help.
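As a sketch of what that change looks like (using the tf.keras names, where `GlorotUniform` is TensorFlow's Xavier initializer; the sizes and learning rate here are just examples):

```python
import tensorflow as tf

# Use Xavier/Glorot initialization so the initial weight variance is scaled
# by each layer's fan-in/fan-out instead of being an arbitrary fixed value.
xavier = tf.keras.initializers.GlorotUniform()

hidden = tf.keras.layers.Dense(
    1024, activation="relu",
    kernel_initializer=xavier,
    kernel_regularizer=tf.keras.regularizers.l2(1e-3),
)
logits = tf.keras.layers.Dense(10, kernel_initializer=xavier)

# A smaller learning rate also helps keep the deeper network's gradients stable.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.05)
```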