I'm learning TensorFlow and deep learning, and experimenting with various kinds of activation functions.
I created a multi-layer FFNN for the MNIST problem, based mostly on the tutorial from the official TensorFlow website, except that 3 hidden layers were added.
The activation functions I have experimented with are: tf.sigmoid, tf.nn.tanh, tf.nn.softsign, tf.nn.softmax and tf.nn.relu. Only tf.nn.relu doesn't converge; the network outputs random noise (testing accuracy is about 10%). The following is my source code:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Input: flattened 28x28 MNIST images.
x = tf.placeholder(tf.float32, [None, 784])

# Three hidden layers of 200 units each, weights and biases drawn
# from a standard normal distribution.
W0 = tf.Variable(tf.random_normal([784, 200]))
b0 = tf.Variable(tf.random_normal([200]))
hidden0 = tf.nn.relu(tf.matmul(x, W0) + b0)

W1 = tf.Variable(tf.random_normal([200, 200]))
b1 = tf.Variable(tf.random_normal([200]))
hidden1 = tf.nn.relu(tf.matmul(hidden0, W1) + b1)

W2 = tf.Variable(tf.random_normal([200, 200]))
b2 = tf.Variable(tf.random_normal([200]))
hidden2 = tf.nn.relu(tf.matmul(hidden1, W2) + b2)

# Output layer: raw logits for the 10 digit classes.
W3 = tf.Variable(tf.random_normal([200, 10]))
b3 = tf.Variable(tf.random_normal([10]))
y = tf.matmul(hidden2, W3) + b3

y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy)

# Build the accuracy ops once, outside the training loop, instead of
# adding new graph nodes on every evaluation.
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for step in range(10000):
        batch_xs, batch_ys = mnist.train.next_batch(128)
        session.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
        if step % 1000 == 0:
            print(step, session.run(accuracy, feed_dict={x: mnist.test.images,
                                                         y_: mnist.test.labels}))
    print('final:', session.run(accuracy, feed_dict={x: mnist.test.images,
                                                     y_: mnist.test.labels}))
The code outputs something like this:
0 0.098
1000 0.098
2000 0.098
3000 0.098
4000 0.098
5000 0.098
6000 0.098
7000 0.098
8000 0.098
9000 0.098
final: 0.098
If tf.nn.relu is replaced with any of the other activation functions, the network accuracy improves gradually (with different final accuracies, though), which is expected.
I have read in many textbooks/tutorials that ReLU should be the first candidate as an activation function.
My question is: why doesn't ReLU work in my network? Or is my program simply wrong?
You are using the ReLU activation function, which computes the activation as
max(features, 0)
Since this output is unbounded above (large values pass through unchanged), it sometimes causes exploding gradients.
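For reference, here is a minimal sketch (with made-up input values) of how tf.nn.relu clips negatives to zero but leaves large positive values untouched, while a bounded function such as tf.nn.tanh squashes everything into (-1, 1):

import tensorflow as tf

features = tf.constant([-3.0, -0.5, 0.0, 2.0, 50.0])
with tf.Session() as session:
    print(session.run(tf.nn.relu(features)))  # [ 0.  0.  0.  2. 50.] == max(features, 0)
    print(session.run(tf.nn.tanh(features)))  # roughly [-0.995 -0.462  0.  0.964  1.] -- bounded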
The gradient descent optimizer updates each weight via
Δw_ij = −η ∂E_i/∂w_ij
where η is the learning rate and ∂E_i/∂w_ij is the partial derivative of the loss with respect to the weight. As the activations get larger and larger, the partial derivatives also get larger, causing exploding gradients. Therefore, as you can see in the equation, you need to tune the learning rate (η) to overcome this situation.
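As a toy illustration of this update rule (a 1-D quadratic loss E(w) = w², not the MNIST network itself), the same step Δw = −η dE/dw converges for a small η but diverges once η is too large for the gradient magnitude:

def sgd_steps(w, eta, steps=5):
    # Applies Δw = −η dE/dw for E(w) = w**2, whose gradient is 2*w.
    trajectory = [w]
    for _ in range(steps):
        grad = 2.0 * w
        w = w - eta * grad
        trajectory.append(w)
    return trajectory

print(sgd_steps(1.0, eta=0.1))  # 1.0, 0.8, 0.64, ... shrinks toward 0 (converges)
print(sgd_steps(1.0, eta=1.5))  # 1.0, -2.0, 4.0, ... doubles in magnitude (diverges)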
A common rule is to reduce the learning rate, usually by a factor of 10 each time.
For your case, setting the learning rate to 0.001 will improve the accuracy.
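Concretely, the only change needed in the code above is the optimizer's learning rate:

train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)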
Hope this helps.
Source: https://stackoverflow.com/questions/47235290/in-simple-multi-layer-ffnn-only-relu-activation-function-doesnt-converge