Question
I am trying to implement a program to test TensorFlow performance on a GPU device. The test data is the MNIST dataset, trained with supervised learning using a multilayer perceptron (neural network). I followed this simple example, but I changed the number of batch-gradient training steps to 10000:
for i in range(10000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    if i % 500 == 0:
        print(i)
Eventually, I check the prediction accuracy using this code:
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
print(tf.convert_to_tensor(mnist.test.images).get_shape())
It turns out that the accuracy differs between CPU and GPU: the GPU returns an accuracy of approximately 0.9xx, while the CPU returns only 0.3xx. Does anyone know the reason, or why this issue can happen?
Answer 1:
There are two primary reasons for this kind of behavior (besides bugs).
Numerical stability
It turns out that adding numbers is not entirely as easy as it might seem. Let's say I want to add a trillion 2's together. The correct answer is two trillion. But if you add these together in floating point on a machine with a word size of only, say, 32 bits, after a while your answer will get stuck at a smaller value. The reason is that, after a while, the 2's that you're adding fall below the smallest bit of the mantissa of the floating-point sum.
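As a concrete illustration (a minimal NumPy sketch, not part of the original answer; NumPy is used here only for its explicit float32 type):

import numpy as np

# Once a float32 sum reaches 2**24, an addend of 1.0 falls below the
# precision of the mantissa and is silently dropped, so a long naive
# running sum gets "stuck" well short of the exact answer.
big = np.float32(16777216.0)     # 2**24
print(big + np.float32(1.0))     # 16777216.0 -- the 1.0 is lost
print(np.float64(big) + 1.0)     # 16777217.0 -- float64 still has room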
These kinds of issues abound in numerical computing, and this particular discrepancy is known in TensorFlow (1,2, to name a few). It's possible that you're seeing an effect of this.
Initial conditions
Training a neural net is a stochastic process, and as such it depends on your initial conditions. Sometimes, especially if your hyperparameters are not tuned very well, your net will get stuck near a poor local minimum, and you'll end up with mediocre behavior. Adjusting your optimizer parameters (or better, using an adaptive method like Adam) might help out here.
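For example, in a TF1-style graph like the one in the question, switching to Adam is essentially a one-line change. The sketch below assumes the softmax-regression graph from the MNIST tutorial and a loss tensor named cross_entropy; it is not the asker's exact code.

import tensorflow as tf  # assumes TensorFlow 1.x, as in the question

x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x, W) + b

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

# Adaptive optimizer instead of tf.train.GradientDescentOptimizer(0.5)
train_step = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(cross_entropy)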
Of course, with all that said, this is a fairly large difference, so I'd double check your results before blaming it on the underlying math package or bad luck.
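One way to make that double-check meaningful is to pin down the stochastic parts before comparing CPU and GPU runs, for instance by fixing the random seeds (a small sketch using the TF1 graph-level seed; GPU floating-point reductions may still differ slightly):

import numpy as np
import tensorflow as tf

# Fix the seeds so that any random weight initialization and the batch
# shuffling in mnist.train.next_batch are reproducible across runs.
np.random.seed(0)
tf.set_random_seed(0)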
Source: https://stackoverflow.com/questions/43221730/tensorflow-same-code-but-get-different-result-from-cpu-device-to-gpu-device