I'm working with TensorFlow and I want to speed up the prediction phase of a pre-trained Keras model (I'm not interested in the training phase) by using the CPU and GPU simultaneously.
Here's some code that demonstrates how CPU and GPU execution can be done in parallel:
import tensorflow as tf
import numpy as np
from time import time
from threading import Thread

n = 1024 * 8

# The CPU input has 1/16 the rows of the GPU input, so the CPU op does
# 1/16 as much work and the two take comparable wall-clock time.
data_cpu = np.random.uniform(size=[n // 16, n]).astype(np.float32)
data_gpu = np.random.uniform(size=[n, n]).astype(np.float32)

with tf.device('/cpu:0'):
    x = tf.placeholder(name='x', dtype=tf.float32)

def get_var(name):
    return tf.get_variable(name, shape=[n, n])

def op(name):
    # A chain of 8 matmuls against a device-local weight matrix.
    w = get_var(name)
    y = x
    for _ in range(8):
        y = tf.matmul(y, w)
    return y

with tf.device('/cpu:0'):
    cpu = op('w_cpu')

with tf.device('/gpu:0'):
    gpu = op('w_gpu')

def f(session, y, data):
    return session.run(y, feed_dict={x: data})

with tf.Session(config=tf.ConfigProto(log_device_placement=True,
                                      intra_op_parallelism_threads=8)) as sess:
    sess.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()
    threads = []
    # comment out 0 or 1 of the following 2 lines:
    threads += [Thread(target=f, args=(sess, cpu, data_cpu))]
    threads += [Thread(target=f, args=(sess, gpu, data_gpu))]

    t0 = time()
    for t in threads:
        t.start()
    coord.join(threads)
    t1 = time()

    print(t1 - t0)
The timing results are:

CPU thread: 4-5s (will vary by machine, of course).
GPU thread: 5s (it does 16x as much work).
Both at the same time: 5s.

Run serially, the two would take roughly 9-10s, so finishing both in 5s shows the CPU work is fully hidden behind the GPU work.
Note that there was no need for two separate sessions; a single session shared by both threads was enough (although using one session per device also worked for me).
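If you do want separate sessions, here is a minimal sketch of that variant (same graph as above; in TF1, sessions do not share variable values, so each one has to run the initializer itself):

sess_cpu = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess_gpu = tf.Session(config=tf.ConfigProto(log_device_placement=True))
for s in (sess_cpu, sess_gpu):
    # Each session holds its own copy of the variables.
    s.run(tf.global_variables_initializer())

threads = [Thread(target=f, args=(sess_cpu, cpu, data_cpu)),
           Thread(target=f, args=(sess_gpu, gpu, data_gpu))]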
If you're seeing different results, possible reasons include:

- contention for system resources (GPU execution still consumes some host resources, and if the CPU thread crowds it out, performance can suffer)
- incorrect timing, e.g. including one-time startup costs in the measurement (see the warm-up sketch after this list)
- parts of your model that can only run on the GPU or only on the CPU, forcing the two to serialize
- a bottleneck elsewhere, such as host-device data transfer
- some other problem
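On the timing point, one suggestion (not part of the measurement above) is to run each op once before starting the clock, since the first session.run() pays one-time costs such as memory allocation and kernel selection:

# Warm-up runs: keep one-time costs out of the measurement.
f(sess, cpu, data_cpu)
f(sess, gpu, data_gpu)

threads = [Thread(target=f, args=(sess, cpu, data_cpu)),
           Thread(target=f, args=(sess, gpu, data_gpu))]
t0 = time()
for t in threads:
    t.start()
coord.join(threads)
t1 = time()
print(t1 - t0)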
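Finally, since the question is about a pre-trained Keras model: the same pattern carries over by loading one copy of the model per device and giving each copy a slice of the input from its own thread. This is a hypothetical sketch, not tested code: the file name 'model.h5', the input shape, and the 70/30 split are all assumptions, and depending on your TF/Keras version you may need extra care with the default graph when calling predict() from worker threads.

from threading import Thread
import numpy as np
import tensorflow as tf
from tensorflow import keras

# One copy of the model per device (the file name is an assumption).
with tf.device('/cpu:0'):
    model_cpu = keras.models.load_model('model.h5')
with tf.device('/gpu:0'):
    model_gpu = keras.models.load_model('model.h5')

samples = np.random.uniform(size=[1000, 32]).astype(np.float32)
split = int(0.3 * len(samples))  # smaller share for the slower device

results = [None, None]
def predict(i, model, batch):
    results[i] = model.predict(batch)

threads = [Thread(target=predict, args=(0, model_cpu, samples[:split])),
           Thread(target=predict, args=(1, model_gpu, samples[split:]))]
for t in threads:
    t.start()
for t in threads:
    t.join()

predictions = np.concatenate(results)  # same order as the input samples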