Question
I am running a large distributed TensorFlow model on Google Cloud ML Engine, and I want to use machines with GPUs. My graph consists of two main parts: the input/data reader function and the computation part.
I wish to place the variables on the PS task, the input part on the CPU, and the computation part on the GPU.
The function tf.train.replica_device_setter automatically places variables on the PS server.
This is what my code looks like:
    with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
        input_tensors = model.input_fn(...)
        output_tensors = model.model_fn(input_tensors, ...)
Is it possible to use tf.device() together with replica_device_setter(), as in:
    with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
        with tf.device('/cpu:0'):
            input_tensors = model.input_fn(...)
        with tf.device('/gpu:0'):
            tensor_dict = model.model_fn(input_tensors, ...)
Will the replica_device_setter() be overridden, so that variables are no longer placed on the PS server?
Furthermore, since the device names in the cluster look something like job:master/replica:0/task:0/gpu:0, how do I tell TensorFlow tf.device(whatever/gpu:0)?
Answer 1:
Any operations other than variables inside the tf.train.replica_device_setter block are automatically pinned to "/job:worker", which defaults to the first device managed by the first task in the "worker" job.
You can pin them to another device (or task) by using a nested device block:
    import tensorflow as tf  # TensorFlow 1.x API

    with tf.device(tf.train.replica_device_setter(ps_tasks=2, ps_device="/job:ps",
                                                  worker_device="/job:worker")):
        v1 = tf.Variable(1., name="v1")  # pinned to /job:ps/task:0 (defaults to /cpu:0)
        v2 = tf.Variable(2., name="v2")  # pinned to /job:ps/task:1 (defaults to /cpu:0)
        v3 = tf.Variable(3., name="v3")  # pinned to /job:ps/task:0 (defaults to /cpu:0)
        s = v1 + v2                      # pinned to /job:worker (defaults to task:0/cpu:0)
        with tf.device("/task:1"):
            p1 = 2 * s                   # pinned to /job:worker/task:1 (defaults to /cpu:0)
            with tf.device("/cpu:0"):
                p2 = 3 * s               # pinned to /job:worker/task:1/cpu:0
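To make the merging rule behind those comments concrete, here is a small pure-Python sketch. It is a toy model, not TensorFlow's implementation: the names parse_spec, merge_specs and ToyReplicaSetter are hypothetical, and the sketch assumes only what the comments above show, namely that variables go round-robin over the PS tasks, other ops go to the worker device, and a nested partial device string fills in or overrides individual fields.

```python
FIELD_ORDER = ("job", "replica", "task", "device")


def parse_spec(spec):
    """Split a partial device string such as '/job:worker/cpu:0' into fields."""
    fields = {}
    for part in filter(None, spec.split("/")):
        key, _, value = part.partition(":")
        key = key.lower()
        if key in ("job", "replica", "task"):
            fields[key] = value
        elif key in ("cpu", "gpu"):      # short form, e.g. /gpu:0
            fields["device"] = f"{key}:{value}"
        elif key == "device":            # long form, e.g. /device:GPU:0
            fields["device"] = value.lower()
        else:
            raise ValueError(f"unrecognized device component: {part!r}")
    return fields


def format_spec(fields):
    """Turn a field dict back into a device string, in a fixed field order."""
    parts = []
    for key in FIELD_ORDER:
        if key in fields:
            parts.append(fields[key] if key == "device" else f"{key}:{fields[key]}")
    return "/" + "/".join(parts) if parts else ""


def merge_specs(outer, inner):
    """Fields set in `inner` win; anything it leaves out comes from `outer`."""
    merged = parse_spec(outer)
    merged.update(parse_spec(inner))
    return format_spec(merged)


class ToyReplicaSetter:
    """Toy stand-in (hypothetical) for the placement described in the answer."""

    def __init__(self, ps_tasks, ps_device="/job:ps", worker_device="/job:worker"):
        self.ps_tasks = ps_tasks
        self.ps_device = ps_device
        self.worker_device = worker_device
        self._next_ps_task = 0

    def place(self, is_variable, inner=""):
        if is_variable:
            # Variables are spread round-robin over the PS tasks.
            base = merge_specs(self.ps_device, f"/task:{self._next_ps_task}")
            self._next_ps_task = (self._next_ps_task + 1) % self.ps_tasks
        else:
            # Everything else starts from the worker device.
            base = self.worker_device
        return merge_specs(base, inner)


setter = ToyReplicaSetter(ps_tasks=2)
for name in ("v1", "v2", "v3"):
    print(name, setter.place(is_variable=True))
print("s ", setter.place(is_variable=False))
print("p1", setter.place(is_variable=False, inner="/task:1"))
print("p2", setter.place(is_variable=False, inner=merge_specs("/task:1", "/cpu:0")))
```

Under the same assumption, merge_specs("/job:master/replica:0/task:0", "/gpu:0") yields "/job:master/replica:0/task:0/gpu:0", which suggests that a bare "/gpu:0" in the nested block is enough and the job and task prefix does not need to be spelled out.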
Source: https://stackoverflow.com/questions/47791372/distributed-tensorflow-device-placement-in-google-cloud-ml-engine