The simple script below is launched with the args shown in its header. Its behaviour varies from run to run, but often one of the workers hangs and keeps printing "CreateSession still waiting for response from worker" messages.
By default, a distributed TensorFlow session will attempt to connect to all servers named in the `tf.train.ClusterSpec`, and will block until they respond. This provides a useful barrier that ensures that all workers have become ready to receive computation requests before returning control to the user. This barrier happens before the `MonitoredTrainingSession` code that waits for the chief to initialize variables.
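For concreteness, here is a minimal between-graph sketch of that default behaviour; the cluster addresses, job names, and task index are illustrative assumptions, not taken from your script. Session creation on any task blocks until every server named in the `ClusterSpec` is up and reachable, which is where the "still waiting" log lines come from.

```python
import tensorflow as tf

# Hypothetical one-ps, two-worker cluster; addresses are placeholders.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process starts the server for its own (job_name, task_index).
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([10]))

# With no device filters, creating the session waits for *every* task in
# the cluster (the ps and both workers) to respond before proceeding.
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True) as sess:
    print(sess.run(w))
```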
If you don't want your session to wait on all servers (e.g. just wait on tasks in `"/job:ps"` and not the other tasks in `"/job:worker"`, which is a common between-graph deployment strategy), the easiest option is to specify a "device filter" when you create your session. The device filter is a whitelist of (partial) device specifications that determines which tasks a `tf.Session` will contact at startup. For example, the `mnist_replica.py` test specifies a device filter as part of the `tf.ConfigProto` that is used to configure the session.
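A sketch of that approach along the lines of what `mnist_replica.py` does, assuming the same illustrative cluster as above: each worker whitelists only the parameter servers and its own task, so session creation no longer blocks waiting for the other workers.

```python
import tensorflow as tf

# Same illustrative cluster as above; addresses are placeholders.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})
task_index = 0  # index of this worker task (placeholder value)
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

# Only contact the ps tasks and this worker's own task at startup; the
# other workers can come up later without blocking session creation.
config = tf.ConfigProto(
    device_filters=["/job:ps", "/job:worker/task:%d" % task_index])

with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([10]))

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0),
                                       config=config) as sess:
    print(sess.run(w))
```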