Distributed TensorFlow parameter server and workers


Question


I was closely following the distributed ImageNet TF training example.

I am not able to understand how the data gets distributed when this example is run on two different workers. In theory, the different workers should see different parts of the data. Also, which part of the code tells the parameters to be placed on the parameter server? In the multi-GPU example, for instance, there is an explicit section for 'cpu:0'.


Answer 1:


The different workers see different parts of the data by virtue of dequeuing mini-batches of images from a single queue of preprocessed images. To elaborate, in the distributed setup for training the ImageNet model, the input images are preprocessed by multiple threads, and the preprocessed images are stored in a single RandomShuffleQueue. You can look for tf.RandomShuffleQueue in this file to see how this is done. The multiple workers are organized as 'Inception towers', and each tower dequeues a mini-batch of images from the same queue, so each one gets a different part of the input.

The picture here answers the second part of your question. Look for slim.variables.VariableDeviceChooser in this file. Its logic ensures that Variable objects are assigned evenly across the workers that act as parameter servers. All the other workers, which do the actual training, fetch the variables at the beginning of a step and update them at the end of the step.
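A minimal sketch of the shared-queue pattern described above, using the TF 1.x queue API. This is not the Inception input code itself: the image shape, queue capacity, thread count, and the `preprocess_image` helper are illustrative assumptions.

```python
import tensorflow as tf  # TensorFlow 1.x API

# Hypothetical stand-in for the real per-thread decode/augment pipeline.
def preprocess_image():
    return tf.random_uniform([224, 224, 3])  # fake preprocessed image

# One shared queue of preprocessed images (capacity values are illustrative).
images_queue = tf.RandomShuffleQueue(
    capacity=1024,
    min_after_dequeue=256,
    dtypes=[tf.float32],
    shapes=[[224, 224, 3]])

# Several preprocessing threads all enqueue into the same queue.
enqueue_ops = [images_queue.enqueue([preprocess_image()]) for _ in range(4)]
tf.train.add_queue_runner(tf.train.QueueRunner(images_queue, enqueue_ops))

# Each Inception tower dequeues its own mini-batch from the shared queue,
# so different towers see different images.
batch_size = 32
tower_batches = [images_queue.dequeue_many(batch_size) for _ in range(2)]
```

Each `dequeue_many` call atomically removes `batch_size` images from the shared queue, which is why two towers never see the same preprocessed image.

For the variable-placement part, here is a similarly hedged sketch of a round-robin device chooser in the spirit of `slim.variables.VariableDeviceChooser`. The class name, the `ps`/`worker` job names, and the two-task cluster are assumptions for illustration; stock TensorFlow 1.x also provides `tf.train.replica_device_setter` for this kind of placement.

```python
import tensorflow as tf  # TensorFlow 1.x API

class RoundRobinVariableChooser(object):
    """Illustrative stand-in for slim.variables.VariableDeviceChooser:
    assigns each Variable op to the next parameter-server task in turn."""

    def __init__(self, num_ps_tasks):
        self._num_ps_tasks = num_ps_tasks
        self._next_task = 0

    def __call__(self, op):
        if op.type in ('Variable', 'VariableV2'):
            device = '/job:ps/task:%d' % self._next_task
            self._next_task = (self._next_task + 1) % self._num_ps_tasks
            return device
        # Everything else (the actual compute ops) stays on the worker.
        return '/job:worker'

chooser = RoundRobinVariableChooser(num_ps_tasks=2)
with tf.device(chooser):
    w1 = tf.get_variable('w1', shape=[1024, 1024])  # placed on /job:ps/task:0
    w2 = tf.get_variable('w2', shape=[1024, 1024])  # placed on /job:ps/task:1
    w3 = tf.get_variable('w3', shape=[1024, 1024])  # placed on /job:ps/task:0
```

Because the chooser only intercepts Variable ops, the forward and backward passes built inside the same `tf.device` scope still run on the worker; only the parameters live on the parameter-server tasks, which is the even spread the answer describes.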



Source: https://stackoverflow.com/questions/38185702/distributed-tensorflow-parameter-server-and-workers
