What is the reason to use parameter server in distributed tensorflow learning?

Short version: can't we store variables in one of the workers and not use parameter servers?

Long version: I want to implement synchro

2 answers

花落未央

2021-01-31 06:56

Another possibility is to use a distributed version of TensorFlow, which automatically handles the data distribution and execution on multiple nodes by using MPI in the backend.

We recently developed one such version at MaTEx: https://github.com/matex-org/matex, and there is a paper describing it: https://arxiv.org/abs/1704.04560

It does synchronous training and provides several parallel dataset reader formats.

We will be happy to help if you run into any problems!

名媛妹妹

2021-01-31 07:06

Using a parameter server can give you better network utilization and lets you scale your models to more machines.

As a concrete example, suppose you have 250M parameters (1 GB as 32-bit floats), it takes 1 second to compute the gradient on each worker, and there are 10 workers. Without a parameter server, each worker has to send and receive 1 GB of data to and from each of the 9 other workers every second, which requires 72 Gbps of full-duplex network capacity per worker. That is not practical.

More realistically, you might have 10 Gbps of network capacity per worker. You can avoid the network bottleneck by using a parameter server split over 8 machines: each worker communicates with each parameter server machine for only 1/8th of the parameters.
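
To check the arithmetic, here is a rough back-of-the-envelope sketch in Python. It assumes 32-bit floats (4 bytes per parameter), that each worker sends its full gradient and receives the full set of parameters once per step, and it counts bandwidth per direction. The function names are made up for illustration and are not part of TensorFlow.

```python
BYTES_PER_PARAM = 4  # assumption: parameters and gradients are 32-bit floats


def all_to_all_gbps(num_params, num_workers, step_seconds):
    """Per-worker bandwidth (Gbps, each direction) when every worker
    exchanges the full gradient with every other worker each step."""
    gigabytes = num_params * BYTES_PER_PARAM / 1e9
    return gigabytes * (num_workers - 1) * 8 / step_seconds


def parameter_server_gbps(num_params, num_workers, num_ps, step_seconds):
    """Bandwidth (Gbps, each direction) per worker and per parameter
    server machine when the parameters are split evenly over num_ps machines."""
    gigabytes = num_params * BYTES_PER_PARAM / 1e9
    # A worker moves the whole model once per step, spread over all shards.
    per_worker = gigabytes * 8 / step_seconds
    # A shard holds 1/num_ps of the model and talks to every worker.
    per_ps = gigabytes / num_ps * num_workers * 8 / step_seconds
    return per_worker, per_ps


direct = all_to_all_gbps(250_000_000, num_workers=10, step_seconds=1.0)
worker, ps = parameter_server_gbps(
    250_000_000, num_workers=10, num_ps=8, step_seconds=1.0
)

print(f"all-to-all, per worker:       {direct:.0f} Gbps")  # 72 Gbps
print(f"with 8 PS shards, per worker: {worker:.0f} Gbps")  # 8 Gbps
print(f"with 8 PS shards, per shard:  {ps:.0f} Gbps")      # 10 Gbps
```

Under these assumptions each worker needs about 8 Gbps per direction and each parameter server machine about 10 Gbps, so 8 shards is roughly the point where a 10 Gbps link stops being the bottleneck.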
