Question
Recently, I have been using TensorFlow to develop an NMT system. To speed up training, I tried to train the system on multiple GPUs using the standard data-parallelism approach widely used in TensorFlow. For example, to run it on a machine with 8 GPUs, I first construct a large batch that is 8 times the size of the batch used on a single GPU. Then I split this large batch equally into 8 mini-batches and train them separately on the different GPUs. Finally, I collect the gradients from all GPUs to update the parameters. However, I find that when I use dynamic_rnn, the average time for one iteration on 8 GPUs is twice as long as for one iteration on a single GPU. I made sure the batch size on each GPU is the same. Does anyone have a better way to speed up RNN training in TensorFlow?
Source: https://stackoverflow.com/questions/47530023/data-parallelism-for-rnn-in-tensorflow
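
For reference, below is a minimal sketch of the tower-style data parallelism described in the question, assuming TensorFlow 1.x (the API current at the time of the post). The model, loss, and constants (`NUM_GPUS`, `HIDDEN`, `VOCAB`, `EMB`) are illustrative placeholders, not the asker's actual NMT model, and masking of padded time steps is omitted for brevity.

```python
import tensorflow as tf

NUM_GPUS = 8
HIDDEN, VOCAB, EMB = 256, 10000, 128

def tower_loss(inputs, lengths, labels):
    """Build one replica of the RNN and return its training loss."""
    embedding = tf.get_variable("emb", [VOCAB, EMB])
    cell = tf.nn.rnn_cell.LSTMCell(HIDDEN)
    outputs, _ = tf.nn.dynamic_rnn(
        cell, tf.nn.embedding_lookup(embedding, inputs),
        sequence_length=lengths, dtype=tf.float32)
    proj_w = tf.get_variable("proj_w", [HIDDEN, VOCAB])
    proj_b = tf.get_variable("proj_b", [VOCAB])
    logits = tf.tensordot(outputs, proj_w, axes=1) + proj_b
    # Loss masking of padded positions is omitted to keep the sketch short.
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))

def average_gradients(tower_grads):
    """Average the (gradient, variable) lists produced by each tower."""
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0),
                         grads_and_vars[0][1]))
    return averaged

# One large batch is fed in and split equally across the GPUs.
inputs  = tf.placeholder(tf.int32, [None, None])   # [big_batch, time]
lengths = tf.placeholder(tf.int32, [None])
labels  = tf.placeholder(tf.int32, [None, None])
in_shards  = tf.split(inputs,  NUM_GPUS, axis=0)
len_shards = tf.split(lengths, NUM_GPUS, axis=0)
lab_shards = tf.split(labels,  NUM_GPUS, axis=0)

optimizer = tf.train.AdamOptimizer(1e-3)
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device("/gpu:%d" % i):
            loss = tower_loss(in_shards[i], len_shards[i], lab_shards[i])
            # All towers share the same weights.
            tf.get_variable_scope().reuse_variables()
            tower_grads.append(optimizer.compute_gradients(loss))

# Apply the averaged gradients once per iteration.
train_op = optimizer.apply_gradients(average_gradients(tower_grads))
```

Note that in this layout the shared weights are created on whichever device builds them first, so every iteration ships parameters out to each GPU and gradients back for averaging; that per-step communication cost comes on top of the RNN computation itself.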