As the TensorFlow paper states, TensorFlow's cross-device communication is achieved by adding "receive" nodes and "send" nodes to devices.
From my understanding,
In TensorFlow, cross-device communication is achieved using the Rendezvous interface, which has multiple different implementations, depending on the deployment. The comment on that interface describes the general idea:
// A Rendezvous is an abstraction for passing a Tensor
// from a producer to a consumer, where the consumer may safely
// request the Tensor before or after it has been produced. A
// producer never blocks when using a Rendezvous. A consumer has the
// choice of making a blocking call or providing a callback: in either
// case, the consumer receives the Tensor as soon as it is available.
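To make that contract concrete, here is a minimal single-process sketch of the idea. It is not TensorFlow's actual class: ToyRendezvous, ToyTensor, and DoneCallback are hypothetical names, the "tensor" is just a vector of floats, and it assumes each key is used for exactly one send and one receive.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor (not tensorflow::Tensor).
using ToyTensor = std::vector<float>;

// Toy rendezvous keyed by string. Illustrative only, not TensorFlow's API.
class ToyRendezvous {
 public:
  using DoneCallback = std::function<void(const ToyTensor&)>;

  // Producer side: never blocks. Either hands the value to a consumer that
  // is already waiting, or parks it in the table until one asks for it.
  void Send(const std::string& key, ToyTensor value) {
    DoneCallback waiter;
    {
      std::lock_guard<std::mutex> lock(mu_);
      auto it = waiters_.find(key);
      if (it == waiters_.end()) {
        ready_.emplace(key, std::move(value));
        return;
      }
      waiter = std::move(it->second);
      waiters_.erase(it);
    }
    waiter(value);  // Run the callback outside the lock.
  }

  // Consumer side: may be called before or after Send() for the same key.
  // The callback fires as soon as the value is available.
  void RecvAsync(const std::string& key, DoneCallback done) {
    ToyTensor value;
    {
      std::lock_guard<std::mutex> lock(mu_);
      auto it = ready_.find(key);
      if (it == ready_.end()) {
        waiters_.emplace(key, std::move(done));
        return;
      }
      value = std::move(it->second);
      ready_.erase(it);
    }
    done(value);
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::string, ToyTensor> ready_;      // Sent, not yet received.
  std::unordered_map<std::string, DoneCallback> waiters_; // Requested, not yet sent.
};
```

Whichever side arrives first leaves an entry in one of the two tables, and the side that arrives second completes the exchange, which is why the order of the calls does not matter.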
As you noted in your question, TensorFlow represents communication in the dataflow graph using Send and Recv ops that are added to the graph automatically when the graph is partitioned across devices. For each edge that has a source and destination on different devices, the graph partitioner inserts a pair of Send and Recv ops that share the same "rendezvous key" (an automatically generated string name that is used as a key in the rendezvous' index of pending tensors to be communicated). The implementation of the Send op is simple: it calls Rendezvous::Send(), passing in its rendezvous key and single input tensor, then returns immediately without blocking. The implementation of the Recv op is slightly more complicated: it registers a callback to be called when the tensor with the given key becomes available. That callback is responsible for "producing" the output of the Recv op, and unblocking subsequent computation.
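The partitioning step can be sketched roughly as follows. This is a simplified illustration, not the real graph partitioner: ToyEdge, ToySendRecvPair, and InsertSendRecv are hypothetical names, and the key format used here (source device, destination device, and producing node joined by semicolons) is an assumption made for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical edge in a placed dataflow graph: a producing node and a
// consuming node, each already assigned to a device.
struct ToyEdge {
  std::string src_node, src_device;
  std::string dst_node, dst_device;
};

// Hypothetical record of one inserted Send/Recv pair.
struct ToySendRecvPair {
  std::string send_device;  // The Send op runs where the tensor is produced.
  std::string recv_device;  // The Recv op runs where the tensor is consumed.
  std::string key;          // Shared by both ops so they can find each other.
};

// For every edge whose endpoints sit on different devices, emit one
// Send/Recv pair with a shared key. Same-device edges are left untouched.
std::vector<ToySendRecvPair> InsertSendRecv(const std::vector<ToyEdge>& edges) {
  std::vector<ToySendRecvPair> pairs;
  for (const ToyEdge& e : edges) {
    if (e.src_device == e.dst_device) continue;
    // Assumed key format, chosen only for illustration.
    pairs.push_back({e.src_device, e.dst_device,
                     e.src_device + ";" + e.dst_device + ";" + e.src_node});
  }
  return pairs;
}
```

For example, given an edge from node b on /cpu:0 to node c on /gpu:0, this sketch yields a single pair whose key both the Send on the CPU and the Recv on the GPU would pass to the rendezvous.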
The Rendezvous implementations perform the actual work of transferring the data:
IntraProcessRendezvous handles the transfer of data between devices in the same process. In the (unlikely) event that the transfer is between two CPU devices in the same process, the transfer can be achieved by a simple Tensor assignment. Otherwise, TensorFlow kicks off a device-specific DMA routine to transfer data between a CPU and GPU device.
The BaseRemoteRendezvous class and its subclasses handle cross-device communication in the case that the sender and receiver can be in different processes. The main implementation of this class is RpcRemoteRendezvous, which uses gRPC to handle the remote transfers.