I have a simulation program which requires a lot of data. I load the data into the GPUs for calculation, and there is a lot of dependency among the data. Since 1 GPU was not enough
PCI Express is full duplex, so transfers can run at full speed in both directions simultaneously. There should be no "deadlock" like you might experience with synchronous MPI communication, which requires a handshake before proceeding.
As Robert mentioned in a comment, "accessing data over PCIE bus is a lot slower than accessing it from on-board memory". However, it should still be significantly faster than transferring data from GPU1 to the CPU and then from the CPU to GPU2, since the data does not have to be staged through host memory and copied twice.
You should try to minimize the number of GPU-to-GPU transfers, especially if you have to synchronize the data before transferring it (which some algorithms require). You could also overlap transfers with kernel execution to hide some of the transfer latency. Have a look at the Peer-to-Peer Memory Copy section of the CUDA C Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#peer-to-peer-memory-copy
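As a rough sketch of what that looks like in practice, the snippet below checks whether two devices can access each other, enables peer access in both directions, and performs a direct GPU-to-GPU copy. The device IDs (0 and 1) and the buffer size are illustrative assumptions; this requires a system with two P2P-capable GPUs, and error checking is abbreviated for clarity.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    // Check P2P capability in both directions (assumed devices 0 and 1).
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    if (!canAccess01 || !canAccess10) {
        printf("P2P not supported between devices 0 and 1\n");
        return 1;
    }

    // Enable peer access from each device to the other (flags must be 0).
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // Allocate one buffer on each device (1 MiB, an arbitrary size).
    const size_t bytes = 1 << 20;
    float *buf0 = NULL, *buf1 = NULL;
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Direct device-to-device copy (GPU1 -> GPU0); no host staging copy.
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);

    // The async variant in a non-default stream lets the transfer overlap
    // with kernel execution on either device:
    //   cudaMemcpyPeerAsync(buf0, 0, buf1, 1, bytes, stream);

    cudaSetDevice(0);
    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}
```

Note that `cudaMemcpyPeer` synchronizes with outstanding work on both devices; for overlap you want the `Async` variant in a dedicated stream, and you should re-check `cudaDeviceCanAccessPeer` at startup since P2P support depends on the GPUs' topology (same PCIe root complex, NVLink, etc.).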