How to copy memory between different gpus in cuda

后端未结

关注

 1  917

Currently I\'m work with two gtx 650 . My program resembles in simple Clients/Server structure. I distribute the work threads on the two gpus. The Server thread need to gather

相关标签:

1条回答

既然无缘

2021-02-04 21:53
Transferring data from one GPU to another will often require a "staging" through host memory. The exception to this is when the GPUs and the system topology support peer-to-peer (P2P) access and P2P has been explicitly enabled. In that case, data transfers can flow directly over the PCIE bus from one GPU to another.

In either case (with or without P2P being available/enabled) the typical cuda runtime API call would be cudaMemcpyPeer/cudaMemcpyPeerAsync as demonstrated in the cuda p2pBandwidthLatencyTest sample code.

On windows, one of the requirements of P2P is that both devices be supported by a driver in TCC mode. TCC mode is, for the most part, not an available option for GeForce GPUs (recently, an exception is made for GeForce Titan family GPUs using drivers and runtime available in the CUDA 7.5RC toolkit.)

Therefore, on Windows, these GPUs will not be able to take advantage of direct P2P transfers. Nevertheless, a nearly identical sequence can be used to transfer data. The CUDA runtime will detect the nature of the transfer, and perform an allocation "under the hood" to create a staging buffer. The transfer will then be completed in 2 parts: a transfer from the originating device to the staging buffer, and a transfer from the staging buffer to the destination device.

The following is a fully worked example showing how to transfer data from one GPU to another, while taking advantage of P2P access if it is available:
```
$ cat t850.cu
#include <stdio.h>
#include <math.h>

#define SRC_DEV 0
#define DST_DEV 1

#define DSIZE (8*1048576)

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


int main(int argc, char *argv[]){

  int disablePeer = 0;
  if (argc > 1) disablePeer = 1;
  int devcount;
  cudaGetDeviceCount(&devcount);
  cudaCheckErrors("cuda failure");
  int srcdev = SRC_DEV;
  int dstdev = DST_DEV;
  if (devcount <= max(srcdev,dstdev)) {printf("not enough cuda devices for the requested operation\n"); return 1;}
  int *d_s, *d_d, *h;
  int dsize = DSIZE*sizeof(int);
  h = (int *)malloc(dsize);
  if (h == NULL) {printf("malloc fail\n"); return 1;}
  for (int i = 0; i < DSIZE; i++) h[i] = i;
  int canAccessPeer = 0;
  if (!disablePeer) cudaDeviceCanAccessPeer(&canAccessPeer, srcdev, dstdev);
  cudaSetDevice(srcdev);
  cudaMalloc(&d_s, dsize);
  cudaMemcpy(d_s, h, dsize, cudaMemcpyHostToDevice);
  if (canAccessPeer) cudaDeviceEnablePeerAccess(dstdev,0);
  cudaSetDevice(dstdev);
  cudaMalloc(&d_d, dsize);
  cudaMemset(d_d, 0, dsize);
  if (canAccessPeer) cudaDeviceEnablePeerAccess(srcdev,0);
  cudaCheckErrors("cudaMalloc/cudaMemset fail");
  if (canAccessPeer) printf("Timing P2P transfer");
  else printf("Timing ordinary transfer");
  printf(" of %d bytes\n", dsize);
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  cudaMemcpyPeer(d_d, dstdev, d_s, srcdev, dsize);
  cudaCheckErrors("cudaMemcpyPeer fail");
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);
  cudaSetDevice(dstdev);
  cudaMemcpy(h, d_d, dsize, cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy fail");
  for (int i = 0; i < DSIZE; i++) if (h[i] != i) {printf("transfer failure\n"); return 1;}
  printf("transfer took %fms\n", et);
  return 0;
}

$ nvcc -arch=sm_20 -o t850 t850.cu
$ ./t850
Timing P2P transfer of 33554432 bytes
transfer took 5.135680ms
$ ./t850 disable
Timing ordinary transfer of 33554432 bytes
transfer took 7.274336ms
$
```
Notes:
1. Passing any command line parameter will disable the use of P2P even if it is available.
2. The above results are for a system where P2P access is possible, and both GPUs are connected via a PCIE Gen2 link, capable of about 6GB/s transfer bandwidth in a single direction. The P2P transfer time is consistent with this (32MB/5ms ~= 6GB/s). The non-P2P transfer time is longer, but not double. This is due to the fact that for transfers to/from the staging buffer, after some data is transferred into the staging buffer, the outgoing transfer can begin. The driver/runtime takes advantage of this to partially overlap the data transfers.
Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support are the tools provided that query the runtime via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.

Note: The TCC driver requirement in windows has been relaxed with recent drivers. With recent drivers, it should be possible to exchange P2P data between devices in WDDM mode, as long as the rest of the requirements are met.

The statement about TCC support is a general one. Not all GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.
0 讨论(0)
发布评论:

提交评论
- 加载中...