What is the difference between cudaMemcpy() and cudaMemcpyPeer() for P2P-copy?

问题

I want to copy data from GPU0-DDR to GPU1-DDR directly without CPU-RAM.

As said here on the page-15: http://people.maths.ox.ac.uk/gilesm/cuda/MultiGPU_Programming.pdf

Peer-to-Peer Memcpy
 Direct copy from pointer on GPU A to pointer on GPU B

 With UVA, just use cudaMemcpy(…, cudaMemcpyDefault)
     Or cudaMemcpyAsync(…, cudaMemcpyDefault)

 Also non-UVA explicit P2P copies:
     cudaError_t cudaMemcpyPeer( void * dst, int dstDevice, const void* src, 
        int srcDevice, size_t count )
     cudaError_t cudaMemcpyPeerAsync( void * dst, int dstDevice,
        const void* src, int srcDevice, size_t count, cuda_stream_t stream = 0 )

If I use cudaMemcpy() then do I must at first to set a flag cudaSetDeviceFlags( cudaDeviceMapHost )?
Do I have to use cudaMemcpy() pointers which I got as result from the function cudaHostGetDevicePointer(& uva_ptr, ptr, 0)?
Are there any advantages of function cudaMemcpyPeer(), and if no any advantage, why it is needed?

回答1:

Unified Virtual Addressing (UVA) enables one address space for all CPU and GPU memories since it allows determining physical memory location from pointer value.

Peer-to-peer memcpy with UVA*

When UVA is possible, then cudaMemcpy can be used for peer-to-peer memcpy since CUDA can infer which device "owns" which memory. The instructions you typically need to perform a peer-to-peer memcpy with UVA are the following:

//Check for peer access between participating GPUs: 
cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid_1, gpuid_0);

//Enable peer access between participating GPUs:
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);

//UVA memory copy:
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);

Peer-to-peer memcpy without UVA

When UVA is not possible, then peer-to-peer memcpy is done via cudaMemcpyPeer. Here is an example

// Set device 0 as current
cudaSetDevice(0); 
float* p0;
size_t size = 1024 * sizeof(float);
// Allocate memory on device 0
cudaMalloc(&p0, size); 
// Set device 1 as current
cudaSetDevice(1); 
float* p1;
// Allocate memory on device 1
cudaMalloc(&p1, size); 
// Set device 0 as current
cudaSetDevice(0);
// Launch kernel on device 0
MyKernel<<<1000, 128>>>(p0); 
// Set device 1 as current
cudaSetDevice(1); 
// Copy p0 to p1
cudaMemcpyPeer(p1, 1, p0, 0, size); 
// Launch kernel on device 1
MyKernel<<<1000, 128>>>(p1);

As you can see, while in the former case (UVA possible) you don't need to specify which device the different pointers refer to, in the latter case (UVA not possible) you have to explicitly mention which device the pointers refer to.

The instruction

cudaSetDeviceFlags(cudaDeviceMapHost);

is used to enable host mapping to device memory, which is a different thing and regards host<->device memory movements and not peer-to-peer memory movements, which is the topic of your post.

In conclusion, the answer to your questions are:

NO;
NO;
When possible, enable UVA and use cudaMemcpy (you don't need to specify the devices); otherwise, use cudaMemcpyPeer (and you need to specify the devices).

来源：https://stackoverflow.com/questions/22694518/what-is-the-difference-between-cudamemcpy-and-cudamemcpypeer-for-p2p-copy

标签

cuda

gpgpu

nvidia