Cuda: pinned memory zero copy problems

前端未结

关注

 1  1499

情歌与酒 2021-01-13 18:15

I tried the code in this link Is CUDA pinned memory zero-copy? The one who asked claims the program worked fine for him But does not work the same way on mine the values do

1条回答

无人及你 (楼主)

2021-01-13 19:03

First of all, to allocate ZeroCopy memory, you have to specify cudaHostAllocMapped flag as an argument to cudaHostAlloc.

cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);

Still the pinnedHostPointer will be used to access the mapped memory from the host side only. To access the same memory from device, you have to get the device side pointer to the memory like this:

double* dPtr;
cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);

Pass this pointer as kernel argument.

testPinnedMemory<<< numBlocks, threadsPerBlock>>>(dPtr);

Also, you have to synchronize the kernel execution with the host to read the updated values. Just add cudaDeviceSynchronize after the kernel call.

The code in the linked question is working, because the person who asked the question is running the code on a 64 bit OS with a GPU of Compute Capability 2.0 and TCC enabled. This configuration automatically enables the Unified Virtual Addressing feature of the GPU in which the device sees host + device memory as a single large memory instead of separate ones and host pointers allocated using cudaHostAlloc can be passed directly to the kernel.

In your case, the final code will look like this:

#include 

__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue+10;
}

int main() 
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);

    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    double* dPtr;
    cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);

    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<< numBlocks, threadsPerBlock>>>(dPtr);
    cudaDeviceSynchronize();

    //read output
    printf("Data after kernel execution: ");
    for (int i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);    
    printf("\n");

    return 0;
}

0 讨论(0)