What's the correct and most efficient way to use the mapped (zero-copy) memory mechanism in the Nvidia OpenCL environment?


The first thing to note here is that OpenCL does not allow pinned zero-copy (it is available in 2.0, but not yet ready for use). This means you will have to perform a copy to GPU memory anyway.

There are 2 ways to perform the memory copy:

  1. clEnqueueWriteBuffer()/clEnqueueReadBuffer(): These perform a direct copy between an OpenCL memory object in the context (typically in device memory) and a host-side pointer. The efficiency is high, although they may not be efficient for small numbers of bytes. (A minimal sketch follows this list.)

  2. clEnqueueMapBuffer()/clEnqueueUnmapBuffer(): These calls first map a device memory region into host memory. The map generates a 1:1 copy of the memory. After the map, you can work with that memory using memcpy() or other approaches. Once you finish editing the memory, you call unmap, which transfers the memory back to the device. Typically this option is faster, since OpenCL gives you the pointer when you map, and you are likely writing directly into the host-side cache of the context. The counterpart is that when you call map, the memory transfer happens the other way around (GPU->host).
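A minimal sketch of option 1 (the names queue, devBuf, hostPtr and size are illustrative placeholders, not taken from the original code):

#include <CL/cl.h>

/* Option 1 sketch: a blocking write to the device, and later a blocking read back. */
void copy_with_write_read(cl_command_queue queue, cl_mem devBuf,
                          void* hostPtr, size_t size)
{
    //host -> device: blocking, so hostPtr can be reused right after the call returns
    cl_int err = clEnqueueWriteBuffer(queue, devBuf, CL_TRUE, 0, size,
                                      hostPtr, 0, NULL, NULL);

    // ... run kernels that consume devBuf ...

    //device -> host: blocking read of the results
    err = clEnqueueReadBuffer(queue, devBuf, CL_TRUE, 0, size,
                              hostPtr, 0, NULL, NULL);
}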

EDIT: For the map/unmap case, if you select the CL_MAP_WRITE flag for the map, it probably does NOT trigger a device-to-host copy on the map operation. Likewise, a read-only map (CL_MAP_READ) will NOT trigger a host-to-device copy on the unmap. A minimal sketch of this write-only map pattern follows.
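This sketch uses the same placeholder names as the previous one (queue, devBuf, hostPtr, size):

#include <string.h>   /* memcpy */

/* Option 2 sketch: map, fill the returned pointer, then unmap. */
void copy_with_map_unmap(cl_command_queue queue, cl_mem devBuf,
                         const void* hostPtr, size_t size)
{
    cl_int err;

    //blocking write-only map: as noted above, this probably avoids the device-to-host copy
    void* mapped = clEnqueueMapBuffer(queue, devBuf, CL_TRUE, CL_MAP_WRITE,
                                      0, size, 0, NULL, NULL, &err);

    //fill the mapped region on the host side
    memcpy(mapped, hostPtr, size);

    //the unmap queues the host-to-device transfer of the edited region
    err = clEnqueueUnmapMemObject(queue, devBuf, mapped, 0, NULL, NULL);
    clFinish(queue);  //wait until the transfer has actually finished
}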

In your example, it is clear that the Map/Unmap approach will be faster. However, if you do memcpy() inside a loop without calling unmap, you are effectively NOT copying anything to the device side. If you put the map/unmap inside the loop, performance will decrease, and if the buffer size is small (1MB) the transfer rates will be very poor. The same thing happens in the Write/Read case if you perform the writes in a for loop with small sizes.

In general, you should not use 1MB sizes, since the overhead will be very high in these cases (unless you queue many write calls in non-blocking mode).

PS: My personal recommendation is simply to use the normal Write/Read calls, since the difference is not noticeable for most common uses, especially with overlapped I/O and kernel execution. But if you really need the performance, use map/unmap, or pinned memory with Read/Write; it should give 10-30% better transfer rates.
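A hedged sketch of the pinned-memory variant mentioned above, following the pattern of the nVIDIA bandwidth sample (ctx, queue and the buffer names are illustrative): allocate a host buffer with CL_MEM_ALLOC_HOST_PTR, map it once to obtain a pinned host pointer, and use that pointer as the source of the Write calls.

cl_int err;

//host-side buffer backed by pinned memory (on nVIDIA, ALLOC_HOST_PTR allocations are pinned)
cl_mem pinnedBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                  memSize, NULL, &err);
//ordinary device buffer that the kernels will use
cl_mem devBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, memSize, NULL, &err);

//map the pinned buffer once to get a host pointer, and keep it mapped
unsigned char* h_pinned = (unsigned char*)clEnqueueMapBuffer(queue, pinnedBuf, CL_TRUE,
                                  CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &err);

//fill h_pinned with the data to send, then copy it to the device buffer
err = clEnqueueWriteBuffer(queue, devBuf, CL_FALSE, 0, memSize,
                           h_pinned, 0, NULL, NULL);
clFinish(queue);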


EDIT: Regarding the behaviour you are experiencing, after examining the nVIDIA code I can explain it. The problem you see is mainly generated by blocking and non-blocking calls that "hide" the overheads of the OpenCL calls.

The first code: (nVIDIA)

  • Is queueing a BLOCKING map once
  • Then performing many memcpys (but only the last one will go to the GPU side).
  • Then unmapping it in a non-blocking manner.
  • Reading the result without clFinish()

This code example is WRONG! It is not really measuring the host-to-GPU copy speed, because the memcpy() does not ensure a copy to the GPU and because the clFinish() is missing. That's why you even see speeds above the hardware limit.

The second code: (yours)

  • Is queueing a BLOCKING map many times in a loop.
  • Then performing 1 memcpy() for each map.
  • Then unmapping it in a non-blocking manner.
  • Reading the result without clFinish()

Your code only lacks the clFinish(). However, since the map in the loop is blocking, the results are almost correct. The problem is that the GPU is idle until the CPU reaches the next iteration, so you are seeing unrealistically low performance.

The Write/Read code: (nVIDIA)

  • Is queueing a non-blocking write many times.
  • Reading the result with clFinish()

This code is properly doing the copy, in parallel, and you are seeing the real bandwidth here.
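In rough form, that Write/Read measurement looks like the sketch below (here cmDevData is a single device buffer, and start_timer()/stop_timer() stand in for whatever timing facility is used):

start_timer();
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    //non-blocking write: the call returns immediately and the copies can overlap
    ciErrNum = clEnqueueWriteBuffer(cqCommandQueue, cmDevData, CL_FALSE, 0, memSize,
                                    h_data, 0, NULL, NULL);
}
clFinish(cqCommandQueue);  //wait for every queued copy before stopping the timer
stop_timer();
//bandwidth = (MEMCOPY_ITERATIONS * memSize) / elapsed time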

In order to convert the map example into something comparable to the Write/Read case, you should do it like this (this is without pinned memory):

//create N buffers on the device
for(int i = 0; i < MEMCOPY_ITERATIONS; i++)
    cmDevData[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, memSize, NULL, &ciErrNum);

//map each device buffer into host memory (blocking, write-only map)
void* dm_idata[MEMCOPY_ITERATIONS];
for(int i = 0; i < MEMCOPY_ITERATIONS; i++)
    dm_idata[i] = clEnqueueMapBuffer(cqCommandQueue, cmDevData[i], CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);

//Measure the STARTTIME
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    //copy data from the host into the mapped pointer
    memcpy(dm_idata[i], h_data, memSize);

    //unmap the device buffer (non-blocking): this queues the host-to-device transfer
    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData[i], dm_idata[i], 0, NULL, NULL);
}
clFinish(cqCommandQueue);

//Measure the ENDTIME

You cannot reuse the same buffer in the mapped case, since otherwise you would block after each iteration, and the GPU would be idle until the CPU queues the next copy job.
