Question
I am executing Monte Carlo sweeps on a population of replicas of my system using OpenCL kernels. After the initial debugging phase I increased some of the arguments to more realistic values and noticed that the program was suddenly eating up large amounts of host memory. I am executing 1000 sweeps on about 4000 replicas, and each sweep consists of 2 kernel invocations. That results in about 8 million kernel invocations.
The source of the memory usage was easy to find (see screenshot).
- While the kernel executions are being enqueued, the memory usage goes up.
- While the kernels are executing the memory usage stays constant.
- Once the kernels finish up the usage goes down to its original state.
- I did not allocate any memory, as can be seen in the memory snapshots.
That means the OpenCL driver is using the memory. I understand that it must keep a copy of all the arguments to the kernel invocations and also the global and local workgroup size, but that does not add up.
The peak memory usage was 4.5GB. Before enqueuing the kernels about 250MB were used. That means OpenCL used about 4.25GB for 8 million invocations, i.e. about half a kilobyte per invocation.
So my questions are:
- Is that kind of memory usage normal and to be expected?
- Are there good/known techniques to reduce memory usage?
- Maybe I should not enqueue so many kernels simultaneously, but how would I do that without causing synchronization, e.g. with clFinish()?
Answer 1:
Enqueueing a large number of kernel invocations needs to be done in a controlled manner so that the command queue does not eat too much memory. First, clFlush may help to some degree. Then clWaitForEvents is necessary to create a synchronization point in the middle: for example, 2000 kernel invocations are enqueued and clWaitForEvents waits for the 1000th one. The device is not going to pause, because another 1000 invocations are already queued up behind it. The same step is then repeated again and again. This can be illustrated as follows:
enqueue 999 kernel commands
while (invocations < 8000000)
{
    enqueue 1 kernel command with an event
    enqueue 999 kernel commands
    wait for the event
}
The optimal number of kernel invocations to enqueue between waits may differ from the one shown here, so it needs to be tuned for the given scenario.
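The batching scheme described above can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation. The names enqueue_throttled, enqueue_fn, wait_fn, sim_queue and simulate_peak are hypothetical and not part of OpenCL. The callbacks stand in for the real calls, so the loop can be checked without a device. The simulated queue assumes an in-order command queue and the worst case in which nothing finishes until the host waits.

```c
#include <stddef.h>

/* Hypothetical callbacks, not part of OpenCL. In a real host program
   `enqueue` would wrap clEnqueueNDRangeKernel (requesting a cl_event only
   when with_event is nonzero) and `wait_oldest` would wrap clWaitForEvents
   on the oldest outstanding event. */
typedef void (*enqueue_fn)(void *ctx, int with_event);
typedef void (*wait_fn)(void *ctx);

/* Submit `total` commands in batches of `batch`, never letting more than
   two batches sit in the queue unfinished. */
static void enqueue_throttled(size_t total, size_t batch,
                              enqueue_fn enqueue, wait_fn wait_oldest,
                              void *ctx)
{
    size_t submitted = 0;
    int markers = 0; /* batches whose final event has not been waited on */

    if (batch == 0)
        batch = 1;   /* guard against a loop that never advances */

    while (submitted < total) {
        size_t n = (total - submitted < batch) ? total - submitted : batch;
        for (size_t i = 0; i < n; ++i)
            enqueue(ctx, i + 1 == n); /* only the last command gets an event */
        submitted += n;

        if (++markers == 2) {         /* two batches pending: wait for the older */
            wait_oldest(ctx);
            --markers;
        }
    }
    while (markers-- > 0)             /* drain what is still pending */
        wait_oldest(ctx);
}

/* Stand-in for a command queue, used only to check the bound on pending
   commands. Assumption: commands complete only when the host waits. */
struct sim_queue {
    size_t enqueued;   /* commands submitted so far */
    size_t completed;  /* commands known to have finished */
    size_t peak;       /* largest number of unfinished commands seen */
    size_t marker[2];  /* positions of outstanding event-carrying commands */
    int head, count;   /* two-slot FIFO over marker[] */
};

static void sim_enqueue(void *ctx, int with_event)
{
    struct sim_queue *q = ctx;
    q->enqueued++;
    if (q->enqueued - q->completed > q->peak)
        q->peak = q->enqueued - q->completed;
    if (with_event) {
        q->marker[(q->head + q->count) % 2] = q->enqueued;
        q->count++;
    }
}

static void sim_wait(void *ctx)
{
    struct sim_queue *q = ctx;
    /* In an in-order queue, everything up to the marker has finished. */
    q->completed = q->marker[q->head];
    q->head = (q->head + 1) % 2;
    q->count--;
}

/* Returns the peak number of unfinished commands for a given batch size. */
size_t simulate_peak(size_t total, size_t batch)
{
    struct sim_queue q = {0};
    enqueue_throttled(total, batch, sim_enqueue, sim_wait, &q);
    return q.peak;
}
```

In an actual host program the enqueue callback would call clEnqueueNDRangeKernel, passing a non-NULL cl_event pointer only for the last command of each batch, and the wait callback would call clWaitForEvents followed by clReleaseEvent on that event. With a batch size of 1000, the number of unfinished commands stays at or below 2000 regardless of the total. If the figure of roughly half a kilobyte per command from the question holds, that would be on the order of 1 MB instead of several gigabytes.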
Source: https://stackoverflow.com/questions/31925672/opencl-enqueued-kernels-using-lots-of-host-memory