Question
I ran valgrind on one of my open-source OpenCL codes (https://github.com/fangq/mmc), and it detected a lot of memory leaks in the OpenCL host code. Most of them pointed back to the line where I create the context object using clCreateContextFromType.
I double checked all my OpenCL variables, command queues, kernels and programs, and made sure that they are all properly released, but still, when testing on sample programs, every call to the mmc_run_cl() function bumps up memory by 300MB-400MB that is not released on return.
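For reference, the cleanup at the end of the host code releases everything in the usual order; a simplified sketch of that teardown (not the exact mmc code; mcxkernel and workdev are placeholder names here) looks like this:
/* simplified teardown sketch -- placeholder names, not the exact mmc code */
for (i = 0; i < workdev; i++) {            /* workdev: number of active devices */
    clReleaseKernel(mcxkernel[i]);         /* kernels first */
    clReleaseCommandQueue(mcxqueue[i]);    /* then each per-device queue */
}
clReleaseProgram(mcxprogram);
clReleaseContext(mcxcontext);              /* last: the object valgrind keeps flagging */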
You can reproduce the valgrind report by running the below commands in a terminal:
git clone https://github.com/fangq/mmc.git
cd mmc/src
make clean
make all
cd ../examples/validation
valgrind --show-leak-kinds=all --leak-check=full ../../src/bin/mmc -f cube2.inp -G 1 -s cube2 -n 1e4 -b 0 -D TP -M G -F bin
assuming your system has gcc/git/libOpenCL and valgrind installed. Change the -G 1 input to a different number if you want to run it on other OpenCL devices (add -L to list them).
In the below table, I list the repeat count of each valgrind-detected leak on an NVIDIA GPU (Titan V) on a Linux box (Ubuntu 16.04) with the latest driver + CUDA 9.
Again, most leaks are associated with the clCreateContextFromType line, which I assume means some GPU memory is not released; however, I did release all GPU resources at the end of the host code.
Do you notice anything that I missed in my host code? Your input is much appreciated.
counts | error message
------------------------------------------------------------------------------------
380 ==27828== by 0x402C77: main (mmc.c:67)
Code: entry point to the below errors
64 ==27828== by 0x41CF02: mcx_list_gpu (mmc_cl_utils.c:135)
Code: OCL_ASSERT((clGetPlatformIDs(0, NULL, &numPlatforms)));
4 ==27828== by 0x41D032: mcx_list_gpu (mmc_cl_utils.c:154)
Code: context=clCreateContextFromType(cps,devtype[j],NULL,NULL,&status);
58 ==27828== by 0x41DF8A: mmc_run_cl (mmc_cl_host.c:111)
Code: entry point to the below errors
438 ==27828== by 0x41E006: mmc_run_cl (mmc_cl_host.c:124)
Code: OCL_ASSERT(((mcxcontext=clCreateContextFromType(cprops,CL_DEVICE_TYPE_ALL,...));
13 ==27828== by 0x41E238: mmc_run_cl (mmc_cl_host.c:144)
Code: OCL_ASSERT(((mcxqueue[i]=clCreateCommandQueue(mcxcontext,devices[i],prop,&status),status)));
1 ==27828== by 0x41E7A6: mmc_run_cl (mmc_cl_host.c:224)
Code: OCL_ASSERT(((gprogress[0]=clCreateBufferNV(mcxcontext,CL_MEM_READ_WRITE, NV_PIN, ...);
1 ==27828== by 0x41E7F9: mmc_run_cl (mmc_cl_host.c:225)
Code: progress = (cl_uint *)clEnqueueMapBuffer(mcxqueue[0], gprogress[0], CL_TRUE, ...);
10 ==27828== by 0x41EDFA: mmc_run_cl (mmc_cl_host.c:290)
Code: status=clBuildProgram(mcxprogram, 0, NULL, opt, NULL, NULL);
7 ==27828== by 0x41F95C: mmc_run_cl (mmc_cl_host.c:417)
Code: OCL_ASSERT((clEnqueueReadBuffer(mcxqueue[devid],greporter[devid],CL_TRUE,0,...));
Update [04/11/2020]:
Reading @doqtor's comment, I ran the following test on 5 different devices: 2 NVIDIA GPUs, 2 AMD GPUs and 1 Intel CPU. What he said was correct - the memory leak does not happen with the Intel OpenCL library, and I found that the AMD OpenCL driver is fine too. The only problem is that the NVIDIA OpenCL library seems to have a leak on both GPUs I tested (Titan V and RTX 2080).
My test results are below. Memory/CPU profiling was done using psrecord, introduced in this post.
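For anyone reproducing the profiling: a psrecord command along the lines below (per the psrecord documentation; the exact flags I used may differ) records and plots the memory usage of one mmc run:
psrecord "../../src/bin/mmc -f cube2.inp -G 1 -s cube2 -n 1e4 -b 0 -D TP -M G -F bin" --interval 1 --plot mmc_memory.png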
I will open a new question with a bounty on how to reduce this memory leak with the NVIDIA OpenCL library. If you have any experience with this, please share; I will post the link below. Thanks.
Answer 1:
I double checked all my OpenCL variables, command queues, kernels and programs, and made sure that they are all properly released...
Well, I still found one (tiny) memory leak in the mmc code:
==15320== 8 bytes in 1 blocks are definitely lost in loss record 14 of 1,905
==15320== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15320== by 0x128D48: mmc_run_cl (mmc_cl_host.c:137)
==15320== by 0x11E71E: main (mmc.c:67)
The memory allocated for greporter isn't freed, so that's to be fixed by you.
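A minimal sketch of the fix, assuming greporter is a malloc()'ed host array of cl_mem handles as the surrounding code suggests (workdev is a placeholder for the device count):
for (i = 0; i < workdev; i++)
    clReleaseMemObject(greporter[i]);  /* release each per-device report buffer */
free(greporter);                       /* free the 8-byte host array valgrind flags */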
The rest are potential memory leaks in the OpenCL library. They may or may not be actual memory leaks; for example, the library may use custom memory allocators that valgrind does not recognize, or it may do some other tricks. There are a lot of threads about that:
- clGetPlatformIDs Memory Leak
- https://software.intel.com/en-us/forums/opencl/topic/753786
- https://github.com/KhronosGroup/OpenCL-ICD-Loader/issues/13
- OpenCL clGetPlatformIDs gives around 230 valgrind memcheck errors
In general you can't do much about that unless you want to dive into the library code and do something about it. I would suggest carefully suppressing the reported leaks that come from the library. The suppression file can be generated as described in the valgrind manual: https://valgrind.org/docs/manual/manual-core.html#manual-core.suppress
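For example, valgrind can emit ready-to-paste suppression entries, which you can then feed back in on later runs:
# 1. record suppression blocks for every reported leak
valgrind --leak-check=full --show-leak-kinds=all --gen-suppressions=all --log-file=mmc_supp.log ../../src/bin/mmc -f cube2.inp -G 1 -s cube2 -n 1e4 -b 0 -D TP -M G -F bin
# 2. copy the { ... } blocks from mmc_supp.log into e.g. opencl.supp, then re-run with
valgrind --leak-check=full --suppressions=opencl.supp ../../src/bin/mmc -f cube2.inp -G 1 -s cube2 -n 1e4 -b 0 -D TP -M G -F bin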
... but still, when testing on sample programs, every call to the mmc_run_cl() function bumps up memory by 300MB-400MB and won't release at return
How did you check that? I haven't seen memory suspiciously growing. I set -n 1000e4 and it ran for about 2 minutes, during which the allocated memory stayed constant at ~0.6% of my RAM size. Note that I didn't use NVIDIA CUDA but POCL on an Intel GPU and CPU, linked with the libOpenCL installed from the ocl-icd-libopencl1:amd64 package on Ubuntu 18.04. So you may try to give that a go and check if that changes anything.
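On Ubuntu, that switch can be as simple as installing the POCL ICD next to the ICD loader (package names as on Ubuntu 18.04) and checking which platforms are visible:
sudo apt-get install ocl-icd-libopencl1 pocl-opencl-icd clinfo
clinfo | grep -i 'platform name'   # the mmc binary should now see the POCL platform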
======== Update ================================
I've re-run it as you described in the comment: after the first iteration the memory usage was 0.6%, after the 2nd iteration it increased to 0.9%, and the following iterations didn't increase memory usage further. Valgrind also didn't report anything new beyond what I observed earlier. So I would suggest linking with a libOpenCL other than nvidia-cuda's and retesting.
Source: https://stackoverflow.com/questions/61091039/opencl-clcreatecontextfromtype-function-results-in-memory-leaks