CUDA hangs on cudaDeviceSynchronize randomly [closed]

Posted by 冷暖自知 on 2019-12-10 13:12:37

Question


I have a piece of GPU code that has worked for a while. I recently made a couple of minor algorithmic changes, but they didn't touch the CUDA part.

I'm running production runs on a set of three Xeon machines, each with a 780 Ti in it. Each run takes about three minutes to complete, but at this point there have been two cases (out of 5000) where the application has hung for hours (until killed). Both were on the same machine.

The second time, I attached GDB to the running process and got a backtrace that looks like this:

#0  0x00007fff077ffa01 in clock_gettime ()
#1  0x0000003e1ec03e46 in clock_gettime () from /lib64/librt.so.1
#2  0x00002b5b5e302a1e in ?? () from /usr/lib64/libcuda.so
#3  0x00002b5b5dca2294 in ?? () from /usr/lib64/libcuda.so
#4  0x00002b5b5dbbaa4f in ?? () from /usr/lib64/libcuda.so
#5  0x00002b5b5dba8cda in ?? () from /usr/lib64/libcuda.so
#6  0x00002b5b5db94c4f in cuCtxSynchronize () from /usr/lib64/libcuda.so
#7  0x000000000041cd8d in cudart::cudaApiDeviceSynchronize() ()
#8  0x0000000000441269 in cudaDeviceSynchronize ()
#9  0x0000000000408124 in main (argc=11, argv=0x7fff076fa1d8) at src/fraps3d.cu:200

I manually did a frame 8; return; in GDB to force the call to return, which left the process stuck on the next cudaDeviceSynchronize() call. Doing it again got it stuck on the synchronization call after that (every time with the same frames 0 through 8). Stranger still, the failure happened in the middle of the main loop, on roughly the 5000th iteration.
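For illustration, here is a minimal sketch of how a hang like this could be detected instead of waited on indefinitely. It is not the code from fraps3d.cu: the helper name syncWithTimeout, the placeholder kernel, and the 10-second limit are all assumptions made for the example. It records an event on the default stream and polls it with cudaEventQuery, which does not block, so the host can give up after a deadline. It only covers work queued on the default stream (or on streams that synchronize with it), and it does not explain or fix the hang itself.

```cuda
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <unistd.h>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the original program): waits for work
// queued on the default stream, but gives up after timeoutSeconds.
// Returns cudaSuccess if the work finished, cudaErrorNotReady on timeout,
// or any other error reported by the runtime.
static cudaError_t syncWithTimeout(int timeoutSeconds)
{
    cudaEvent_t done;
    cudaError_t err = cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    if (err != cudaSuccess) return err;

    err = cudaEventRecord(done, 0);
    if (err == cudaSuccess) {
        const time_t deadline = time(NULL) + timeoutSeconds;
        // cudaEventQuery never blocks: cudaErrorNotReady means "still running".
        while ((err = cudaEventQuery(done)) == cudaErrorNotReady &&
               time(NULL) < deadline) {
            usleep(1000);  // poll roughly once per millisecond
        }
    }
    cudaEventDestroy(done);
    return err;
}

// Placeholder kernel standing in for the real workload.
__global__ void placeholderKernel(int *out) { *out = 42; }

int main()
{
    int *d_out = NULL;
    cudaError_t err = cudaMalloc((void **)&d_out, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    placeholderKernel<<<1, 1>>>(d_out);

    err = syncWithTimeout(10);  // assumed limit; a real run takes minutes overall
    if (err == cudaErrorNotReady) {
        fprintf(stderr, "GPU work did not finish within the timeout\n");
        abort();  // fail fast so the job scheduler can retry the run
    } else if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_out);
    return 0;
}
```

In a production loop, a call like this could stand in for each cudaDeviceSynchronize(), with the timeout set well above the expected duration of one iteration, so that a stuck run is killed and logged rather than left hanging for hours.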

After killing it, the next job starts and runs properly, so it doesn't appear to be a systemic failure of the execution host.

Any ideas about what could cause a random failure like this?

I'm compiling and running with CUDA V6.0.1, with driver version 331.62.

Source: https://stackoverflow.com/questions/25979764/cuda-hangs-on-cudadevicesynchronize-randomly
