Numba support for CUDA cooperative block synchronization? Python Numba CUDA grid sync

∥☆過路亽.° submitted on 2020-01-24 12:44:25

Question


Numba CUDA has syncthreads() to sync all threads within a block. How can I sync all blocks in a grid without exiting the current kernel?

In CUDA C there is the cooperative groups feature to handle this case. I can't find anything like that in the Numba docs.

Why this matters a lot!

This situation arises in reductions, where each block computes a partial result and you then want the maximum over all blocks.

Trivially, one could push these into the stream as two separate kernel calls. This ensures that every block's computation has finished before the reduce call starts.
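To make the two-call approach concrete, here is a minimal CPU-only sketch in plain Python. It is a model, not Numba code: each function stands in for one kernel launch, and the names block_maxima, final_maximum and normalize, as well as the block size of 4, are hypothetical choices for illustration.

```python
# CPU-only model of a two-pass maximum reduction. Each function stands in for
# one kernel launch; the return from block_maxima() plays the role of the
# kernel boundary, which is the only grid-wide sync point assumed here.

def block_maxima(values, block_size):
    """Pass 1: every 'block' reduces its own slice to one partial maximum."""
    return [max(values[start:start + block_size])
            for start in range(0, len(values), block_size)]


def final_maximum(partials):
    """Pass 2: a single 'block' reduces the per-block partial maxima."""
    return max(partials)


def normalize(values, block_size=4):
    """Hypothetical driver: find the global maximum, then divide by it."""
    partials = block_maxima(values, block_size)  # "launch" 1
    amax = final_maximum(partials)               # "launch" 2, sees all partials
    scaled = [v / amax for v in values]
    return scaled, partials, amax


data = [3.0, 7.0, 1.0, 4.0, 9.0, 2.0, 8.0, 6.0, 5.0, 10.0]
scaled, partials, amax = normalize(data)
print(partials, amax)
```

The second pass only ever reads partial results that the first pass has completely written. That guarantee is exactly what a grid-wide sync would have to provide if both passes lived inside one kernel.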

But if those two operations are lightweight, the execution time is dominated by the cost of setting up the kernels, not by the operations themselves. If they sit inside a Python loop, the loop could easily run 1000 times faster if the loop and the two kernel calls could be fused into one kernel:

for u in range(100000):
    Amax = CudaFindArrayMaximum(A)
    CudaDivideArray(A, Amax)
    CudaDoSomethingWithMatrix(A)

Since each of the three lines in the loop is a fast kernel, I'd like to put them, together with the loop, into one single kernel.

But I can't think of any way to do that without syncing across all the blocks in the grid. Indeed, even the very first step of finding the maximum is tricky for the same reason.


Answer 1:


In CUDA, without the use of cooperative groups, there is no safe or reliable mechanism to do a grid-wide sync (other than using the kernel launch boundary). In fact, providing this capability was one of the motivations behind the introduction of cooperative groups.

Currently, Numba does not expose cooperative groups functionality, so there is no safe or reliable way to achieve this within Numba's capabilities.

Refer to this question for an example of a possible hazard in trying to do this in CUDA without cooperative groups.
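As a rough illustration of one well-known hazard of a hand-rolled grid barrier (an assumption about the kind of hazard meant, since the linked question is not reproduced here), the toy model below simulates blocks spinning on a global arrival counter. It is plain Python, not CUDA. The function name naive_grid_barrier_completes and the block counts are hypothetical, and real residency limits depend on the hardware and on the kernel's resource usage.

```python
# Toy scheduler model, not real CUDA: shows why a hand-rolled spin barrier can
# hang when the grid has more blocks than the GPU can keep resident at once.

def naive_grid_barrier_completes(num_blocks, max_resident, max_steps=100):
    """Return True if all blocks pass the barrier, False if it never releases."""
    pending = num_blocks   # blocks not yet scheduled onto the device
    spinning = 0           # resident blocks waiting at the barrier
    arrived = 0            # global counter the waiting blocks spin on

    for _ in range(max_steps):
        # The scheduler can only fill slots that are actually free.
        launched = min(pending, max_resident - spinning)
        pending -= launched

        # Newly launched blocks reach the barrier and bump the counter.
        spinning += launched
        arrived += launched

        if arrived == num_blocks:
            return True    # every block arrived, so the barrier releases

        # Otherwise the spinning blocks keep their slots and never retire,
        # so no pending block can ever be scheduled.
    return False


print(naive_grid_barrier_completes(num_blocks=8, max_resident=8))
print(naive_grid_barrier_completes(num_blocks=16, max_resident=8))
```

In this model the barrier only releases when the whole grid fits on the device at once. As soon as some blocks are still waiting to be scheduled, the resident blocks wait for arrivals that can never happen.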



Source: https://stackoverflow.com/questions/54595609/numba-support-for-cuda-cooperative-block-synchronization-python-numba-cuda-gri
