why can't I get the right sum of 1D array with numba (cuda python)?

扶醉桌前 提交于 2019-11-27 16:28:59
talonmies

The reason you don't get the sum you expect is because you haven't written code to produce that sum.

The basic CUDA programming model (whether you use CUDA C, Fortran or Python as your language) is that you write kernel code which is executed by each thread. You have written code for each thread to read and sum part of the input array. You have not written any code for those threads to share and sum their individual partial sums into a final sum.

There is an extremely well described algorithm for doing this -- it is called a parallel reduction. You can find an introduction to the algorithm in a PDF which ships in the examples of every version of the CUDA toolkit, or download a presentation about it here. You can also read a more modern version of the algorithm which uses newer features of CUDA (warp shuffle instructions and atomic transactions) here.

After you have studied the reduction algorithm, you will need to adapt the standard CUDA C kernel code into the Numba Python kernel dialect. At the bare minimum, something like this:

tpb = (1,3) 

@cuda.jit
def calcu_sum(D,T):

    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
    sbuf = cuda.shared.array(tpb, float32)

    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i +=bh

    print('su:',su)

    sbuf[0,ty] = su
    cuda.syncthreads()

    if ty == 0:
        T[0,0] = 0
        for i in range(0, bh):
            T[0,0] += sbuf[0,i]
        print('T:',T[0,0])

will probably do what you want, although it is still a long way from an optimal parallel shared memory reduction, as you will see when you read the material I provided links to.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!