Unable to execute device kernel in CUDA

-上瘾入骨i 2021-01-28 17:37

I am trying to call a device kernel within a global kernel. My global kernel is a Matrix Multiplication and my device kernel is finding the maximum value and the index in each c

1 answer
    小鲜肉 (original poster)
    2021-01-28 18:02

    Apparently you are attempting to find the maximum value in each column, as well as the offset to that value.

    But all of your threads in y are hammering on the same location for the max value (max[x*2 + 0]). That is a race condition: with unsynchronized concurrent reads and writes to one location, there is no way to tell which thread's update survives. You should use atomic operations, or another method such as a parallel reduction, when multiple threads update a single max value this way.
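
    The reduction alternative mentioned above can be sketched roughly as follows. This is only an illustration, not the asker's code: the kernel name columnMaxReduce, the THREADS block size, the row-major layout, and the separate maxVal/maxRow output arrays are all assumptions made for the example. One block handles one column, so no two blocks ever write the same output location.

    ```cuda
    #include <cfloat>
    #include <cstdio>
    #include <cuda_runtime.h>

    #define THREADS 128  // hypothetical block size; must be a power of two

    // One block per column. C is assumed row-major with dimensions rows x cols.
    __global__ void columnMaxReduce(const float *C, int rows, int cols,
                                    float *maxVal, int *maxRow)
    {
        __shared__ float sVal[THREADS];
        __shared__ int   sIdx[THREADS];

        int col = blockIdx.x;
        int tid = threadIdx.x;

        // Each thread scans a strided subset of the rows in this column.
        float best = -FLT_MAX;
        int bestRow = -1;
        for (int r = tid; r < rows; r += blockDim.x) {
            float v = C[r * cols + col];
            if (v > best) { best = v; bestRow = r; }
        }
        sVal[tid] = best;
        sIdx[tid] = bestRow;
        __syncthreads();

        // Tree reduction in shared memory, carrying the row index with the value.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s && sVal[tid + s] > sVal[tid]) {
                sVal[tid] = sVal[tid + s];
                sIdx[tid] = sIdx[tid + s];
            }
            __syncthreads();
        }

        // A single thread writes the result, so there is no race.
        if (tid == 0) {
            maxVal[col] = sVal[0];
            maxRow[col] = sIdx[0];
        }
    }

    int main()
    {
        const int rows = 4, cols = 3;
        float hC[rows * cols] = {1, 9, 3,  7, 2, 8,  4, 5, 6,  0, 1, 2};
        float hVal[cols];
        int hRow[cols];

        float *dC, *dVal;
        int *dRow;
        cudaMalloc(&dC, sizeof(hC));
        cudaMalloc(&dVal, sizeof(hVal));
        cudaMalloc(&dRow, sizeof(hRow));
        cudaMemcpy(dC, hC, sizeof(hC), cudaMemcpyHostToDevice);

        // The block size must equal THREADS because the shared arrays use it.
        columnMaxReduce<<<cols, THREADS>>>(dC, rows, cols, dVal, dRow);

        cudaMemcpy(hVal, dVal, sizeof(hVal), cudaMemcpyDeviceToHost);
        cudaMemcpy(hRow, dRow, sizeof(hRow), cudaMemcpyDeviceToHost);
        for (int c = 0; c < cols; ++c)
            printf("column %d: max %f at row %d\n", c, hVal[c], hRow[c]);

        cudaFree(dC); cudaFree(dVal); cudaFree(dRow);
        return 0;
    }
    ```

    When several rows tie for the maximum, this sketch does not promise which of the tied rows is reported.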

    Since you need to update two values atomically (the max value and its location), it's not a simple matter of replacing your plain access with a standard atomic function. However, since you are dealing with two adjacent 32-bit quantities, you may be interested in my answer here.
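
    The linked answer is not reproduced on this page, so the following is only a sketch of one common way to update two adjacent 32-bit quantities together: pack them into a single 64-bit word and update it with a compare-and-swap loop. The helper names (pack, unpackValue, atomicMaxWithIndex), the kernel names, and the row-major layout are assumptions for illustration; atomicCAS on unsigned long long, __float_as_uint, and __uint_as_float are standard CUDA intrinsics.

    ```cuda
    #include <cfloat>
    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    // Hypothetical helper: value bits in the high 32 bits, index in the low 32.
    __device__ unsigned long long pack(float val, int idx)
    {
        return ((unsigned long long)__float_as_uint(val) << 32) | (unsigned int)idx;
    }

    __device__ float unpackValue(unsigned long long p)
    {
        return __uint_as_float((unsigned int)(p >> 32));
    }

    // Replace *addr with (val, idx) only if val exceeds the stored value.
    __device__ void atomicMaxWithIndex(unsigned long long *addr, float val, int idx)
    {
        unsigned long long desired = pack(val, idx);
        unsigned long long old = *addr, assumed;
        do {
            assumed = old;
            if (unpackValue(assumed) >= val) break;   // stored max already wins
            old = atomicCAS(addr, assumed, desired);  // retry if someone else wrote
        } while (old != assumed);
    }

    __global__ void initPacked(unsigned long long *packedMax, int cols)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col < cols) packedMax[col] = pack(-FLT_MAX, 0);
    }

    // One thread per matrix element; C is assumed row-major, rows x cols.
    __global__ void columnMaxAtomic(const float *C, int rows, int cols,
                                    unsigned long long *packedMax)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < rows && col < cols)
            atomicMaxWithIndex(&packedMax[col], C[row * cols + col], row);
    }

    int main()
    {
        const int rows = 4, cols = 3;
        float hC[rows * cols] = {1, 9, 3,  7, 2, 8,  4, 5, 6,  0, 1, 2};
        unsigned long long hPacked[cols];

        float *dC;
        unsigned long long *dPacked;
        cudaMalloc(&dC, sizeof(hC));
        cudaMalloc(&dPacked, sizeof(hPacked));
        cudaMemcpy(dC, hC, sizeof(hC), cudaMemcpyHostToDevice);

        initPacked<<<1, 32>>>(dPacked, cols);
        dim3 block(16, 16);
        dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
        columnMaxAtomic<<<grid, block>>>(dC, rows, cols, dPacked);

        cudaMemcpy(hPacked, dPacked, sizeof(hPacked), cudaMemcpyDeviceToHost);
        for (int c = 0; c < cols; ++c) {
            // Unpack on the host: high word is the float's bits, low word the row.
            unsigned int bits = (unsigned int)(hPacked[c] >> 32);
            float v;
            memcpy(&v, &bits, sizeof(v));
            printf("column %d: max %f at row %d\n", c, v, (int)(hPacked[c] & 0xffffffffu));
        }

        cudaFree(dC); cudaFree(dPacked);
        return 0;
    }
    ```

    Because the value and index travel in one 64-bit word, no thread can ever observe a max value paired with another thread's index. On ties, whichever thread's write lands first keeps its index, so the reported row is not deterministic.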

    By the way, I think MATLAB's native matrix multiply on gpuArray should be faster than any matrix-multiply code you write yourself. But it would require the Parallel Computing Toolbox.
