Unable to execute device kernel in CUDA


-上瘾入骨i 2021-01-28 17:37

I am trying to call a device kernel within a global kernel. My global kernel is a Matrix Multiplication and my device kernel is finding the maximum value and the index in each c

1 answer

小鲜肉 (original poster)

2021-01-28 18:02

Apparently you are attempting to find the maximum value in each column, as well as the offset to that value.

But all of your threads in y are writing to the same location for the max value (max[x*2 + 0]) with no synchronization. That is a race condition, and there is no way to predict which thread's write will survive. You should use atomic operations, or another method (e.g. a reduction), to handle multiple threads updating a single max value this way.
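To illustrate the reduction alternative, here is a minimal sketch that assigns one thread block to each column and reduces (value, row index) pairs in shared memory. It is not your code: the names `columnArgmaxReduce`, `columnArgmax` and `kThreads`, the row-major layout, and the separate `maxVal`/`maxIdx` output arrays are all assumptions made for the example.

```cuda
#include <cfloat>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // threads per block; must be a power of two

// One block per column. Assumes a row-major rows x cols matrix (an assumption).
__global__ void columnArgmaxReduce(const float *a, int rows, int cols,
                                   float *maxVal, int *maxIdx)
{
    __shared__ float sVal[kThreads];
    __shared__ int   sIdx[kThreads];
    int col = blockIdx.x;
    int tid = threadIdx.x;

    // Each thread scans a strided subset of the rows and keeps a private best.
    float best = -FLT_MAX;
    int bestIdx = -1;
    for (int r = tid; r < rows; r += blockDim.x) {
        float v = a[r * cols + col];
        if (v > best) { best = v; bestIdx = r; }
    }
    sVal[tid] = best;
    sIdx[tid] = bestIdx;
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sVal[tid + s] > sVal[tid]) {
            sVal[tid] = sVal[tid + s];
            sIdx[tid] = sIdx[tid + s];
        }
        __syncthreads();
    }
    // Only one thread writes each result, so there is no race.
    if (tid == 0) { maxVal[col] = sVal[0]; maxIdx[col] = sIdx[0]; }
}

// Hypothetical host wrapper; CUDA error checking is omitted for brevity.
void columnArgmax(const std::vector<float> &a, int rows, int cols,
                  std::vector<float> &vals, std::vector<int> &idxs)
{
    float *dA, *dVal;
    int *dIdx;
    cudaMalloc(&dA, a.size() * sizeof(float));
    cudaMalloc(&dVal, cols * sizeof(float));
    cudaMalloc(&dIdx, cols * sizeof(int));
    cudaMemcpy(dA, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
    columnArgmaxReduce<<<cols, kThreads>>>(dA, rows, cols, dVal, dIdx);
    vals.resize(cols);
    idxs.resize(cols);
    cudaMemcpy(vals.data(), dVal, cols * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(idxs.data(), dIdx, cols * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dVal); cudaFree(dIdx);
}

int main()
{
    // 4 x 3 test matrix, row-major.
    std::vector<float> a = {1, 9, 3,
                            7, 2, 8,
                            4, 6, 5,
                            0, 3, 2};
    std::vector<float> vals;
    std::vector<int> idxs;
    columnArgmax(a, 4, 3, vals, idxs);
    for (int c = 0; c < 3; ++c)
        printf("column %d: max %.1f at row %d\n", c, vals[c], idxs[c]);
    return 0;
}
```

Because a single thread per block writes the final pair, no atomics are needed. When two rows hold the same maximum, which row index is reported depends on the reduction order.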

Since you need to update two values atomically (the max value and its location), it's not a simple matter of replacing your plain access with a standard atomic function. However, since you are dealing with two adjacent 32-bit quantities, you may be interested in my answer here.
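One way to treat two adjacent 32-bit quantities as a single atomic unit is to pack them into one 64-bit word and update that word with a 64-bit `atomicCAS` loop. The following is a minimal sketch of that idea and is not necessarily identical to the linked answer. The helper names (`pack`, `atomicMaxWithIndex`, `columnMaxAtomic`, `columnArgmaxAtomic`), the row-major layout, and the packed `best` array are assumptions made for the example.

```cuda
#include <cfloat>
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Value bits go in the high 32 bits, the row index in the low 32 bits.
__device__ unsigned long long pack(float v, unsigned int idx)
{
    return ((unsigned long long)__float_as_uint(v) << 32) | idx;
}
__device__ float unpackVal(unsigned long long p)
{
    return __uint_as_float((unsigned int)(p >> 32));
}

// Replace *addr with (v, idx) if v is larger than the stored value.
__device__ void atomicMaxWithIndex(unsigned long long *addr, float v, unsigned int idx)
{
    unsigned long long old = *addr;
    unsigned long long cand = pack(v, idx);
    while (v > unpackVal(old)) {
        unsigned long long assumed = old;
        old = atomicCAS(addr, assumed, cand);  // returns the value seen at addr
        if (old == assumed) break;             // swap succeeded
    }
}

// x indexes columns, y indexes rows; the matrix is row-major (an assumption).
__global__ void columnMaxAtomic(const float *a, int rows, int cols,
                                unsigned long long *best)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)
        atomicMaxWithIndex(&best[col], a[row * cols + col], (unsigned int)row);
}

// Hypothetical host wrapper; CUDA error checking is omitted for brevity.
void columnArgmaxAtomic(const std::vector<float> &a, int rows, int cols,
                        std::vector<float> &vals, std::vector<unsigned int> &idxs)
{
    // Initialise every slot to (-FLT_MAX, 0) so any real entry replaces it.
    float lowest = -FLT_MAX;
    unsigned int bits;
    std::memcpy(&bits, &lowest, sizeof bits);
    std::vector<unsigned long long> best(cols, (unsigned long long)bits << 32);

    float *dA;
    unsigned long long *dBest;
    cudaMalloc(&dA, a.size() * sizeof(float));
    cudaMalloc(&dBest, cols * sizeof(unsigned long long));
    cudaMemcpy(dA, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dBest, best.data(), cols * sizeof(unsigned long long), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    columnMaxAtomic<<<grid, block>>>(dA, rows, cols, dBest);
    cudaMemcpy(best.data(), dBest, cols * sizeof(unsigned long long), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dBest);

    vals.resize(cols);
    idxs.resize(cols);
    for (int c = 0; c < cols; ++c) {
        unsigned int hi = (unsigned int)(best[c] >> 32);
        std::memcpy(&vals[c], &hi, sizeof hi);
        idxs[c] = (unsigned int)(best[c] & 0xffffffffu);
    }
}

int main()
{
    std::vector<float> a = {1, 9, 3,
                            7, 2, 8,
                            4, 6, 5,
                            0, 3, 2};
    std::vector<float> vals;
    std::vector<unsigned int> idxs;
    columnArgmaxAtomic(a, 4, 3, vals, idxs);
    for (int c = 0; c < 3; ++c)
        printf("column %d: max %.1f at row %u\n", c, vals[c], idxs[c]);
    return 0;
}
```

The stored value only ever increases, so a stale first read of `*addr` is harmless: either the candidate is already too small, or the `atomicCAS` fails and the loop retries with the fresh value. With tied maxima, the reported row is whichever thread's swap landed first.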

By the way, I think MATLAB's native matrix multiply on gpuArray should be faster than any matrix multiply code you write. But it would require the Parallel Computing Toolbox.
