I am trying to call a device kernel within a global kernel. My global kernel is a Matrix Multiplication and my device kernel is finding the maximum value and the index in each c
Apparently you are attempting to find the maximum value in each column, as well as the offset to that value.
But all of your threads in y are hammering on the same location for the max value (max[x*2 + 0]). This isn't recommended, as there is no way to sort out a race condition. You should use atomic operations, or other methods (e.g. reduction), to handle multiple threads updating a single max value this way.
Since you need to update two values atomically (the max value and its location), it's not a simple matter of replacing your plain access with a standard atomic function. However, since you are dealing with two adjacent 32-bit quantities, you may be interested in my answer here.
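As a rough illustration of the idea (not necessarily the code in the linked answer), the two 32-bit quantities can be packed into one 64-bit word and updated with a compare-and-swap loop. This is a minimal sketch under several assumptions: the matrix is float and row-major, and the names pack, atomicMaxWithIndex, columnMax, initResult and columnArgmax are hypothetical helpers made up for this example. Error checking is omitted for brevity.

```cuda
#include <cfloat>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: value bits in the upper 32 bits, row index in the lower 32.
__device__ unsigned long long pack(float val, int idx) {
    return ((unsigned long long)__float_as_uint(val) << 32) | (unsigned int)idx;
}

__device__ float unpackVal(unsigned long long p) {
    return __uint_as_float((unsigned int)(p >> 32));
}

// Atomically store (val, idx) if val is greater than the currently stored value.
// Ties keep whichever thread got there first, so the index is not deterministic on ties.
__device__ void atomicMaxWithIndex(unsigned long long *addr, float val, int idx) {
    unsigned long long old = *addr;
    while (val > unpackVal(old)) {
        unsigned long long assumed = old;
        old = atomicCAS(addr, assumed, pack(val, idx));
        if (old == assumed) break;  // our swap succeeded
    }
}

// Start every column at the lowest finite float with a sentinel index of -1.
__global__ void initResult(unsigned long long *result, int cols) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < cols) result[x] = pack(-FLT_MAX, -1);
}

// One thread per element; all threads of a column update that column's packed slot.
__global__ void columnMax(const float *data, int rows, int cols,
                          unsigned long long *result) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < cols && y < rows)
        atomicMaxWithIndex(&result[x], data[y * cols + x], y);
}

// Host wrapper (assumed row-major input): fills h_max and h_idx, one entry per column.
void columnArgmax(const float *h_data, int rows, int cols, float *h_max, int *h_idx) {
    float *d_data;
    unsigned long long *d_res;
    cudaMalloc(&d_data, (size_t)rows * cols * sizeof(float));
    cudaMalloc(&d_res, cols * sizeof(unsigned long long));
    cudaMemcpy(d_data, h_data, (size_t)rows * cols * sizeof(float),
               cudaMemcpyHostToDevice);

    initResult<<<(cols + 255) / 256, 256>>>(d_res, cols);
    dim3 block(16, 16);
    dim3 grid((cols + 15) / 16, (rows + 15) / 16);
    columnMax<<<grid, block>>>(d_data, rows, cols, d_res);

    unsigned long long *h_res =
        (unsigned long long *)malloc(cols * sizeof(unsigned long long));
    cudaMemcpy(h_res, d_res, cols * sizeof(unsigned long long),
               cudaMemcpyDeviceToHost);
    for (int c = 0; c < cols; ++c) {
        unsigned int bits = (unsigned int)(h_res[c] >> 32);
        memcpy(&h_max[c], &bits, sizeof(float));         // reinterpret bits as float
        h_idx[c] = (int)(h_res[c] & 0xFFFFFFFFu);
    }
    free(h_res);
    cudaFree(d_data);
    cudaFree(d_res);
}
```

Calling columnArgmax(m, rows, cols, maxVals, maxRows) on a host matrix m then yields the per-column maximum and its row. With many rows per column the atomic contention can become a bottleneck, which is one reason a reduction may be the better choice for large matrices.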
By the way, I think MATLAB's native matrix multiply on gpuArray should be faster than any matrix multiply code you write. But it would require the Parallel Computing Toolbox.