I\'m writing a CUDA kernel which involves calculating the maximum value on a given matrix and I\'m evaluating possibilities. The best way I could find is:
Forcing ev
This is the usual way to perform reductions in CUDA
Within each block,
1) Keep a running reduced value in shared memory for each thread. Hence each thread will read n (I personally favor between 16 and 32), values from global memory and updates the reduced value from these
2) Perform the reduction algorithm within the block to get one final reduced value per block.
This way you will not need more shared memory than (number of threads) * sizeof (datatye) bytes.
Since each block a reduced value, you will need to perform a second reduction pass to get the final value.
For example, if you are launching 256 threads per block, and are reading 16 values per thread, you will be able to reduce (256 * 16 = 4096) elements per block.
So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.
You will probably need a third pass for cases when the number of elements > (4096)^2 for this configuration.
You will have to take care that the global memory reads are coalesced. You can not coalesce global memory writes, but that is one performance hit you need to take.