CUDA: reduction or atomic operations?

后端 未结 7 1390
眼角桃花
眼角桃花 2021-01-14 19:00

I\'m writing a CUDA kernel which involves calculating the maximum value on a given matrix and I\'m evaluating possibilities. The best way I could find is:

Forcing ev

相关标签:
7条回答
  • 2021-01-14 19:07

    NVIDIA has a CUDA demo that does reduction: here. There's a whitepaper that goes along with it that explains some motivations behind the design.

    0 讨论(0)
  • 2021-01-14 19:09

    The atomicAdd function could also be used, but it is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/

    0 讨论(0)
  • 2021-01-14 19:10

    You may also want to use the reduction routines that comes w/ CUDA Thrust which is a part of CUDA 4.0 or available here.

    The library is written by a pair of nVidia engineers and compares favorably with heavily hand optimized code. I believe there is also some auto-tuning of grid/block size going on.

    You can interface with your own kernel easily by wrapping your raw device pointers.

    This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.

    0 讨论(0)
  • 2021-01-14 19:19

    Actually, the problem you described is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is layed out contiguously in memory). It's just a reduction over a sequence of values, being all matrix elements in whatever order they appear in memory.

    Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days - as far as I can tell - is the excellent libcub by nVIDIA's Duane Merill. Here is the documentation on its device-wide Maximum-calculating function.

    Note, though, that unless the matrix is small, for most of the computation it will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swatch of the matrix (or rather, a large strided swath) will it write its local maximum anywhere - typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call once every obscenely large number of matrix element reads - tens of thousands if not more.

    0 讨论(0)
  • 2021-01-14 19:19

    If you have K20 or Titan, I suggest dynamic parallelism: lunching a single thread kernel, which lunches #items worker kernel threads to produce data, then lunches #items/first-round-reduction-factor threads for first round reduction, and keep lunching till result coming out.

    0 讨论(0)
  • 2021-01-14 19:21

    I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.

    0 讨论(0)
提交回复
热议问题