CUDA: reduction or atomic operations?

后端未结

关注

 7  1390

I\'m writing a CUDA kernel which involves calculating the maximum value on a given matrix and I\'m evaluating possibilities. The best way I could find is:

Forcing ev

相关标签:

7条回答

清酒与你

2021-01-14 19:07

NVIDIA has a CUDA demo that does reduction: here. There's a whitepaper that goes along with it that explains some motivations behind the design.

0 讨论(0)
发布评论:

提交评论
- 加载中...
一生所求

2021-01-14 19:09

The atomicAdd function could also be used, but it is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/

0 讨论(0)
发布评论:

提交评论
- 加载中...
暗喜

2021-01-14 19:10

You may also want to use the reduction routines that comes w/ CUDA Thrust which is a part of CUDA 4.0 or available here.

The library is written by a pair of nVidia engineers and compares favorably with heavily hand optimized code. I believe there is also some auto-tuning of grid/block size going on.

You can interface with your own kernel easily by wrapping your raw device pointers.

This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.

0 讨论(0)
发布评论:

提交评论
- 加载中...
旧巷少年郎

2021-01-14 19:19

Actually, the problem you described is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is layed out contiguously in memory). It's just a reduction over a sequence of values, being all matrix elements in whatever order they appear in memory.

Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days - as far as I can tell - is the excellent libcub by nVIDIA's Duane Merill. Here is the documentation on its device-wide Maximum-calculating function.

Note, though, that unless the matrix is small, for most of the computation it will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swatch of the matrix (or rather, a large strided swath) will it write its local maximum anywhere - typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call once every obscenely large number of matrix element reads - tens of thousands if not more.

0 讨论(0)
发布评论:

提交评论
- 加载中...
耶瑟儿～

2021-01-14 19:19

If you have K20 or Titan, I suggest dynamic parallelism: lunching a single thread kernel, which lunches #items worker kernel threads to produce data, then lunches #items/first-round-reduction-factor threads for first round reduction, and keep lunching till result coming out.

0 讨论(0)
发布评论:

提交评论
- 加载中...
心在旅途

2021-01-14 19:21

I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.

0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页