问题
First of all, here is the link to the algorithm:
GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA.
In order to avoid bank conflicts, padding is added to the shared memory array every NUM_BANKS (i.e., 32 for devices of computability 2.x) elements. This is done by (as in Figure 39-5):
int ai = offset*(2*thid+1)-1
int bi = offset*(2*thid+2)-1
ai += ai/NUM_BANKS
bi += ai/NUM_BANKS
temp[bi] += temp[ai]
I don't understand how ai/NUM_BANKS is equivalent to the macro:
#define NUM_BANKS 16
#define LOG_NUM_BANKS 4
#define CONFLICT_FREE_OFFSET(n) \
((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
Isn't it equal to
n >> LOG_NUM_BANKS
Any help is appreciated. Thanks
回答1:
I wrote that code and co-wrote the article, and I request that you use the article only for learning about scan algorithms, and do not use the code in it. It was written when CUDA was new, and I was new to CUDA. If you use a modern implementation of scan in CUDA you don't need any bank conflict avoidance.
If you want to do scans the easy way, use thrust::inclusive_scan or thrust::exclusive_scan.
If you really want to implement a scan, refer to more recent articles such as this one [1]. Or for a real opus with faster code but that will require a bit more study, this one [2]. Or read Sean Baxter's tutorial (though the latter doesn't include citations of the seminal work on the scan algorithm).
[1] Shubhabrata Sengupta, Mark Harris, Michael Garland, and John D. Owens. "Efficient Parallel Scan Algorithms for many-core GPUs". In Jakub Kurzak, David A. Bader, and Jack Dongarra, editors, Scientific Computing with Multicore and Accelerators, Chapman & Hall/CRC Computational Science, chapter 19, pages 413–442. Taylor & Francis, January 2011. http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1041
[2] Merrill, D. and Grimshaw, A. Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia. Dec. 2009.
来源:https://stackoverflow.com/questions/9689185/conflict-free-offset-macro-used-in-the-parallel-prefix-algorithm-from-gpu-gems-3