Question
Most parallel-reduction algorithms use shared (local) memory; this holds for Nvidia, AMD, Intel, and so on.
But what if a device doesn't have shared (local) memory?
How can I do the reduction then?
If I use the same algorithm but store the temporary values in global memory, will it work fine?
Answer 1:
Thinking about it, my comment already was the complete answer.
Yes, you can use global memory as a replacement for local memory, but:
- you have to allocate enough global memory for all workgroups and assign each workgroup its own chunk of that memory (with local memory you only specify as much memory as a single workgroup needs, and each workgroup allocates that amount for itself)
- you have to use CLK_GLOBAL_MEM_FENCE instead of CLK_LOCAL_MEM_FENCE
- you will lose a significant amount of performance
If I have time this evening, I will post a simple example.
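In the meantime, a minimal sketch of such a kernel could look like the following (kernel and buffer names are made up for illustration; it assumes the host allocates one float of scratch per work-item and one partial-sum slot per workgroup, and that the workgroup size is a power of two):

```c
// OpenCL C kernel sketch (hypothetical names, not the promised example).
// `scratch` has get_global_size(0) floats, `partial` has one float per
// workgroup; both are allocated by the host.
__kernel void reduce_global(__global const float *in,
                            __global float *scratch,
                            __global float *partial,
                            const unsigned int n)
{
    unsigned int gid = get_global_id(0);
    unsigned int lid = get_local_id(0);
    unsigned int grp = get_group_id(0);
    unsigned int lsz = get_local_size(0);

    // Each workgroup gets its own chunk of the global scratch buffer,
    // in place of a __local array.
    __global float *chunk = scratch + grp * lsz;

    chunk[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_GLOBAL_MEM_FENCE);          // global fence instead of local

    // Standard tree reduction, but operating on global memory.
    for (unsigned int s = lsz / 2; s > 0; s >>= 1) {
        if (lid < s)
            chunk[lid] += chunk[lid + s];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }

    if (lid == 0)
        partial[grp] = chunk[0];            // one partial sum per workgroup
}
```

Note that barrier() still only synchronizes work-items within one workgroup, so the per-workgroup partial sums in `partial` must be combined afterwards, either by a second kernel pass or on the host.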
Answer 2:
If the device supports OpenCL 2.0, then work_group_reduce can be used:

gentype work_group_reduce_<op>(gentype x)

The <op> in work_group_reduce_<op>, work_group_scan_exclusive_<op> and work_group_scan_inclusive_<op> defines the operator and can be add, min, or max.
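For example, a kernel using the add variant of the built-in could look like this (a sketch with assumed names; it expects the input length to be a multiple of the global size, and like the manual approach it still leaves one partial sum per workgroup to be combined afterwards):

```c
// OpenCL C 2.0 kernel sketch: work_group_reduce_add performs the whole
// in-workgroup reduction, so the programmer declares no local memory.
// All work-items in the workgroup must reach the call.
__kernel void reduce_builtin(__global const float *in,
                             __global float *partial)
{
    float sum = work_group_reduce_add(in[get_global_id(0)]);
    if (get_local_id(0) == 0)
        partial[get_group_id(0)] = sum;  // one result per workgroup
}
```

Whether the implementation uses local memory, registers, or something else under the hood is then the driver's choice, which is exactly what you want on a device without (fast) local memory.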
Source: https://stackoverflow.com/questions/32393208/opencl-parallel-reduction-without-local-memory