OpenCL - parallel reduction without local memory

Submitted by 不羁的心 on 2019-12-11 11:12:47

Question


Most algorithms for parallel reduction use shared (local) memory.

Nvidia, AMD, Intel and so on.

But what if a device doesn't have shared (local) memory?

How can I do it?

If I use the same algorithm but store the temporary values in global memory, will it work fine?


Answer 1:


If I think about it, my comment already was the complete answer.

Yes, you can use global memory as a replacement for local memory but:

  • you have to allocate enough global memory for all work-groups and assign each work-group its own chunk of that memory (with local memory, you only have to specify as much memory as a single work-group needs, and each work-group allocates that amount for itself)
  • you have to use CLK_GLOBAL_MEM_FENCE instead of CLK_LOCAL_MEM_FENCE
  • you will lose a significant amount of performance

If I have time this evening, I will post a simple example.
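In the meantime, here is a minimal sketch of that approach (not the answer author's posted code): a standard tree reduction in which the per-group temporaries live in a global scratch buffer instead of local memory. The kernel name, the scratch/partial_sums buffer names and the power-of-two work-group size are assumptions for illustration; the host has to allocate scratch with one float per work-item.

    // Sketch only: the usual local-memory tree reduction, with the local buffer
    // replaced by a per-work-group slice of a __global scratch buffer.
    __kernel void reduce_sum_global(__global const float *input,
                                    __global float *scratch,       // global_size floats, host-allocated
                                    __global float *partial_sums,  // one float per work-group
                                    const uint n)
    {
        uint gid   = get_global_id(0);
        uint lid   = get_local_id(0);
        uint group = get_group_id(0);
        uint lsize = get_local_size(0);   // assumed to be a power of two

        // Each work-group works only on its own chunk of the scratch buffer.
        __global float *my_scratch = scratch + group * lsize;

        my_scratch[lid] = (gid < n) ? input[gid] : 0.0f;
        barrier(CLK_GLOBAL_MEM_FENCE);

        // Same tree reduction as the local-memory version, but because the
        // temporaries are in global memory the fence must be CLK_GLOBAL_MEM_FENCE.
        for (uint offset = lsize / 2; offset > 0; offset >>= 1) {
            if (lid < offset)
                my_scratch[lid] += my_scratch[lid + offset];
            barrier(CLK_GLOBAL_MEM_FENCE);
        }

        if (lid == 0)
            partial_sums[group] = my_scratch[0];
    }

As with the local-memory variant, the per-work-group partial sums still have to be combined afterwards, either on the host or with a second kernel launch.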




Answer 2:


If the device supports OpenCL 2.0, then work_group_reduce can be used:

gentype work_group_reduce_<op>(gentype x)

The <op> in work_group_reduce_<op>, work_group_scan_exclusive_<op> and work_group_scan_inclusive_<op> defines the operator and can be add, min or max.
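For example, a sum reduction using this built-in work-group function might look like the following sketch (the kernel and buffer names are assumptions; the program has to be built for OpenCL 2.0, e.g. with -cl-std=CL2.0):

    // Requires an OpenCL 2.0 device; build the program with -cl-std=CL2.0.
    __kernel void reduce_sum_wg(__global const float *input,
                                __global float *partial_sums)
    {
        // Each work-item contributes one element; work_group_reduce_add
        // combines them across the whole work-group without any explicit
        // local memory in the kernel source.
        float x   = input[get_global_id(0)];
        float sum = work_group_reduce_add(x);

        // Every work-item receives the result; let one of them write it out.
        if (get_local_id(0) == 0)
            partial_sums[get_group_id(0)] = sum;
    }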



Source: https://stackoverflow.com/questions/32393208/opencl-parallel-reduction-without-local-memory
