I am thinking about reworking my GPU OpenCL kernel to speed things up. The problem is there is a lot of global memory that is not coalesced and fetches are really bringing d
I'm not sure I fully understand your question, but if you have a lot of global memory accesses and the same data is re-used, then use local memory.
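Note: with a small local work size, less data is shared within the work-group, so local memory gives little benefit; with a large local work size, fewer threads run in parallel. So you need to pick the size that works best for your kernel.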
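As a rough illustration of what I mean, here is a minimal sketch of staging re-used global data in local memory. It is not your kernel: the kernel name `smooth3`, the 3-point average, and `TILE` = 64 are all made up for the example. It assumes a 1D NDRange whose local work size equals `TILE` and whose global size is a multiple of `TILE`.

```c
// Hypothetical example kernel: 3-point moving average of `in`, written to `out`.
// Assumption: the host launches it with local work size == TILE.
#define TILE 64

__kernel void smooth3(__global const float *in,
                      __global float *out,
                      const int n)
{
    // One tile per work-group, plus one halo element on each side.
    __local float tile[TILE + 2];

    const int gid = get_global_id(0);
    const int lid = get_local_id(0);
    const int group_start = get_group_id(0) * TILE;

    // Consecutive work-items read consecutive addresses, so this
    // global load can be coalesced. Out-of-range elements count as 0.
    tile[lid + 1] = (gid < n) ? in[gid] : 0.0f;

    // The first and last work-items also fetch the halo elements.
    if (lid == 0) {
        const int left = group_start - 1;
        tile[0] = (left >= 0) ? in[left] : 0.0f;
    }
    if (lid == TILE - 1) {
        const int right = group_start + TILE;
        tile[TILE + 1] = (right < n) ? in[right] : 0.0f;
    }

    // Every work-item in the group reaches this barrier (no early return),
    // so the whole tile is visible before anyone reads it.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each input element is now read up to three times from local
    // memory instead of three times from global memory.
    if (gid < n) {
        out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
    }
}
```

Changing `TILE` (and the matching local work size on the host) is how you would experiment with the trade-off in the note above; which value is best depends on your device, so it is worth timing a few sizes.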