How can I enforce CUDA global memory coherence without declaring pointer as volatile?
问题 I'll first do some contextualization. I'm trying to implement a non-blocking work stealing method using deques in CUDA. The deques (aDeques) are in a block-segmented array in global memory and the popWork() device function has the objective of popping work to feed threads. In addition of the global deques, each block has a stack in shared memory (aLocalStack) where it can locally work. The pop occurs in 3 levels. First attempt is in the shared stack, second attempt is in the deque owned by