Processing Shared Work Queue Using CUDA Atomic Operations and Grid Synchronization
问题 I’m trying to write a kernel whose threads iteratively process items in a work queue. My understanding is that I should be able to do this by using atomic operations to manipulate the work queue (i.e., grab work items from the queue and insert new work items into the queue), and using grid synchronization via cooperative groups to ensure all threads are at the same iteration (I ensure the number of thread blocks doesn’t exceed the device capacity for the kernel). However, sometimes I observe