CUDA, mutex and atomicCAS()

后端 未结 1 559
有刺的猬
有刺的猬 2020-12-05 12:20

Recently I started to develop on CUDA and faced with the problem with atomicCAS(). To do some manipulations with memory in device code I have to create a mutex, so that only

相关标签:
1条回答
  • 2020-12-05 12:48

    The loop in question

    do 
    {
        atomicCAS(mutex, 0, 1 + i);
    }
    while (*mutex != i + 1);
    

    would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.

    However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into groups of 32, commonly referred to as warps. Threads in the same warp will execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads will wait (i.e. sleeps) for the divergent threads to finish. [1]

    Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loops forever.

    A potential solution is to make a local copy of the shared resource in question, let 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it will be shared by the threads belonging to the same block but not other blocks. We can use __syncthreads() to fine-control the access of this variable by the member threads.

    [1] CUDA Best Practices Guide - Branching and Divergence

    Avoid different execution paths within the same warp.

    Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

    0 讨论(0)
提交回复
热议问题