Does __syncthreads() synchronize all threads in the grid?

Asked by 栀梦 on 2021-02-02 05:43

...or just the threads in the current warp or block?

Also, when the threads in a particular block encounter (in the kernel) the following line

    __shared__ float srdMem[128];

will this space be declared once per block?
5 Answers
  • 2021-02-02 06:21

    __syncthreads() waits until all threads within the same block have reached the barrier. In other words, every warp that belongs to the thread block must reach the statement before any thread can proceed past it.

    If you declare shared memory in a kernel, the array will only be visible to one thread block, so each block gets its own instance of the shared memory.

  • 2021-02-02 06:27

    Existing answers have done a great job explaining how __syncthreads() works (it allows intra-block synchronization); I just want to add that there are now newer methods for inter-block synchronization. Since CUDA 9.0, "Cooperative Groups" have been available, which allow synchronizing an entire grid of blocks (as explained in the CUDA Programming Guide). This achieves the same functionality as launching a new kernel (as described in another answer) but can usually do so with lower overhead and more readable code.
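
    A minimal sketch of a grid-wide barrier with Cooperative Groups (the kernel name twoPhaseKernel and the buffers are made up for illustration; this assumes a device of compute capability 6.0+, compilation with -rdc=true, and a grid small enough to be co-resident on the device, which cooperative launch requires):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void twoPhaseKernel(int *tmp, int *out, int n)
    {
        cg::grid_group grid = cg::this_grid();
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if (idx < n)
            tmp[idx] = idx * idx;         // phase 1: every block writes

        grid.sync();                      // barrier across ALL blocks in the grid

        if (idx < n)
            out[idx] = tmp[n - 1 - idx];  // phase 2: read another block's write
    }

    // Such a kernel must be launched cooperatively, e.g.:
    //   void *args[] = { &d_tmp, &d_out, &n };
    //   cudaLaunchCooperativeKernel((void *)twoPhaseKernel, blocks, threads, args);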

  • 2021-02-02 06:28

    The __syncthreads() command is a block-level synchronization barrier. That means it is safe to use only when all threads in a block reach the barrier. It is also possible to use __syncthreads() in conditional code, but only when all threads evaluate the condition identically; otherwise the execution is likely to hang or produce unintended side effects.

    Example of using __syncthreads(): (source)

    #define THREADS_PER_BLOCK 256

    __global__ void globFunction(int *arr, int N)
    {
        __shared__ int local_array[THREADS_PER_BLOCK];  // per-block shared memory cache
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // calculate a per-thread result (a placeholder computation for illustration)
        int result = (idx < N) ? arr[idx] * 2 : 0;
        local_array[threadIdx.x] = result;

        // synchronize so every thread's write to the shared cache is visible
        __syncthreads();

        // read the result written by a neighboring thread in the same block
        int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];

        // write the value back to global memory
        if (idx < N)
            arr[idx] = val;
    }
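
    A matching launch (a sketch; d_arr is assumed to be a device buffer of N ints) rounds the grid size up so every element is covered:

    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    globFunction<<<blocks, THREADS_PER_BLOCK>>>(d_arr, N);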
    

    To synchronize all threads in a grid, there is currently no native API call within a single kernel launch. One way of synchronizing threads at the grid level is to use consecutive kernel calls: at that point all threads end and start again from the same point, so they are all synchronized. This is also commonly called CPU synchronization or implicit synchronization.

    Example of using this technique (source): [figure: "CPU synchronization"]
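
    A minimal sketch of the same idea (K1, K2, and the buffer names are made up): kernels issued to the same stream execute in order, so every thread of K1 finishes before any thread of K2 starts.

    __global__ void K1(float *data, int n)                     // phase 1: write
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            data[idx] = 0.5f * idx;
    }

    __global__ void K2(const float *data, float *out, int n)   // phase 2: read
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = data[n - 1 - idx];  // safe: K1 has fully completed
    }

    // host side
    int threads = 256, blocks = (n + threads - 1) / threads;
    K1<<<blocks, threads>>>(d_data, n);         // same (default) stream
    K2<<<blocks, threads>>>(d_data, d_out, n);  // starts only after K1 ends
    cudaDeviceSynchronize();                    // optionally block the CPU too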

    Regarding the second question: yes, the declaration reserves the specified amount of shared memory per block, and each block gets its own instance. Take into account that the quantity of available shared memory is measured per SM, so one should be very careful how shared memory is used along with the launch configuration.
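
    For instance, both limits can be queried at runtime (a standalone sketch using the CUDA runtime API):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // A block using close to the per-SM amount limits how many blocks
        // can be resident on one SM at a time (occupancy).
        printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
        return 0;
    }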

    0 讨论(0)
  • 2021-02-02 06:32

    I agree with all the answers here, but I think we are missing one important point regarding the first question. I am not addressing the second question, as it was answered perfectly above.

    Execution on a GPU happens in units of warps. A warp is a group of 32 threads, and at any one instant each thread of a particular warp executes the same instruction. If you allocate 128 threads in a block, that is 128 / 32 = 4 warps for the GPU.
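
    This can be made visible with a tiny sketch (warpInfo is a made-up name; warpSize is the built-in CUDA variable, 32 on current hardware):

    #include <cstdio>

    __global__ void warpInfo()
    {
        int warpId = threadIdx.x / warpSize;  // which warp within the block
        int laneId = threadIdx.x % warpSize;  // position within that warp
        if (laneId == 0)
            printf("block %d: warp %d begins at thread %d\n",
                   blockIdx.x, warpId, threadIdx.x);
    }

    // warpInfo<<<1, 128>>>();  // prints four warps: 0, 1, 2, 3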

    Now the question becomes: "If all threads in a warp execute the same instruction, why is synchronization needed?" The answer is that we need to synchronize the warps that belong to the SAME block. __syncthreads() does not synchronize the threads within a warp, as they are already synchronized; it synchronizes the warps that belong to the same block.

    That is why the answer to your question is: __syncthreads() does not synchronize all threads in a grid, only the threads belonging to one block, since each block executes independently.

    If you want to synchronize a grid, divide your kernel K into two kernels (K1 and K2) and launch both. They will be synchronized: K2 will be executed after K1 finishes.

  • 2021-02-02 06:43

    In order to provide further details, aside from the other answers, quoting seibert:

    More generally, __syncthreads() is a barrier primitive designed to protect you from read-after-write memory race conditions within a block.

    The rules of use are pretty simple:

    1. Put a __syncthreads() after the write and before the read when there is a possibility of a thread reading a memory location that another thread has written to.

    2. __syncthreads() is only a barrier within a block, so it cannot protect you from read-after-write race conditions in global memory unless the only possible conflict is between threads in the same block. __syncthreads() is pretty much always used to protect shared memory read-after-write.

    3. Do not use a __syncthreads() call in a branch or a loop until you are sure every single thread will reach the same __syncthreads() call. This can sometimes require that you break your if-blocks into several pieces to put __syncthreads() calls at the top level, where all threads (including those which failed the if predicate) will execute them (see the sketch after this list).

    4. When looking for read-after-write situations in loops, it helps to unroll the loop in your head when figuring out where to put __syncthreads() calls. For example, you often need an extra __syncthreads() call at the end of the loop if there are reads and writes from different threads to the same shared memory location in the loop.

    5. __syncthreads() does not mark a critical section, so don’t use it like that.

    6. Do not put a __syncthreads() at the end of a kernel call. There’s no need for it.

    7. Many kernels do not need __syncthreads() at all because two different threads never access the same memory location.
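
    A minimal sketch of rule 3 (the kernel and buffer names are made up): the broken version calls __syncthreads() inside a predicate that not every thread satisfies; the fix splits the if-block so the barrier sits at the top level.

    #define BLOCK 128

    // Broken: threads with idx >= n never reach the barrier (undefined behavior):
    //     if (idx < n) {
    //         s[threadIdx.x] = g[idx];
    //         __syncthreads();
    //         g[idx] = s[BLOCK - 1 - threadIdx.x];
    //     }

    __global__ void reverseInBlock(float *g, int n)
    {
        __shared__ float s[BLOCK];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // piece 1: every thread writes (out-of-range threads pad with 0)
        s[threadIdx.x] = (idx < n) ? g[idx] : 0.0f;

        __syncthreads();  // top level: reached by every thread of the block

        // piece 2: only in-range threads write back
        if (idx < n)
            g[idx] = s[BLOCK - 1 - threadIdx.x];
    }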
