I want to implement an inter-block barrier in CUDA, but I've run into a serious problem:
I cannot figure out why my code does not work.
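Stripped down, the kernel is essentially this (a simplified sketch: the per-block reduction is elided, the parameter order and the name data are approximate, and count points to a global counter that the host sets to the number of blocks before the launch):

struct Barrier {
    int *count;                              // global counter, set to gridDim.x before the launch
};

// Simplified sketch of my kernel; names and parameter order are approximate.
__global__ void sum(int *data, int *sum, int *cache, Barrier barrier)
{
    int partial = 0;                         // ... per-block reduction of data elided ...

    if (threadIdx.x == 0) {
        cache[blockIdx.x] = partial;         // publish this block's result
        atomicSub(barrier.count, 1);         // tell the other blocks this one is done
    }

    while (*barrier.count != 0)              // wait for all the other blocks
        ;                                    // (this is the part that doesn't work)

    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int total = 0;
        for (int i = 0; i < (int)gridDim.x; ++i)
            total += cache[i];               // block 0 adds up all the partial sums
        *sum = total;
    }
}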
Block-to-block synchronization is possible. See this paper.
The paper doesn't go into great detail on how it works, but it relies on __syncthreads() to create a pause barrier for the current block while that block waits for the other blocks to reach the sync point.
One item that isn't noted in the paper is that the sync is only possible if the number of blocks is small enough (or the number of SMs is large enough) that every block of the grid can be resident on the GPU at the same time. E.g. if you have 4 SMs and are trying to sync 5 blocks, the kernel will deadlock.
With their approach, I've been able to spread a long serial task among many blocks, easily saving 30% time over a single-block approach, i.e. the block sync worked for me.
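For illustration, here is a much-simplified sketch of the idea (my own illustrative names and code, not the paper's): every block announces its arrival with an atomic add, and then one thread per block spins on a volatile counter until all blocks have arrived.

__device__ volatile int g_arrived = 0;       // number of blocks that have reached the barrier

// Naive single-use grid-wide barrier. Only safe when every block of the grid
// is resident on the GPU at the same time (see the caveat above), and the
// counter must be reset before the barrier can be used again.
__device__ void grid_barrier(int numBlocks)
{
    __syncthreads();                         // the whole block finishes its work first

    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_arrived, 1);     // announce this block's arrival
        while (g_arrived < numBlocks)        // volatile read, so the loop isn't optimized away
            ;
    }

    __syncthreads();                         // release the rest of the block
}

Every block calls grid_barrier(gridDim.x); a __threadfence() before the atomicAdd is needed if data written before the barrier must be visible to the other blocks afterwards.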
Looks like a compiler optimization issue. I'm not good at reading PTX code, but it seems the compiler has omitted the while-loop altogether (even when compiled with -O0):
.loc 3 41 0
cvt.u64.u32 %rd7, %ctaid.x; // Save blockIdx.x to rd7
ld.param.u64 %rd8, [__cudaparm__Z3sumPiS_S_7Barrier_cache];
mov.s32 %r8, %ctaid.x; // Now calculate output address
mul.wide.u32 %rd9, %r8, 4;
add.u64 %rd10, %rd8, %rd9;
st.global.s32 [%rd10+0], %r5; // Store result to cache[blockIdx.x]
.loc 17 128 0
ld.param.u64 %rd11, [__cudaparm__Z3sumPiS_S_7Barrier_barrier+0]; // Get *count to rd11
mov.s32 %r9, -1; // put -1 to r9
atom.global.add.s32 %r10, [%rd11], %r9; // Do AtomicSub, storing the result to r10 (will be unused)
cvt.u32.u64 %r11, %rd7; // Put blockIdx.x saved in rd7 to r11
mov.u32 %r12, 0; // Put 0 to r12
setp.ne.u32 %p3, %r11, %r12; // p3 = (blockIdx.x != 0)
@%p3 bra $Lt_0_5122; // skip the final summation unless blockIdx.x == 0
ld.param.u64 %rd12, [__cudaparm__Z3sumPiS_S_7Barrier_sum]; // Load the sum pointer
ld.global.s32 %r13, [%rd12+0]; // Load *sum
mov.s64 %rd13, %rd8; // Copy the cache pointer to rd13
mov.s32 %r14, 0; // r14 = 0 (start of block 0's summation)
In CPU code, such behavior is prevented by declaring the variable with the volatile qualifier. But even if we declare count as __device__ int count (and change the code accordingly), adding the volatile specifier just breaks compilation, with errors like argument of type "volatile int *" is incompatible with parameter of type "void *", because the atomic functions won't accept a pointer to volatile.
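One workaround (a sketch I haven't verified against your exact code) is to leave the variable non-volatile, so the atomic functions still accept its address, and cast to volatile only at the polling read:

__device__ int count;                        // host sets this to gridDim.x before the launch

__device__ void block_barrier()
{
    __syncthreads();                         // whole block finishes its work first

    if (threadIdx.x == 0)
        atomicSub(&count, 1);                // atomicSub() wants a plain int*, so no volatile here

    // Cast to volatile only for the polling read, so the compiler cannot hoist
    // the load out of the loop and delete it.
    while (*(volatile int *)&count != 0)
        ;

    __syncthreads();
}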
I suggest looking at the threadFenceReduction example from the CUDA SDK. There they are doing pretty much the same thing as you do, but the block that does the final summation is chosen at runtime rather than being predefined, and the while-loop is eliminated, because a spin-lock on a global variable should be very slow.
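From memory, the pattern there is roughly the following (a sketch with placeholder names, not the SDK's exact code; the real per-block reduction is replaced by a trivial stand-in): every block publishes its partial result, issues a __threadfence(), and atomically takes a ticket; whichever block draws the last ticket knows all partial results are visible and does the final summation itself.

__device__ unsigned int blocksDone = 0;      // must be zeroed before the launch

__global__ void sum(const int *data, int *partial, int *result)
{
    // Stand-in for the real per-block reduction: publish this block's partial sum.
    if (threadIdx.x == 0)
        partial[blockIdx.x] = data[blockIdx.x];

    __shared__ bool amLast;
    __threadfence();                         // make the partial result visible to other blocks

    if (threadIdx.x == 0) {
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        amLast = (ticket == gridDim.x - 1);  // true only in the last block to finish
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0) {
        int total = 0;
        for (int i = 0; i < (int)gridDim.x; ++i)
            total += partial[i];             // the last block does the final summation
        *result = total;
        blocksDone = 0;                      // reset so the kernel can be launched again
    }
}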
Unfortunately, what you want to achieve (inter-block communication/synchronization) isn't strictly possible in CUDA. The CUDA programming guide states that "thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series." The reason for this restriction is to allow flexibility in the thread block scheduler, and to allow the code to agnostically scale with the number of cores. The only supported inter-block synchronization method is to launch another kernel: kernel launches (within the same stream) are implicit synchronization points.
Your code violates the block independence rule because it implicitly assumes that your kernel's thread blocks execute concurrently (cf. in parallel). But there's no guarantee that they do. To see why this matters to your code, let's consider a hypothetical GPU with only one core. We'll also assume that you only want to launch two thread blocks. Your spinloop kernel will actually deadlock in this situation. If thread block zero is scheduled on the core first, it will loop forever when it gets to the barrier, because thread block one never has a chance to update the counter. Because thread block zero is never swapped out (thread blocks execute to their completion) it starves thread block one of the core while it spins.
Some folks have tried schemes such as yours and have seen success because the scheduler happened to serendipitously schedule blocks in such a way that the assumptions worked out. For example, there was a time when launching as many thread blocks as a GPU has SMs meant that the blocks were truly executed concurrently. But they were disappointed when a change to the driver or CUDA runtime or GPU invalidated that assumption, breaking their code.
For your application, try to find a solution which doesn't depend on inter-block synchronization, because (barring a significant change to the CUDA programming model) it just isn't possible.
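For example (a sketch with placeholder names), the usual way to get a partial-sum / final-sum structure without any in-kernel barrier is to split the work into two launches; the boundary between them is the grid-wide synchronization point.

// Each block reduces one 256-element slice of the input into partial[blockIdx.x].
__global__ void partialSums(const int *in, int *partial, int n)
{
    __shared__ int s[256];                   // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];
}

// A single thread adds up the per-block partials.
__global__ void finalSum(const int *partial, int *out, int numPartials)
{
    int total = 0;
    for (int i = 0; i < numPartials; ++i)
        total += partial[i];
    *out = total;
}

// Host side: two launches in the same (default) stream. finalSum cannot start
// until every block of partialSums has finished; that ordering is the
// inter-block barrier.
void reduceOnDevice(const int *d_in, int *d_partial, int *d_out, int n, int numBlocks)
{
    partialSums<<<numBlocks, 256>>>(d_in, d_partial, n);
    finalSum<<<1, 1>>>(d_partial, d_out, numBlocks);
}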