In CUDA, each thread knows its block index in the grid and thread index within the block. But two important values do not seem to be explicitly available to it:
Note: This answer has been heavily edited.
It is very tempting to try and avoid the computation altogether - as these two values seem to already be available if you look under the hood.
You see, nVIDIA GPUs have special registers which your (compiled) code can read to access various kinds of useful information. One such register holds threadIdx.x
; another holds blockDim.x
; another - the clock tick count; and so on. C++ as a language does not have these exposed, obviously; and, in fact, neither does CUDA. However, the intermediary representation into which CUDA code is compiled, named PTX, does expose these special registers (since PTX 1.3, i.e. with CUDA versions >= 2.1).
Two of these special registers are %warpid
and %laneid
. Now, CUDA supports inlining PTX code within CUDA code with the asm
keyword - just like it can be used for host-side code to emit CPU assembly instructions directly. With this mechanism one can use these special registers:
__forceinline__ __device__ unsigned lane_id()
{
unsigned ret;
asm volatile ("mov.u32 %0, %laneid;" : "=r"(ret));
return ret;
}
__forceinline__ __device__ unsigned warp_id()
{
// this is not equal to threadIdx.x / 32
unsigned ret;
asm volatile ("mov.u32 %0, %warpid;" : "=r"(ret));
return ret;
}
... but there are two problems here.
The first problem - as @Patwie suggests - is that %warp_id
does not give you what you actually want - it's not the index of the warp in the context of the grid, but rather in the context of the physical SM (which can hold so many warps resident at a time), and those two are not the same. So don't use %warp_id
.
As for %lane_id
, it does give you the correct value, but it's misleadingly non-performant: Even though it's a "register", it's not like the regular registers in your register file, with 1-cycle access latency. It's a special register, which in the actual hardware is retrieved using an S2R instruction, which can exhibit long latency.
Bottom line: Just compute the warp ID and thread ID from the thread ID. We can't get around this - for now.