What's the most efficient way to calculate the warp id / lane id in a 1-D grid?

逝去的感伤 2021-02-04 15:15

In CUDA, each thread knows its block index in the grid and thread index within the block. But two important values do not seem to be explicitly available to it:

  • It
2 answers
  •  小蘑菇 (original poster)
    2021-02-04 15:21

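    The other answer is very dangerous! Compute the lane-id and warp-id yourself from the thread index instead of trusting the %warpid special register. This test program reads both registers and compares them with the computed values: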

    #include <cstdio>
    #include <cuda_runtime.h>
    
    inline __device__ unsigned get_lane_id() {
      unsigned ret;
      asm volatile("mov.u32 %0, %laneid;" : "=r"(ret));
      return ret;
    }
    
    inline __device__ unsigned get_warp_id() {
      unsigned ret;
      asm volatile("mov.u32 %0, %warpid;" : "=r"(ret));
      return ret;
    }
    
    __global__ void kernel() {
      const int actual_warpid = get_warp_id();
      const int actual_laneid = get_lane_id();
      const int expected_warpid = threadIdx.x / 32;
      const int expected_laneid = threadIdx.x % 32;
      if (expected_laneid == 0) {
        printf("[warp:] actual: %i  expected: %i\n", actual_warpid,
               expected_warpid);
        printf("[lane:] actual: %i  expected: %i\n", actual_laneid,
               expected_laneid);
      }
    }
    
    int main(int argc, char const *argv[]) {
      dim3 grid(8, 7, 1);
      dim3 block(4 * 32, 1);
    
      kernel<<<grid, block>>>();
      cudaDeviceSynchronize();
      return 0;
    }
    

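    Running it prints something like the following. Note that the actual warp ids do not match the ones computed from threadIdx.x, while the lane ids do: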

    [warp:] actual: 4  expected: 3
    [warp:] actual: 10  expected: 0
    [warp:] actual: 1  expected: 1
    [warp:] actual: 12  expected: 1
    [warp:] actual: 4  expected: 3
    [warp:] actual: 0  expected: 0
    [warp:] actual: 13  expected: 2
    [warp:] actual: 12  expected: 1
    [warp:] actual: 6  expected: 1
    [warp:] actual: 6  expected: 1
    [warp:] actual: 13  expected: 2
    [warp:] actual: 10  expected: 0
    [warp:] actual: 1  expected: 1
    ...
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    [lane:] actual: 0  expected: 0
    

    See also the PTX docs on %warpid:

    A predefined, read-only special register that returns the thread's warp identifier. The warp identifier provides a unique warp number within a CTA but not across CTAs within a grid. The warp identifier will be the same for all threads within a single warp.

    Note that %warpid is volatile and returns the location of a thread at the moment when read, but its value may change during execution, e.g., due to rescheduling of threads following preemption.

    Hence, %warpid is the warp id as seen by the hardware scheduler, with no guarantee that it matches the virtual warp-id (counted from 0 within the block).

    The docs make this clear:

    For this reason, %ctaid and %tid should be used to compute a virtual warp index if such a value is needed in kernel code; %warpid is intended mainly to enable profiling and diagnostic code to sample and log information such as work place mapping and load distribution.
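
    A minimal sketch of that advice (not part of the original answer): the virtual warp and lane ids are derived from the thread index alone. The names HD, kWarpSize, WarpPos and virtual_warp_pos are hypothetical, and a warp size of 32 is assumed; device code could read the built-in warpSize instead. The HD guard is only there so the helper also builds as plain C++ for host-side checks.

```cpp
#include <cassert>
#include <cstdio>

// Under nvcc the helper is callable from host and device; otherwise plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Assumption: 32 threads per warp (true for current NVIDIA GPUs).
constexpr unsigned kWarpSize = 32;

struct WarpPos {
  unsigned warp;  // virtual warp id, counted from 0 within the block
  unsigned lane;  // lane id within that warp, 0..kWarpSize-1
};

// Linearize the thread index (x fastest, then y, then z) and split it
// into a virtual warp id and a lane id.
HD inline WarpPos virtual_warp_pos(unsigned tx, unsigned ty, unsigned tz,
                                   unsigned bdx, unsigned bdy) {
  const unsigned linear = tx + bdx * (ty + bdy * tz);
  return {linear / kWarpSize, linear % kWarpSize};
}

#ifdef __CUDACC__
__global__ void kernel() {
  const WarpPos p = virtual_warp_pos(threadIdx.x, threadIdx.y, threadIdx.z,
                                     blockDim.x, blockDim.y);
  // One line per warp of the first block only, to keep the output short.
  if (p.lane == 0 && blockIdx.x == 0 && blockIdx.y == 0) {
    printf("[virtual] warp: %u\n", p.warp);
  }
}

int main() {
  // Host-side sanity check: thread 33 of a 128-wide block is lane 1 of warp 1.
  assert(virtual_warp_pos(33, 0, 0, 128, 1).warp == 1);
  assert(virtual_warp_pos(33, 0, 0, 128, 1).lane == 1);

  kernel<<<dim3(8, 7, 1), dim3(4 * 32, 1, 1)>>>();
  cudaDeviceSynchronize();
  return 0;
}
#endif
```

    For a 1-D block this reduces to threadIdx.x / 32 and threadIdx.x % 32, i.e. the "expected" values in the test program above.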

    If you are thinking of using CUB for this instead: the same caveat applies to cub::WarpId(), whose documentation says:

    Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block.

    EDIT: Using %laneid seems to be safe.
