Efficiently dividing unsigned value by a power of two, rounding up - in CUDA

被撕碎了的回忆 2021-01-21 09:28

I was just reading:

Efficiently dividing unsigned value by a power of two, rounding up

and I was wondering what was the fastest way to do this in CUDA. Of course

6 Answers
  •  鱼传尺愫
    2021-01-21 10:25

    Here is an alternative solution based on population count. I tried only the 32-bit variant and tested it exhaustively against the reference implementation. Since the divisor q is a power of 2, q - 1 consists of exactly s one-bits, so the shift count s is simply the population count of q - 1. The same value q - 1 serves as the mask m that extracts the remainder t of the truncating division; if t is nonzero, the quotient is rounded up by one.

    #include <stdint.h>

    // For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
    __device__ uint32_t reference (uint32_t p, uint32_t q)
    {
        uint32_t r = p / q;
        if ((q * r) < p) r++;
        return r;
    }
    
    // For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
    __device__ uint32_t solution (uint32_t p, uint32_t q)
    {
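        uint32_t r, s, t, m;
        m = q - 1;          // mask of the low s bits, since q = 1 << s
        s = __popc (m);     // shift count = number of set bits in the mask
        r = p >> s;         // truncating quotient, floor(p/q)
        t = p & m;          // remainder of the truncating division
        if (t > 0) r++;     // round up when the division is inexact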
        return r;
    }
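
The exhaustive comparison against the reference mentioned above could be set up roughly as follows. This is only a sketch, not the harness the answer actually used: the names `HOST_DEVICE`, `popcount32`, `ceil_div_pow2`, `ceil_div_ref`, `count_mismatches_sampled` and `check_all` are made up for illustration, and the bit-twiddling host fallback for the population count is an assumption added so that the same file also builds with a plain C++ compiler.

```cpp
#include <cstdint>

// Under nvcc the helpers are callable from host and device code;
// under a plain C++ compiler the qualifier expands to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Population count (hypothetical helper): the __popc intrinsic in device
// code, a classic bit-parallel count everywhere else.
HOST_DEVICE inline uint32_t popcount32(uint32_t x)
{
#ifdef __CUDA_ARCH__
    return (uint32_t)__popc(x);
#else
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0f0f0f0fu;
    return (x * 0x01010101u) >> 24;
#endif
}

// Same logic as reference() in the answer.
HOST_DEVICE inline uint32_t ceil_div_ref(uint32_t p, uint32_t q)
{
    uint32_t r = p / q;
    if ((q * r) < p) r++;
    return r;
}

// Same logic as solution() in the answer; q must be 1 << s, s in [0,31].
HOST_DEVICE inline uint32_t ceil_div_pow2(uint32_t p, uint32_t q)
{
    uint32_t m = q - 1;            // mask of the low s bits
    uint32_t s = popcount32(m);    // shift count
    uint32_t r = p >> s;           // truncating quotient
    if ((p & m) > 0) r++;          // round up on a nonzero remainder
    return r;
}

// Host-side spot check: every shift count, every stride-th dividend.
// stride must be nonzero; stride == 1 is exhaustive but slow on a CPU.
inline uint64_t count_mismatches_sampled(uint32_t stride)
{
    uint64_t bad = 0;
    for (uint32_t s = 0; s < 32; ++s) {
        const uint32_t q = 1u << s;
        for (uint64_t p = 0; p <= 0xffffffffull; p += stride) {
            if (ceil_div_pow2((uint32_t)p, q) != ceil_div_ref((uint32_t)p, q))
                ++bad;
        }
    }
    return bad;
}

#ifdef __CUDACC__
// Grid-stride kernel: checks all 2^32 dividends for one shift count s and
// accumulates the number of disagreements in *mismatches.
__global__ void check_all(uint32_t s, unsigned long long *mismatches)
{
    const uint32_t q = 1u << s;
    const uint64_t step = (uint64_t)gridDim.x * blockDim.x;
    for (uint64_t p = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x;
         p <= 0xffffffffull; p += step) {
        if (ceil_div_pow2((uint32_t)p, q) != ceil_div_ref((uint32_t)p, q))
            atomicAdd(mismatches, 1ull);
    }
}
#endif
```

On a GPU, one would presumably launch `check_all` once per shift count from 0 to 31 with a zero-initialized device counter and expect it to stay at zero; the launch and memory management code is left out here.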
    

    Whether solution() is faster than the previously posted code will likely depend on the specific GPU architecture. With CUDA 8.0, it compiles to the following sequence of PTX instructions:

    add.s32         %r3, %r2, -1;
    popc.b32        %r4, %r3;
    shr.u32         %r5, %r1, %r4;
    and.b32         %r6, %r3, %r1;
    setp.ne.s32     %p1, %r6, 0;
    selp.u32        %r7, 1, 0, %p1;
    add.s32         %r8, %r5, %r7;
    

    For sm_5x, this translates into machine code pretty much 1:1, except that the two instructions SETP and SELP get contracted into a single ICMP, because the comparison is with 0.
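
To make the PTX listing easier to follow, here is a rough statement-by-statement transcription, one line per instruction. It is only an illustration: `ptx_walkthrough` and `popc_b32` are hypothetical names, the loop is a plain stand-in for the hardware `popc.b32`, and the register roles (`%r1` holding p, `%r2` holding q) are read off the data flow of the listing rather than stated in the answer.

```cpp
#include <cstdint>

// Callable from host and device under nvcc, plain function otherwise.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Stand-in for popc.b32: clear the lowest set bit until none remain.
HOST_DEVICE inline uint32_t popc_b32(uint32_t x)
{
    uint32_t n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

// r1 = p, r2 = q; q is assumed to be a power of two in [1, 2^31].
HOST_DEVICE inline uint32_t ptx_walkthrough(uint32_t r1, uint32_t r2)
{
    uint32_t r3 = r2 - 1u;         // add.s32      %r3, %r2, -1;     m = q - 1
    uint32_t r4 = popc_b32(r3);    // popc.b32     %r4, %r3;         s = popc(m)
    uint32_t r5 = r1 >> r4;        // shr.u32      %r5, %r1, %r4;    p >> s
    uint32_t r6 = r3 & r1;         // and.b32      %r6, %r3, %r1;    t = p & m
    bool     p1 = (r6 != 0u);      // setp.ne.s32  %p1, %r6, 0;      t != 0 ?
    uint32_t r7 = p1 ? 1u : 0u;    // selp.u32     %r7, 1, 0, %p1;   0 or 1
    uint32_t r8 = r5 + r7;         // add.s32      %r8, %r5, %r7;    rounded result
    return r8;
}
```

The `if (t > 0) r++` of the source has become a branch-free select and add, and the `p1`/`r7` pair is what the answer describes as being contracted into a single ICMP on sm_5x.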
