Efficiently dividing unsigned value by a power of two, rounding up - in CUDA

被撕碎了的回忆 2021-01-21 09:28

I was just reading:

Efficiently dividing unsigned value by a power of two, rounding up

and I was wondering what was the fastest way to do this in CUDA. Of course

6 answers

鱼传尺愫 (original poster)

2021-01-21 10:25
Here is an alternative solution via population count. I tried the 32-bit variant only, testing it exhaustively against the reference implementation. Since the divisor q is a power of 2, q - 1 consists of exactly s one-bits, so we can trivially determine the shift count s with the help of the population count operation. The remainder t of the truncating division can be computed with a simple mask m derived directly from the divisor q.
```
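#include <stdint.h>

// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t reference (uint32_t p, uint32_t q)
{
    uint32_t r = p / q;          // truncating division
    if ((q * r) < p) r++;        // round up if there was a remainder
    return r;
}

// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t solution (uint32_t p, uint32_t q)
{
    uint32_t r, s, t, m;
    m = q - 1;                   // mask of the low s bits (q is a power of two)
    s = __popc (m);              // m has exactly s one-bits, so popc(m) == s
    r = p >> s;                  // truncating division p / q
    t = p & m;                   // remainder p % q
    if (t > 0) r++;              // round up if the remainder is nonzero
    return r;
}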
```
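Since only the 32-bit variant was tried, here is a sketch of how the same idea might carry over to 64 bits. It has not been tested exhaustively. The names `HOST_DEVICE`, `popcount64` and `solution64` are hypothetical and not part of the answer; `__popcll` is the 64-bit population count intrinsic for CUDA device code. The host fallback is only there so the logic can be checked without a GPU.

```cpp
#include <cstdint>

// Compiles under nvcc as CUDA, or as plain C++ for a host-only check.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: number of one-bits in a 64-bit value.
HOST_DEVICE inline int popcount64 (uint64_t x)
{
#ifdef __CUDA_ARCH__
    return __popcll (x);              // CUDA device intrinsic
#else
    int n = 0;                        // portable host fallback
    while (x) { x &= x - 1; n++; }    // clear the lowest set bit each round
    return n;
#endif
}

// For q = (1 << s) with s in [0,63], compute ceil(p/q).
// Assumption: q is a power of two; other values give wrong results.
HOST_DEVICE inline uint64_t solution64 (uint64_t p, uint64_t q)
{
    uint64_t m = q - 1;               // mask of the low s bits
    int s = popcount64 (m);           // shift count
    uint64_t r = p >> s;              // truncating division
    if ((p & m) != 0) r++;            // round up on a nonzero remainder
    return r;
}
```

As in the 32-bit code, the result cannot overflow: when q is 1 the mask is 0, so r is never incremented.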
Whether solution() is faster than the previously posted code will likely depend on the specific GPU architecture. Using CUDA 8.0, it compiles to the following sequence of PTX instructions:
```
add.s32         %r3, %r2, -1;
popc.b32        %r4, %r3;
shr.u32         %r5, %r1, %r4;
and.b32         %r6, %r3, %r1;
setp.ne.s32     %p1, %r6, 0;
selp.u32        %r7, 1, 0, %p1;
add.s32         %r8, %r5, %r7;
```
For sm_5x, this translates into machine code pretty much 1:1, except that the two instructions SETP and SELP get contracted into a single ICMP, because the comparison is with 0.
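The PTX listed above is branch-free: the comparison result is turned into 0 or 1 and added to the shifted value. The same structure can be written directly in the source, as in the sketch below. The names `HOST_DEVICE`, `popcount32` and `solution_branchless` are hypothetical. Whether the compiler emits the same instructions for this form has not been verified here.

```cpp
#include <cstdint>

// Compiles under nvcc as CUDA, or as plain C++ for a host-only check.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: number of one-bits in a 32-bit value.
HOST_DEVICE inline uint32_t popcount32 (uint32_t x)
{
#ifdef __CUDA_ARCH__
    return (uint32_t)__popc (x);      // CUDA device intrinsic
#else
    uint32_t n = 0;                   // portable host fallback
    while (x) { x &= x - 1; n++; }    // clear the lowest set bit each round
    return n;
#endif
}

// For q = (1 << s) with s in [0,31], compute ceil(p/q) without an if.
HOST_DEVICE inline uint32_t solution_branchless (uint32_t p, uint32_t q)
{
    uint32_t m = q - 1;               // mask of the low s bits
    uint32_t s = popcount32 (m);      // shift count
    return (p >> s) + ((p & m) != 0); // add 1 when the remainder is nonzero
}
```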