I was just reading "Efficiently dividing unsigned value by a power of two, rounding up" and I was wondering what was the fastest way to do this in CUDA. Of course
Here is an alternative solution via population count. I tried the 32-bit variant only, testing it exhaustively against the reference implementation. Since the divisor q is a power of 2, the mask m = q - 1 consists of exactly s one bits, so we can trivially determine the shift count s with the help of the population count operation. The remainder t of the truncating division can be computed with the same mask m, derived directly from the divisor q.
// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t reference (uint32_t p, uint32_t q)
{
    uint32_t r = p / q;
    if ((q * r) < p) r++;
    return r;
}

// For p in [0,0xffffffff], q = (1 << s) with s in [0,31], compute ceil(p/q)
__device__ uint32_t solution (uint32_t p, uint32_t q)
{
    uint32_t r, s, t, m;
    m = q - 1;
    s = __popc (m);
    r = p >> s;
    t = p & m;
    if (t > 0) r++;
    return r;
}
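The comparison against the reference implementation mentioned above can be sketched on the host. This is a minimal sketch under stated assumptions, not the harness the answer was actually tested with: `HD`, `popc32`, `ceil_div_ref`, `ceil_div_popc`, and `count_mismatches` are hypothetical names, and `popc32` falls back to a portable bit loop in place of `__popc` so that the same source should build with a plain C++ compiler as well as with nvcc.

```cpp
#include <cstdint>
#include <cassert>

// Expands to __host__ __device__ under nvcc and to nothing in plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Population count: __popc in device code, a portable bit loop elsewhere.
HD inline uint32_t popc32(uint32_t x)
{
#ifdef __CUDA_ARCH__
    return (uint32_t)__popc(x);
#else
    uint32_t n = 0;
    while (x) { x &= x - 1; n++; }   // clear the lowest set bit each pass
    return n;
#endif
}

// Reference: truncating division followed by a correction step.
HD inline uint32_t ceil_div_ref(uint32_t p, uint32_t q)
{
    uint32_t r = p / q;
    if ((q * r) < p) r++;
    return r;
}

// Popcount variant from the answer; q is assumed to be a power of two.
HD inline uint32_t ceil_div_popc(uint32_t p, uint32_t q)
{
    uint32_t m = q - 1;          // mask of the low s bits
    uint32_t s = popc32(m);      // shift count
    uint32_t r = p >> s;         // truncated quotient
    if ((p & m) > 0) r++;        // round up on a nonzero remainder
    return r;
}

// Host-side check over all 32 divisors. step = 1 visits every 32-bit p;
// a larger step only samples the dividends.
inline uint64_t count_mismatches(uint32_t step)
{
    assert(step > 0);
    uint64_t bad = 0;
    for (uint32_t s = 0; s < 32; s++) {
        uint32_t q = 1u << s;
        for (uint64_t p = 0; p <= 0xffffffffull; p += step) {
            if (ceil_div_ref((uint32_t)p, q) != ceil_div_popc((uint32_t)p, q)) bad++;
        }
    }
    return bad;
}
```

As a worked case, p = 13 and q = 4 give m = 3, s = 2, r = 13 >> 2 = 3 and t = 13 & 3 = 1, so the result is 4. Calling `count_mismatches(1)` walks all 32 x 2^32 input pairs, which takes a while on a CPU; a larger step trades coverage for run time.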
Whether solution() is faster than the previously posted code will likely depend on the specific GPU architecture. Using CUDA 8.0, it compiles to the following sequence of PTX instructions:
add.s32 %r3, %r2, -1;
popc.b32 %r4, %r3;
shr.u32 %r5, %r1, %r4;
and.b32 %r6, %r3, %r1;
setp.ne.s32 %p1, %r6, 0;
selp.u32 %r7, 1, 0, %p1;
add.s32 %r8, %r5, %r7;
For sm_5x, this translates into machine code pretty much 1:1, except that the two instructions SETP and SELP get contracted into a single ICMP, because the comparison is with 0.
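The PTX shows that the compiler has already turned the `if` into a compare-and-select followed by an add. The same data flow can be spelled out directly in the source, as in the sketch below. I have not checked whether this changes the generated code on any particular toolchain, so treat it as an illustration only; `HD`, `popc32`, and `ceil_div_pow2_branchless` are hypothetical names, and the bit loop stands in for `__popc` when the file is built as plain C++.

```cpp
#include <cstdint>
#include <cassert>

// Expands to __host__ __device__ under nvcc and to nothing in plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Population count: __popc in device code, a portable bit loop elsewhere.
HD inline uint32_t popc32(uint32_t x)
{
#ifdef __CUDA_ARCH__
    return (uint32_t)__popc(x);
#else
    uint32_t n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
#endif
}

// ceil(p/q) for q a power of two, written without an explicit branch:
// the comparison yields 0 or 1, which is added to the truncated quotient.
HD inline uint32_t ceil_div_pow2_branchless(uint32_t p, uint32_t q)
{
    assert(q != 0 && (q & (q - 1)) == 0);   // precondition: power of two
    uint32_t m = q - 1;
    uint32_t s = popc32(m);
    return (p >> s) + ((p & m) != 0);
}
```

The sum cannot overflow: the quotient only reaches 0xffffffff when q = 1, and in that case the mask is 0 and nothing is added.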