How can I count the number of leading zeros in a 128-bit integer (uint128_t
) efficiently?
I know GCC\'s built-in functions:
__builti
Assuming a 'random' distribution, the first non-zero bit will be in the high 64 bits, with an overwhelming probability, so it makes sense to test that half first.
Have a look at the code generated for:
/* inline */ int clz_u128 (uint128_t u)
{
unsigned long long hi, lo; /* (or uint64_t) */
int b = 128;
if ((hi = u >> 64) != 0) {
b = __builtin_clzll(hi);
}
else if ((lo = u & ~0ULL) != 0) {
b = __builtin_clzll(lo) + 64;
}
return b;
}
I would expect gcc to implement each __builtin_clzll
using the bsrq
instruction - bit scan reverse, i.e., most-significant bit position - in conjunction with an xor
, (msb ^ 63)
, or sub
, (63 - msb)
, to turn it into a leading zero count. gcc might generate lzcnt
instructions with the right -march=
(architecture) options.
Edit: others have pointed out that the 'distribution' is not relevant in this case, since the HI uint64_t needs to be tested regardless.
inline int clz_u128 (uint128_t u) {
uint64_t hi = u>>64;
uint64_t lo = u;
int retval[3]={
__builtin_clzll(hi),
__builtin_clzll(lo)+64,
128
};
int idx = !hi + ((!lo)&(!hi));
return retval[idx];
}
this is a branch free variant. Note that more work is done than in the branchy solution, and in practice the branching will probably be predictable.
It also relies on __builtin_clzll
not crashing when fed 0: the docs say the result is undefined, but is it just unspecified or undefined?
Yakk's answer works well for all kinds of targets as long as gcc supports 128 bit integers for the target. However, note that on the x86-64 platform, with an Intel Haswell processor or newer, there is a more efficient solution:
#include <immintrin.h>
#include <stdint.h>
// tested with compiler options: gcc -O3 -Wall -m64 -mlzcnt
inline int lzcnt_u128 (unsigned __int128 u) {
uint64_t hi = u>>64;
uint64_t lo = u;
lo = (hi == 0) ? lo : -1ULL;
return _lzcnt_u64(hi) + _lzcnt_u64(lo);
}
The _lzcnt_u64 intrinsic compiles (gcc 5.4) to the lzcnt instruction, which is well defined for a zero input (it returns 64), in contrary to gcc's __builtin_clzll(). The ternary operator compiles to the cmove instruction.