I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial impleme
The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest-byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of four conditional instructions and a best-case of 1. Not only is this the least amount of instructions, but the least amount of branches which is super-important on modern hardware.
Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value and find the lowest non-zero byte, then use this value to lookup the real index.
This does require 256-bytes of memory, but if the speed of this function is so important then that 256-bytes is well worth it,
E.g.
byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};
unsigned GetLowestBitPos(unsigned value)
{
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
byte* bytes = (byte*)value;
if (bytes[0])
return lowestBitTable[bytes[0]];
else if (bytes[1])
return lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
return lowestBitTable[bytes[2]] + 16;
else
return lowestBitTable[bytes[3]] + 24;
}
Found this clever trick using 'magic masks' in "The art of programming, part 4", which does it in O(log(n)) time for n-bit number. [with log(n) extra space]. Typical solutions checking for the set bit is either O(n) or need O(n) extra space for a look up table, so this is a good compromise.
Magic masks:
m0 = (...............01010101)
m1 = (...............00110011)
m2 = (...............00001111)
m3 = (.......0000000011111111)
....
Key idea: No of trailing zeros in x = 1 * [(x & m0) = 0] + 2 * [(x & m1) = 0] + 4 * [(x & m2) = 0] + ...
int lastSetBitPos(const uint64_t x) {
if (x == 0) return -1;
//For 64 bit number, log2(64)-1, ie; 5 masks needed
int steps = log2(sizeof(x) * 8); assert(steps == 6);
//magic masks
uint64_t m[] = { 0x5555555555555555, // .... 010101
0x3333333333333333, // .....110011
0x0f0f0f0f0f0f0f0f, // ...00001111
0x00ff00ff00ff00ff, //0000000011111111
0x0000ffff0000ffff,
0x00000000ffffffff };
//Firstly extract only the last set bit
uint64_t y = x & -x;
int trailZeros = 0, i = 0 , factor = 0;
while (i < steps) {
factor = ((y & m[i]) == 0 ) ? 1 : 0;
trailZeros += factor * pow(2,i);
++i;
}
return (trailZeros+1);
}
Yet another solution, not the fastest possibly, but seems quite good.
At least it has no branches. ;)
uint32 x = ...; // 0x00000001 0x0405a0c0 0x00602000
x |= x << 1; // 0x00000003 0x0c0fe1c0 0x00e06000
x |= x << 2; // 0x0000000f 0x3c3fe7c0 0x03e1e000
x |= x << 4; // 0x000000ff 0xffffffc0 0x3fffe000
x |= x << 8; // 0x0000ffff 0xffffffc0 0xffffe000
x |= x << 16; // 0xffffffff 0xffffffc0 0xffffe000
// now x is filled with '1' from the least significant '1' to bit 31
x = ~x; // 0x00000000 0x0000003f 0x00001fff
// now we have 1's below the original least significant 1
// let's count them
x = x & 0x55555555 + (x >> 1) & 0x55555555;
// 0x00000000 0x0000002a 0x00001aaa
x = x & 0x33333333 + (x >> 2) & 0x33333333;
// 0x00000000 0x00000024 0x00001444
x = x & 0x0f0f0f0f + (x >> 4) & 0x0f0f0f0f;
// 0x00000000 0x00000006 0x00000508
x = x & 0x00ff00ff + (x >> 8) & 0x00ff00ff;
// 0x00000000 0x00000006 0x0000000d
x = x & 0x0000ffff + (x >> 16) & 0x0000ffff;
// 0x00000000 0x00000006 0x0000000d
// least sign.bit pos. was: 0 6 13
There is an x86 assembly instruction (bsf
) that will do it. :)
More optimized?!
Optimization at this level is inherently architecture dependent. Today's processors are too complex (in terms of branch prediction, cache misses, pipelining) that it's so hard to predict which code is executed faster on which architecture. Decreasing operations from 32 to 9 or things like that might even decrease the performance on some architectures. Optimized code on a single architecture might result in worse code in the other. I think you'd either optimize this for a specific CPU or leave it as it is and let the compiler to choose what it thinks it's better.
unsigned GetLowestBitPos(unsigned value)
{
if (value & 1) return 1;
if (value & 2) return 2;
if (value & 4) return 3;
if (value & 8) return 4;
if (value & 16) return 5;
if (value & 32) return 6;
if (value & 64) return 7;
if (value & 128) return 8;
if (value & 256) return 9;
if (value & 512) return 10;
if (value & 1024) return 11;
if (value & 2048) return 12;
if (value & 4096) return 13;
if (value & 8192) return 14;
if (value & 16384) return 15;
if (value & 32768) return 16;
if (value & 65536) return 17;
if (value & 131072) return 18;
if (value & 262144) return 19;
if (value & 524288) return 20;
if (value & 1048576) return 21;
if (value & 2097152) return 22;
if (value & 4194304) return 23;
if (value & 8388608) return 24;
if (value & 16777216) return 25;
if (value & 33554432) return 26;
if (value & 67108864) return 27;
if (value & 134217728) return 28;
if (value & 268435456) return 29;
if (value & 536870912) return 30;
return 31;
}
50% of all numbers will return on the first line of code.
75% of all numbers will return on the first 2 lines of code.
87% of all numbers will return in the first 3 lines of code.
94% of all numbers will return in the first 4 lines of code.
97% of all numbers will return in the first 5 lines of code.
etc.
I think people that are complaining on how inefficient the worst case scenario for this code don't understand how rare that condition will happen.
After 11 years we finally have: countr_zero
Well done C++20