I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.
A trivial impleme
Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.
If you have any one instruction of this class you can cheaply emulate the others.
Take a moment to work through it on paper and realise that x & (x-1)
will clear the lowest set bit in x, and ( x & ~(x-1) )
will return just the lowest set bit, irrespective of achitecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.
If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
Another method (modulus division and lookup) deserves a special mention here from the same link provided by @anton-tykhyy. this method is very similar in performance to DeBruijn multiply and lookup method with a slight but important difference.
modulus division and lookup
unsigned int v; // find the number of trailing zeros in v
int r; // put the result in r
static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
{
32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
20, 8, 19, 18
};
r = Mod37BitPosition[(-v & v) % 37];
modulus division and lookup method returns different values for v=0x00000000 and v=FFFFFFFF whereas DeBruijn multiply and lookup method returns zero on both inputs.
test:-
unsigned int n1=0x00000000, n2=0xFFFFFFFF;
MultiplyDeBruijnBitPosition[((unsigned int )((n1 & -n1) * 0x077CB531U)) >> 27]); /* returns 0 */
MultiplyDeBruijnBitPosition[((unsigned int )((n2 & -n2) * 0x077CB531U)) >> 27]); /* returns 0 */
Mod37BitPosition[(((-(n1) & (n1))) % 37)]); /* returns 32 */
Mod37BitPosition[(((-(n2) & (n2))) % 37)]); /* returns 0 */
You could check if any of the lower order bits are set. If so then look at the lower order of the remaining bits. e.g.,:
32bit int - check if any of the first 16 are set. If so, check if any of the first 8 are set. if so, ....
if not, check if any of the upper 16 are set..
Essentially it's binary search.
It can be done with a worst case of less than 32 operations:
Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.
So for example there's nothing stopping you from checking for which grouping its in first, then checking each bit from smallest to biggest in that group.
So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...
Optimal would be to check in groups of 4. Which would require in the worst case 11 operations instead of your 32.
The best case goes from your algorithms's 1 check though to 2 checks if you use this grouping idea. But that extra 1 check in best case is worth it for the worst case savings.
Note: I write it out in full instead of using a loop because it's more efficient that way.
int getLowestBitPos(unsigned int value)
{
//Group 1: Bits 0-3
if(value&0xf)
{
if(value&0x1)
return 0;
else if(value&0x2)
return 1;
else if(value&0x4)
return 2;
else
return 3;
}
//Group 2: Bits 4-7
if(value&0xf0)
{
if(value&0x10)
return 4;
else if(value&0x20)
return 5;
else if(value&0x40)
return 6;
else
return 7;
}
//Group 3: Bits 8-11
if(value&0xf00)
{
if(value&0x100)
return 8;
else if(value&0x200)
return 9;
else if(value&0x400)
return 10;
else
return 11;
}
//Group 4: Bits 12-15
if(value&0xf000)
{
if(value&0x1000)
return 12;
else if(value&0x2000)
return 13;
else if(value&0x4000)
return 14;
else
return 15;
}
//Group 5: Bits 16-19
if(value&0xf0000)
{
if(value&0x10000)
return 16;
else if(value&0x20000)
return 17;
else if(value&0x40000)
return 18;
else
return 19;
}
//Group 6: Bits 20-23
if(value&0xf00000)
{
if(value&0x100000)
return 20;
else if(value&0x200000)
return 21;
else if(value&0x400000)
return 22;
else
return 23;
}
//Group 7: Bits 24-27
if(value&0xf000000)
{
if(value&0x1000000)
return 24;
else if(value&0x2000000)
return 25;
else if(value&0x4000000)
return 26;
else
return 27;
}
//Group 8: Bits 28-31
if(value&0xf0000000)
{
if(value&0x10000000)
return 28;
else if(value&0x20000000)
return 29;
else if(value&0x40000000)
return 30;
else
return 31;
}
return -1;
}
See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF
("bit scan forward") instruction instead of BSR
described there.
Weee, loads of solutions and not a benchmark in sight. You people should be ashamed of yourselves ;-)
My machine is an Intel i530 (2.9 GHz), running Windows 7 64-bit. I compiled with a 32-bit version of MinGW.
$ gcc --version
gcc.exe (GCC) 4.7.2
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2
$ bench
Naive loop. Time = 2.91 (Original questioner)
De Bruijn multiply. Time = 1.16 (Tykhyy)
Lookup table. Time = 0.36 (Andrew Grant)
FFS instruction. Time = 0.90 (ephemient)
Branch free mask. Time = 3.48 (Dan / Jim Balter)
Double hack. Time = 3.41 (DocMax)
$ gcc bench.c -o bench.exe -std=c99 -Wall -O2 -march=native
$ bench
Naive loop. Time = 2.92
De Bruijn multiply. Time = 0.47
Lookup table. Time = 0.35
FFS instruction. Time = 0.68
Branch free mask. Time = 3.49
Double hack. Time = 0.92
My code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE 65536
#define NUM_ITERS 5000 // Number of times to process array
int find_first_bits_naive_loop(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
if (value == 0)
continue;
unsigned pos = 0;
while (!(value & 1))
{
value >>= 1;
++pos;
}
total += pos + 1;
}
}
return total;
}
int find_first_bits_de_bruijn(unsigned nums[ARRAY_SIZE])
{
static const int MultiplyDeBruijnBitPosition[32] =
{
1, 2, 29, 3, 30, 15, 25, 4, 31, 23, 21, 16, 26, 18, 5, 9,
32, 28, 14, 24, 22, 20, 17, 8, 27, 13, 19, 7, 12, 6, 11, 10
};
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int c = nums[i];
total += MultiplyDeBruijnBitPosition[((unsigned)((c & -c) * 0x077CB531U)) >> 27];
}
}
return total;
}
unsigned char lowestBitTable[256];
int get_lowest_set_bit(unsigned num) {
unsigned mask = 1;
for (int cnt = 1; cnt <= 32; cnt++, mask <<= 1) {
if (num & mask) {
return cnt;
}
}
return 0;
}
int find_first_bits_lookup_table(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned int value = nums[i];
// note that order to check indices will depend whether you are on a big
// or little endian machine. This is for little-endian
unsigned char *bytes = (unsigned char *)&value;
if (bytes[0])
total += lowestBitTable[bytes[0]];
else if (bytes[1])
total += lowestBitTable[bytes[1]] + 8;
else if (bytes[2])
total += lowestBitTable[bytes[2]] + 16;
else
total += lowestBitTable[bytes[3]] + 24;
}
}
return total;
}
int find_first_bits_ffs_instruction(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
total += __builtin_ffs(nums[i]);
}
}
return total;
}
int find_first_bits_branch_free_mask(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
int i16 = !(value & 0xffff) << 4;
value >>= i16;
int i8 = !(value & 0xff) << 3;
value >>= i8;
int i4 = !(value & 0xf) << 2;
value >>= i4;
int i2 = !(value & 0x3) << 1;
value >>= i2;
int i1 = !(value & 0x1);
int i0 = (value >> i1) & 1? 0 : -32;
total += i16 + i8 + i4 + i2 + i1 + i0 + 1;
}
}
return total;
}
int find_first_bits_double_hack(unsigned nums[ARRAY_SIZE])
{
int total = 0; // Prevent compiler from optimizing out the code
for (int j = 0; j < NUM_ITERS; j++) {
for (int i = 0; i < ARRAY_SIZE; i++) {
unsigned value = nums[i];
double d = value ^ (value - !!value);
total += (((int*)&d)[1]>>20)-1022;
}
}
return total;
}
int main() {
unsigned nums[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
nums[i] = rand() + (rand() << 15);
}
for (int i = 0; i < 256; i++) {
lowestBitTable[i] = get_lowest_set_bit(i);
}
clock_t start_time, end_time;
int result;
start_time = clock();
result = find_first_bits_naive_loop(nums);
end_time = clock();
printf("Naive loop. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_de_bruijn(nums);
end_time = clock();
printf("De Bruijn multiply. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_lookup_table(nums);
end_time = clock();
printf("Lookup table. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_ffs_instruction(nums);
end_time = clock();
printf("FFS instruction. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_branch_free_mask(nums);
end_time = clock();
printf("Branch free mask. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
start_time = clock();
result = find_first_bits_double_hack(nums);
end_time = clock();
printf("Double hack. Time = %.2f, result = %d\n",
(end_time - start_time) / (double)(CLOCKS_PER_SEC), result);
}