What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?

终归单人心 2020-11-22 03:35

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of

27 answers
  • 2020-11-22 04:07

    Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:

    Searches the source operand for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand. The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. If the content of the source operand is 0, the content of the destination operand is undefined.

    (If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)

    Example code for gcc:

    #include <iostream>
    
    int main (int,char**)
    {
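      // Start at 1: BSR leaves the destination undefined when the source is 0.
      // The loop is bounded so the demo terminates before n can overflow.
      for (int n=1; n<=64; ++n) {
        int msb;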
        asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
        std::cout << n << " : " << msb << std::endl;
      }
      return 0;
    }
    

    See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.
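
    Where inline assembly is not an option, GCC and Clang offer the builtin __builtin_clz, which the compiler typically lowers to BSR or a similar instruction on x86. The following is a minimal sketch rather than part of the original answer: the helper name msb_index_builtin and the -1 return value for zero are assumptions made for illustration, chosen because __builtin_clz, like BSR, has an undefined result for a zero input.

```c
#include <limits.h>

/* Hypothetical helper (not from the original answer): returns the 0-based
   index of the highest set bit, or -1 when n == 0. Requires GCC or Clang. */
static int msb_index_builtin(unsigned int n)
{
    if (n == 0)
        return -1;  /* __builtin_clz(0) is undefined, so handle it here */

    /* Width of unsigned int in bits, minus one, minus the leading zeros. */
    return (int)(sizeof(unsigned int) * CHAR_BIT) - 1 - __builtin_clz(n);
}
```

    For the nonzero values printed by the bsrl loop above, msb_index_builtin(n) should report the same bit index.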

  • 2020-11-22 04:08

    Kaz Kylheku here

    I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.

    (I happen to need this "find highest bit" for something, you see.)

    I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.

    The decision tree (highest_bit_unrolled) benchmarked 69% faster than the binary search, except in the n = 0 case, for which the binary search has an explicit test.

    For n = 0, that explicit test makes the binary search only 48% faster than the decision tree, which has no special test.

    Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 MHz Intel Core i5).

    int highest_bit_unrolled(long long n)
    {
      if (n & 0x7FFFFFFF00000000) {
        if (n & 0x7FFF000000000000) {
          if (n & 0x7F00000000000000) {
            if (n & 0x7000000000000000) {
              if (n & 0x4000000000000000)
                return 63;
              else
                return (n & 0x2000000000000000) ? 62 : 61;
            } else {
              if (n & 0x0C00000000000000)
                return (n & 0x0800000000000000) ? 60 : 59;
              else
                return (n & 0x0200000000000000) ? 58 : 57;
            }
          } else {
            if (n & 0x00F0000000000000) {
              if (n & 0x00C0000000000000)
                return (n & 0x0080000000000000) ? 56 : 55;
              else
                return (n & 0x0020000000000000) ? 54 : 53;
            } else {
              if (n & 0x000C000000000000)
                return (n & 0x0008000000000000) ? 52 : 51;
              else
                return (n & 0x0002000000000000) ? 50 : 49;
            }
          }
        } else {
          if (n & 0x0000FF0000000000) {
            if (n & 0x0000F00000000000) {
              if (n & 0x0000C00000000000)
                return (n & 0x0000800000000000) ? 48 : 47;
              else
                return (n & 0x0000200000000000) ? 46 : 45;
            } else {
              if (n & 0x00000C0000000000)
                return (n & 0x0000080000000000) ? 44 : 43;
              else
                return (n & 0x0000020000000000) ? 42 : 41;
            }
          } else {
            if (n & 0x000000F000000000) {
              if (n & 0x000000C000000000)
                return (n & 0x0000008000000000) ? 40 : 39;
              else
                return (n & 0x0000002000000000) ? 38 : 37;
            } else {
              if (n & 0x0000000C00000000)
                return (n & 0x0000000800000000) ? 36 : 35;
              else
                return (n & 0x0000000200000000) ? 34 : 33;
            }
          }
        }
      } else {
        if (n & 0x00000000FFFF0000) {
          if (n & 0x00000000FF000000) {
            if (n & 0x00000000F0000000) {
              if (n & 0x00000000C0000000)
                return (n & 0x0000000080000000) ? 32 : 31;
              else
                return (n & 0x0000000020000000) ? 30 : 29;
            } else {
              if (n & 0x000000000C000000)
                return (n & 0x0000000008000000) ? 28 : 27;
              else
                return (n & 0x0000000002000000) ? 26 : 25;
            }
          } else {
            if (n & 0x0000000000F00000) {
              if (n & 0x0000000000C00000)
                return (n & 0x0000000000800000) ? 24 : 23;
              else
                return (n & 0x0000000000200000) ? 22 : 21;
            } else {
              if (n & 0x00000000000C0000)
                return (n & 0x0000000000080000) ? 20 : 19;
              else
                return (n & 0x0000000000020000) ? 18 : 17;
            }
          }
        } else {
          if (n & 0x000000000000FF00) {
            if (n & 0x000000000000F000) {
              if (n & 0x000000000000C000)
                return (n & 0x0000000000008000) ? 16 : 15;
              else
                return (n & 0x0000000000002000) ? 14 : 13;
            } else {
              if (n & 0x0000000000000C00)
                return (n & 0x0000000000000800) ? 12 : 11;
              else
                return (n & 0x0000000000000200) ? 10 : 9;
            }
          } else {
            if (n & 0x00000000000000F0) {
              if (n & 0x00000000000000C0)
                return (n & 0x0000000000000080) ? 8 : 7;
              else
                return (n & 0x0000000000000020) ? 6 : 5;
            } else {
              if (n & 0x000000000000000C)
                return (n & 0x0000000000000008) ? 4 : 3;
              else
                return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
            }
          }
        }
      }
    }
    
    int highest_bit(long long n)
    {
      const long long mask[] = {
        0x000000007FFFFFFF,
        0x000000000000FFFF,
        0x00000000000000FF,
        0x000000000000000F,
        0x0000000000000003,
        0x0000000000000001
      };
      int hi = 64;
      int lo = 0;
      int i = 0;
    
      if (n == 0)
        return 0;
    
      for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
        int mi = lo + (hi - lo) / 2;
    
        if ((n >> mi) != 0)
          lo = mi;
        else if ((n & (mask[i] << lo)) != 0)
          hi = mi;
      }
    
      return lo + 1;
    }
    

    Quick and dirty test program:

    #include <stdio.h>
    #include <time.h>
    #include <stdlib.h>
    
    int highest_bit_unrolled(long long n);
    int highest_bit(long long n);
    
    int main(int argc, char **argv)
    {
      /* Expects the number to test as the first argument; falls back to 1. */
      long long n = (argc > 1) ? strtoull(argv[1], NULL, 0) : 1;
      int b1, b2;
      long i;
      clock_t start = clock(), mid, end;
    
      for (i = 0; i < 1000000000; i++)
        b1 = highest_bit_unrolled(n);
    
      mid = clock();
    
      for (i = 0; i < 1000000000; i++)
        b2 = highest_bit(n);
    
      end = clock();
    
      printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);
    
      printf("time1 = %d\n", (int) (mid - start));
      printf("time2 = %d\n", (int) (end - mid));
      return 0;
    }
    

    Using only -O2, the difference becomes greater. The decision tree is almost four times faster.

    I also benchmarked against the naive bit shifting code:

    int highest_bit_shift(long long n)
    {
      int i = 0;
      for (; n; n >>= 1, i++)
        ; /* empty */
      return i;
    }
    

    This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!

    On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.

    I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticky branch prediction going on, and perhaps some unrealistic caching scenarios, which make it artificially faster on repetitions.
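
    As a rough sketch of what such a follow-up could look like (this is not the author's benchmark), the code below pairs a builtin-based variant with a small input generator that varies the bit length from call to call. The names highest_bit_builtin, highest_bit_reference, xorshift64 and count_mismatches are hypothetical, and the sketch assumes GCC or Clang with a 64-bit long long. It only checks agreement with the naive shift loop; timing code would still need to be wrapped around it.

```c
#include <stdint.h>

/* Hypothetical builtin-based variant with the same convention as
   highest_bit_unrolled: 1-based position, 0 for n == 0. */
static int highest_bit_builtin(long long n)
{
    return n ? 64 - __builtin_clzll((unsigned long long)n) : 0;
}

/* Reference result: the naive shift loop, valid for non-negative n. */
static int highest_bit_reference(long long n)
{
    int i = 0;
    for (; n; n >>= 1, i++)
        ; /* empty */
    return i;
}

/* One xorshift64 step; *state must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Feeds `count` inputs of varying bit length to both functions and
   returns how many times their results disagree. */
static long count_mismatches(uint64_t seed, long count)
{
    uint64_t state = seed ? seed : 1;
    long mismatches = 0;
    long i;

    for (i = 0; i < count; i++) {
        uint64_t x = xorshift64(&state);
        /* Clear the sign bit, then shift right by 0..61 to vary the length. */
        long long n = (long long)((x >> 1) >> (x % 62));

        if (highest_bit_builtin(n) != highest_bit_reference(n))
            mismatches++;
    }
    return mismatches;
}
```

    Because every call sees a different input, the branch predictor cannot simply memorize one path through the decision tree, which is the effect the paragraph above is worried about.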

  • 2020-11-22 04:08

    I had a need for a routine to do this, and before searching the web (and finding this page) I came up with my own solution based on a binary search, although I'm sure someone has done this before. It runs in constant time and can be faster than the "obvious" solution posted. I'm not making any great claims, just posting it for interest.

    int highest_bit(unsigned int a) {
      static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
      const unsigned int *mask = maskv;
      int l, h;
    
      if (a == 0) return -1;
    
      l = 0;
      h = 32;
    
      do {
        int m = l + (h - l) / 2;
    
        if ((a >> m) != 0) l = m;
        else if ((a & (*mask << l)) != 0) h = m;
    
        mask++;
      } while (l < h - 1);
    
      return l;
    }
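
    For comparison, the same halving idea can also be written without the mask table or the loop. This is only a sketch under the assumption that unsigned int is 32 bits wide; the name highest_bit_halving is made up here, and the -1 result for zero mirrors the function above.

```c
/* Hypothetical loop-free variant of the binary search above.
   Assumes a 32-bit unsigned int. Returns the 0-based index of the
   highest set bit, or -1 when a == 0. */
static int highest_bit_halving(unsigned int a)
{
    int pos = 0;

    if (a == 0)
        return -1;

    /* Each step asks whether the upper half of the remaining window is
       nonzero; if so, discard the lower half and record its width. */
    if (a >> 16) { a >>= 16; pos += 16; }
    if (a >> 8)  { a >>= 8;  pos += 8;  }
    if (a >> 4)  { a >>= 4;  pos += 4;  }
    if (a >> 2)  { a >>= 2;  pos += 2;  }
    if (a >> 1)  { pos += 1; }

    return pos;
}
```

    It always performs five comparisons, which is the constant-time behaviour described above.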
    
  • 2020-11-22 04:08

    Think bitwise operators.

    I misunderstood the question the first time. You should produce an int with only the leftmost bit set (the others zero). Assuming cmp is set to that value:

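    /* n is treated as an unsigned int so that the left shift is well defined.
       position ends up as the 1-based position of the MSB, or 0 if n == 0. */
    unsigned int cmp = 1u << (sizeof(unsigned int)*8 - 1);  /* only the leftmost bit set */
    int position = (n != 0) ? (int)(sizeof(unsigned int)*8) : 0;
    while (n != 0 && !(n & cmp)) {
       n <<= 1;
       position--;
    }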
    
  • 2020-11-22 04:09

    I know this question is very old, but just having implemented an msb() function myself, I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:

    Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01. See where I'm going with this? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not as "linear" as it might look at first glance.

    It can be shown [1] that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an average-case time complexity of O(1) with respect to the number of bits (!).

    Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).

    Here is the "naïve" approach I've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires fewer than 2 on average) - sorry for this being C++ and not pure C:

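    #include <limits>
    #include <type_traits>
    
    template <typename T>
    auto msb(T n) -> int
    {
        static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
            "msb<T>(): T must be an unsigned integral type.");
    
        // The index is a signed int so that "i >= 0" can become false (with an
        // unsigned index the loop never ends for n == 0). T(1) rather than the
        // int literal 1 is shifted so that types wider than int work.
        int i = std::numeric_limits<T>::digits - 1;
        for (T mask = T(1) << i; i >= 0; --i, mask >>= 1)
        {
            if ((n & mask) != 0)
                return i;
        }
    
        return 0;
    }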
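
    Since the question asks about C, here is a rough plain-C rendering of the same left-to-right scan, fixed to unsigned int. It is a sketch, not the author's code: the name msb_scan is made up, and returning 0 for a zero input simply copies the fallback of the C++ version.

```c
#include <limits.h>

/* Hypothetical C port of the left-to-right scan. Returns the 0-based index
   of the highest set bit; returns 0 for n == 0, like the C++ version. */
static int msb_scan(unsigned int n)
{
    /* Signed index so that the loop condition can become false. */
    int i = (int)(sizeof n * CHAR_BIT) - 1;
    unsigned int mask = 1u << i;   /* start at the leftmost bit */

    for (; i >= 0; --i, mask >>= 1) {
        if (n & mask)
            return i;
    }
    return 0;
}
```

    For inputs with the top bit set the loop exits on its first iteration, which is the case the argument above relies on.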
    

    Update: While what I wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern. In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater than or equal to an object size. My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster when looping through all integers.
    TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it is O(1) on average for truly arbitrary integers.
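
    To make the object-size use case concrete, here is a small C sketch of the two operations mentioned in the update. It is an illustration only: the names is_pow2 and next_pow2 are hypothetical, the code relies on the GCC/Clang builtin __builtin_clz rather than the msb() template, and returning 0 when the result does not fit in unsigned int is an arbitrary choice.

```c
#include <limits.h>

/* Nonzero if x is a power of two: exactly one bit is set. */
static int is_pow2(unsigned int x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Smallest power of two greater than or equal to x.
   Returns 0 if that value does not fit in unsigned int. */
static unsigned int next_pow2(unsigned int x)
{
    int width = (int)(sizeof x * CHAR_BIT);
    int msb;

    if (x <= 1)
        return 1;

    /* 0-based MSB of x - 1; x - 1 is nonzero here, so the builtin is safe. */
    msb = width - 1 - __builtin_clz(x - 1);
    if (msb == width - 1)
        return 0;              /* the next power of two would overflow */

    return 1u << (msb + 1);
}
```

    Sizes such as 17 or 1024 have many leading zero bits, so a left-to-right scan would need most of its iterations to find their MSB.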

    [1] The argument goes like this (rough draft): Let $n$ be the number of bits (bit-width). There are a total of $2^n$ integers which can be represented with $n$ bits. There are $2^{n-1}$ integers starting with a 1 (the first 1 is fixed, the remaining $n-1$ bits can be anything). Those integers require only one iteration of the loop to determine the MSB. Further, there are $2^{n-2}$ integers starting with 01, requiring 2 iterations, $2^{n-3}$ integers starting with 001, requiring 3 iterations, and so on.

    If we sum up all the required iterations for all possible integers and divide them by $2^n$, the total number of integers, we get the average number of iterations needed for determining the MSB for $n$-bit integers:

    $$\frac{1 \cdot 2^{n-1} + 2 \cdot 2^{n-2} + 3 \cdot 2^{n-3} + \dots + n \cdot 2^{0}}{2^n} = \sum_{k=1}^{n} \frac{k}{2^k} = 2 - \frac{n+2}{2^n}$$

    This sequence of averages is convergent and has a limit of 2 as $n$ tends to infinity.

    Thus, the naïve left-to-right algorithm has a constant average-case time complexity of O(1) for any number of bits.
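
    The sum in the footnote can be checked by brute force for small bit-widths. The sketch below is only an illustration: the function names are hypothetical, it counts the nonzero values exactly as the footnote's sum does, and it is meant for widths of up to about 24 bits so that the enumeration stays quick.

```c
/* Iterations a left-to-right scan needs to find the MSB of a nonzero
   value v that fits in `width` bits. */
static int scan_iterations(unsigned long v, int width)
{
    int iterations = 1;
    unsigned long mask = 1ul << (width - 1);

    while (!(v & mask)) {
        mask >>= 1;
        iterations++;
    }
    return iterations;
}

/* Sum of scan_iterations over all nonzero width-bit values, i.e. the
   numerator of the footnote's formula. */
static unsigned long total_iterations(int width)
{
    unsigned long total = 0;
    unsigned long v;

    for (v = 1; v < (1ul << width); v++)
        total += (unsigned long)scan_iterations(v, width);
    return total;
}

/* The footnote's average: total iterations divided by 2^width. */
static double average_iterations(int width)
{
    return (double)total_iterations(width) / (double)(1ul << width);
}
```

    For a width of 8 the total is 502, so the average is 502 / 256, roughly 1.96; for 16 bits the total is 131054, giving an average just below 2.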

  • 2020-11-22 04:10

    Expanding on Josh's benchmark... one can improve the clz as follows

    /***************** clz2 ********************/
    
    #define NUM_OF_HIGHESTBITclz2(a) ((a)                              \
                      ? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
                      : 0)
    

    Regarding the asm: note that there are both bsr and bsrl (the latter is the "long" version, i.e. the form with a 32-bit operand suffix in AT&T syntax). The plain one might be a bit faster.
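
    If evaluating the macro argument more than once is a concern, the same expression can be wrapped in a function. This is a sketch under the same assumptions as the macro (GCC or Clang, unsigned int operand); the name highest_bit_value is made up for illustration.

```c
#include <limits.h>

/* Hypothetical function form of NUM_OF_HIGHESTBITclz2: returns a value with
   only the highest set bit of a kept, or 0 when a == 0. */
static unsigned int highest_bit_value(unsigned int a)
{
    if (a == 0)
        return 0;   /* __builtin_clz(0) is undefined */

    /* Start from the top bit and shift it down past the leading zeros. */
    return (1u << (sizeof(unsigned int) * CHAR_BIT - 1)) >> __builtin_clz(a);
}
```

    Like the macro, this yields the value of the highest set bit rather than its position, so an input of 5 gives 4.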
