What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?

终归单人心 2020-11-22 03:35

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of

27 answers
  • 2020-11-22 04:25

    I'm adding this because it's yet another approach that seems to differ from the ones already given.

    It returns -1 if x == 0, otherwise floor(log2(x)) (maximum result 31).

    It reduces the problem from 32 bits to 4 bits, then uses a table. Perhaps inelegant, but pragmatic.

    This is what I use when I don't want to use __builtin_clz because of portability issues.

    To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, with at most 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8 bits, then one test to reduce to 4.

    int log2floor( unsigned x ){
       static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
       int r = 0;
       unsigned xk = x >> 16;
       if( xk != 0 ){
           r = 16;
           x = xk;
       }
       // x is 0 .. 0xFFFF
       xk = x >> 8;
       if( xk != 0){
           r += 8;
           x = xk;
       }
       // x is 0 .. 0xFF
       xk = x >> 4;
       if( xk != 0){
           r += 4;
           x = xk;
       }
       // now x is 0..15; x=0 only if originally zero.
       return r + wtab[x];
    }
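
The more compact variants mentioned in this answer could look roughly like the sketch below. The function names `log2floor_loop` and `log2floor64_hybrid` are made up for illustration, and the use of `<stdint.h>` fixed-width types is an assumption; the table is the same one used in the answer's code.

```c
#include <stdint.h>

/* Same 4-bit table as in the answer: floor(log2(i)), with -1 for i == 0. */
static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};

/* 32-bit loop variant: shift right by 4 until the value fits in 4 bits. */
int log2floor_loop(uint32_t x)
{
    int r = 0;
    while (x > 0xF) {      /* at most 7 iterations for a 32-bit input */
        x >>= 4;
        r += 4;
    }
    return r + wtab[x];    /* x == 0 here only if the input was 0 */
}

/* 64-bit hybrid: loop down to 8 bits, then one test down to 4 bits. */
int log2floor64_hybrid(uint64_t x)
{
    int r = 0;
    while (x > 0xFF) {     /* at most 7 iterations for a 64-bit input */
        x >>= 8;
        r += 8;
    }
    if (x > 0xF) {
        x >>= 4;
        r += 4;
    }
    return r + wtab[x];
}
```

Both functions keep the convention of returning -1 for an input of 0. Whether the loop is faster or slower than the unrolled tests depends on the compiler and target, so it is worth measuring.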
    
  • 2020-11-22 04:26

    Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.

    The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, the code below builds it once in a static initializer and keeps it in a read-only static byte array:

    readonly static byte[] msb_tab_15;
    
    // Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
    // of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
    // The table is compressed by half, so use (value >> 1) for indexing.
    static MyStaticInit()
    {
        var p = new byte[0x8000];
    
        for (byte n = 0; n < 16; n++)
            for (int c = (1 << n) >> 1, i = 0; i < c; i++)
                p[c + i] = n;
    
        msb_tab_15 = p;
    }
    

    The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).

    Notice that the table cannot hold -1 for 0, the sole integer for which the notion of 'highest set bit' is undefined: the entries are bytes, and entry 0 is shared by the values 0 and 1, so it holds 0. The 64-bit and 32-bit versions below therefore handle 0 in an explicit first test and return -1 for it, whereas the 16-bit and 8-bit overloads return 0 for an input of 0. Without further ado, here is the code for each of the various integer primitives:

    ulong (64-bit) Version

    /// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
    public static int HighestOne(this ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 0x40) - 1;      // handles cases v==0 and MSB==63
    
        int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
        j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
        return j + msb_tab_15[v >> (j + 1)];
    }
    

    uint (32-bit) Version

    /// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
    public static int HighestOne(uint v)
    {
        if ((int)v <= 0)
            return (int)((v >> 26) & 0x20) - 1;     // handles cases v==0 and MSB==31
    
        int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
        return j + msb_tab_15[v >> (j + 1)];
    }
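
Since the question asks about C, here is a rough C sketch of the same half-compressed 15-bit table idea for 32-bit inputs. It is not a line-by-line port: it uses plain branches rather than the subtraction trick above, and the names `msb_tab_15_init` and `highest_one32` are hypothetical.

```c
#include <stdint.h>

/* 32 KB table indexed by (16-bit value >> 1); entry i holds the highest
   set bit of the values 2*i and 2*i + 1. */
static uint8_t msb_tab_15[0x8000];

/* Must be called once before highest_one32(). */
void msb_tab_15_init(void)
{
    for (unsigned n = 1; n < 16; n++) {
        unsigned c = 1u << (n - 1);      /* first index whose answer is n */
        for (unsigned i = 0; i < c; i++)
            msb_tab_15[c + i] = (uint8_t)n;
    }
    /* msb_tab_15[0] stays 0: it serves both the value 0 and the value 1. */
}

/* Index of the highest set bit in v, or -1 for v == 0. */
int highest_one32(uint32_t v)
{
    if (v == 0)
        return -1;                       /* the table cannot express this case */
    if (v >> 16)
        return 16 + msb_tab_15[v >> 17]; /* look up the upper half-word */
    return msb_tab_15[v >> 1];
}
```

The explicit zero test is needed because table entry 0 is shared by the inputs 0 and 1.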
    

    Various overloads for the above

    public static int HighestOne(long v) => HighestOne((ulong)v);
    public static int HighestOne(int v) => HighestOne((uint)v);
    public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
    public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
    public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
    public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
    public static int HighestOne(byte v) => msb_tab_15[v >> 1];
    

    This is a complete, working solution that performed best on .NET 4.7.2 among the numerous alternatives I compared using a specialized performance test harness. Some of these are mentioned below. The test inputs covered all 65 cases with uniform density, i.e., highest-bit positions 0 ... 31/63 plus the value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT optimizations enabled.
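
The harness itself is not shown here, so the following is only a hypothetical C sketch of how inputs of that shape could be produced and checked. The xorshift generator and the names `next_random`, `make_input` and `highest_one_naive` are illustrative assumptions, not the author's code.

```c
#include <stdint.h>

/* Small xorshift64 generator; any PRNG would do (illustrative choice). */
static uint64_t rng_state = 0x9E3779B97F4A7C15ull;

static uint64_t next_random(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Value whose highest set bit is exactly `pos` (0..63), with the bits
   below that position filled randomly. */
uint64_t make_input(int pos)
{
    uint64_t top = 1ull << pos;
    return top | (next_random() & (top - 1));
}

/* Straightforward reference to validate candidates against; -1 for 0. */
int highest_one_naive(uint64_t v)
{
    int r = -1;
    while (v) {
        v >>= 1;
        r++;
    }
    return r;
}
```

Drawing `pos` uniformly from 0..63 and adding the value 0 as a 65th case gives the uniform density over outcomes described above.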




    That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.


    The version provided above, coded as Tab16A, was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.

     1  candidates.HighestOne_Tab16A               622,496
     2  candidates.HighestOne_Tab16C               628,234
     3  candidates.HighestOne_Tab8A                649,146
     4  candidates.HighestOne_Tab8B                656,847
     5  candidates.HighestOne_Tab16B               657,147
     6  candidates.HighestOne_Tab16D               659,650
     7  _highest_one_bit_UNMANAGED.HighestOne_U    702,900
     8  de_Bruijn.IndexOfMSB                       709,672
     9  _old_2.HighestOne_Old2                     715,810
    10  _test_A.HighestOne8                        757,188
    11  _old_1.HighestOne_Old1                     757,925
    12  _test_A.HighestOne5  (unsafe)              760,387
    13  _test_B.HighestOne8  (unsafe)              763,904
    14  _test_A.HighestOne3  (unsafe)              766,433
    15  _test_A.HighestOne1  (unsafe)              767,321
    16  _test_A.HighestOne4  (unsafe)              771,702
    17  _test_B.HighestOne2  (unsafe)              772,136
    18  _test_B.HighestOne1  (unsafe)              772,527
    19  _test_B.HighestOne3  (unsafe)              774,140
    20  _test_A.HighestOne7  (unsafe)              774,581
    21  _test_B.HighestOne7  (unsafe)              775,463
    22  _test_A.HighestOne2  (unsafe)              776,865
    23  candidates.HighestOne_NoTab                777,698
    24  _test_B.HighestOne6  (unsafe)              779,481
    25  _test_A.HighestOne6  (unsafe)              781,553
    26  _test_B.HighestOne4  (unsafe)              785,504
    27  _test_B.HighestOne5  (unsafe)              789,797
    28  _test_A.HighestOne0  (unsafe)              809,566
    29  _test_B.HighestOne0  (unsafe)              814,990
    30  _highest_one_bit.HighestOne                824,345
    31  _bitarray_ext.RtlFindMostSignificantBit    894,069
    32  candidates.HighestOne_Naive                898,865

    Notable is the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:

    [DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
    public static extern int RtlFindMostSignificantBit(ulong ul);
    

    It's really too bad, because here's the entire actual function:

        RtlFindMostSignificantBit:
            bsr rdx, rcx  
            mov eax,0FFFFFFFFh  
            movzx ecx, dl  
            cmovne      eax,ecx  
            ret
    

    I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:

    public static int HighestOne_Tab8A(ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 64) - 1;
    
        int j;
        j =  /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
        j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
        j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
        return j + msb_tab_8[v >> j];
    }
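
The `msb_tab_8` table is not shown in this answer. As a sketch in C, assuming it is a plain 256-entry table indexed by a byte value, the byte-wide variant could look like this; the names `msb_tab_8_init` and `highest_one_tab8` are hypothetical, and ordinary branches replace the subtraction trick.

```c
#include <stdint.h>

/* 256-entry table: floor(log2(i)) for i in 1..255, and -1 for i == 0. */
static int8_t msb_tab_8[256];

/* Must be called once before highest_one_tab8(). */
void msb_tab_8_init(void)
{
    msb_tab_8[0] = -1;
    for (int i = 1; i < 256; i++)
        msb_tab_8[i] = (int8_t)(msb_tab_8[i >> 1] + 1);
}

/* Index of the highest set bit in v, or -1 for v == 0. */
int highest_one_tab8(uint64_t v)
{
    int j = 0;
    if (v >> 32)        j  = 32;   /* highest bit is in the upper 32 bits */
    if ((v >> j) >> 16) j += 16;   /* ...in the upper 16 of what remains  */
    if ((v >> j) >> 8)  j += 8;    /* ...in the upper 8 of what remains   */
    return j + msb_tab_8[v >> j];  /* v >> j now fits in 8 bits           */
}
```

Compared with the 16-bit table, this costs one more narrowing step per call but keeps the table at 256 bytes.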
    

    The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:

    const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
                N_bsr64 = 0x03F79D71B4CB0A89;
    
    readonly public static sbyte[]
    bsf64 =
    {
        63,  0, 58,  1, 59, 47, 53,  2, 60, 39, 48, 27, 54, 33, 42,  3,
        61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22,  4,
        62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
        56, 45, 25, 31, 35, 16,  9, 12, 44, 24, 15,  8, 23,  7,  6,  5,
    },
    bsr64 =
    {
         0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
        54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
        46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
        25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63,
    };
    
    public static int IndexOfLSB(ulong v) =>
        v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;
    
    public static int IndexOfMSB(ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 64) - 1;
    
        v |= v >> 1; v |= v >> 2;  v |= v >> 4;   // does anybody know a better
        v |= v >> 8; v |= v >> 16; v |= v >> 32;  // way than these 12 ops?
        return bsr64[(v * N_bsr64) >> 58];
    }
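
For readers following along in C, the IndexOfMSB routine above translates roughly as follows. This sketch reuses the bsr64 table and N_bsr64 constant from the answer; the function name `index_of_msb` is hypothetical, and the early test is reduced to a zero check because the smear-and-multiply path also handles inputs with bit 63 set.

```c
#include <stdint.h>

/* Same table as bsr64 above, indexed by the top 6 bits of the product. */
static const signed char bsr64[64] = {
     0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
    54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
    46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
    25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63,
};

/* Index of the highest set bit in v, or -1 for v == 0. */
int index_of_msb(uint64_t v)
{
    if (v == 0)
        return -1;
    /* Smear the highest set bit downward so every lower bit is 1 too. */
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;
    v |= v >> 8;  v |= v >> 16; v |= v >> 32;
    /* The multiply wraps modulo 2^64; the top 6 bits select the entry. */
    return bsr64[(v * 0x03F79D71B4CB0A89ull) >> 58];
}
```

After the smear, v has the form 2^(k+1) - 1, so only 64 distinct values can reach the multiply, and the constant is chosen so that each one lands on a different table index.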
    

    There's much discussion of how superior and great deBruijn methods are at this SO question, and I had tended to agree. My speculation is that, while the deBruijn and direct lookup table methods (the latter being the fastest I found) both have to do a table lookup and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here, not the deBruijn IndexOfLSB, but I expect the latter to fare much better since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.

  • 2020-11-22 04:28

    This should be lightning fast:

    int msb(unsigned int v) {
      static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
        30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
        16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
      v |= v >> 1;
      v |= v >> 2;
      v |= v >> 4;
      v |= v >> 8;
      v |= v >> 16;
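      v = (v >> 1) + 1;   /* isolate the highest set bit; v == 0 becomes 1, so 0 returns 0 */
      /* Mask so the product wraps at 32 bits even where the operand type is wider. */
      return pos[((v * 0x077CB531U) & 0xFFFFFFFFU) >> 27];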
    }
    