Creating a mask with N least significant bits set

离开以前 2021-01-18 01:00

I would like to create a macro or function mask(n) which, given a number n, returns an unsigned integer with its n least significant bits set.

6 Answers
  • 2021-01-18 01:09

    Here's one that is portable and conditional-free:

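    #include <assert.h>   /* assert */
    #include <limits.h>   /* CHAR_BIT */

    unsigned long long mask(unsigned n)
    {
        assert (n <= sizeof(unsigned long long) * CHAR_BIT);
        /* two shifts, each at most half the width, so neither is UB */
        return (1ULL << (n/2) << (n-(n/2))) - 1;
    }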
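
    The split shift works because each of the two shift counts is at most half the width of the type, so neither shift is undefined even when n equals the full width. As a rough sketch of the same idea at 32 bits (the name mask32 is made up for this illustration and is not part of the answer):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit analog of the split-shift trick.
   n/2 and n - n/2 are both at most 16, so each shift is well defined;
   for n == 32 the two shifts together push the 1 out, leaving 0 - 1. */
uint32_t mask32(unsigned n)
{
    assert(n <= 32);
    return (((uint32_t)1 << (n / 2)) << (n - n / 2)) - 1;
}
```

    For example, mask32(5) is 31 and mask32(32) is 0xFFFFFFFF.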
    
  • 2021-01-18 01:10

    Another solution without branching

    unsigned long long mask(unsigned n)
    {
        return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
    }
    

    n & 0x3F keeps the shift amount at 63 or below in order to avoid UB. Most modern architectures use only the low bits of the shift count anyway, so the compiler does not need to emit a separate AND instruction for this.

    The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
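
    Written out in full, the saturating form might look like the sketch below. The name mask_sat is an assumption for illustration, and the cast to uint64_t before negating is just one way to make the 0 / all-ones selector explicit:

```c
#include <stdint.h>

/* Sketch of the saturating variant: all ones for every n >= 64.
   mask_sat is a made-up name, not from the answer. */
uint64_t mask_sat(unsigned n)
{
    /* -(uint64_t)(n < 64) is all ones when n < 64 (keep the shifted 1)
       and zero otherwise (discard it, so the final - 1 wraps to all ones). */
    return (((uint64_t)1 << (n & 0x3F)) & -(uint64_t)(n < 64)) - 1;
}
```

    Unlike the n != 64 form, this one is defined for any unsigned n, at the cost of no longer distinguishing 64 from larger inputs.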

    The output from Clang looks better than gcc's. As it happens, gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output.


    For n <= 64, the condition can also be simplified to n >> 6, which is 1 exactly when n = 64. We can then subtract that from the result instead of creating a mask like above:

    return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or n >= 64
    return (1ULL << (n & 0x3F)) - (n >> 6) - 1;
    

    gcc compiles the latter to

    mov     eax, 1
    shlx    rax, rax, rdi
    shr     edi, 6
    dec     rax
    sub     rax, rdi
    ret
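
    One thing worth spelling out about the subtract-based one-liners: the n >> 6 form is only correct for 0 <= n <= 64, because n >> 6 is also 1 for 65..127. A minimal sketch with that precondition made explicit (mask_sub is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: subtract-based variant, valid only for 0 <= n <= 64.
   For n == 64 the shift count wraps to 0, so the expression is
   1 - 1 - 1, which wraps around to all ones. */
uint64_t mask_sub(unsigned n)
{
    assert(n <= 64); /* for 65..127, n >> 6 is still 1 and the result would be wrong */
    return ((uint64_t)1 << (n & 0x3F)) - (n >> 6) - 1;
}
```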
    

    Some more alternatives

    return ~((~0ULL << (n - (n == 64))) << (n == 64)); // for n == 64: shift by 63, then by 1
    return ((1ULL << (n & 0x3F)) - 1) | -((uint64_t)n >> 6); // ORs in all ones when n == 64
    return (uint64_t)(((__uint128_t)1 << n) - 1); // if a 128-bit type is available
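
    The 128-bit version depends on a compiler extension, so a guarded sketch may be useful. __uint128_t is available in GCC and Clang on 64-bit targets, and __SIZEOF_INT128__ is the macro those compilers define when it exists; the fallback branch and the name mask_wide are assumptions added for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: widen to 128 bits so that a shift by 64 is in range. */
uint64_t mask_wide(unsigned n)
{
    assert(n <= 64);
#ifdef __SIZEOF_INT128__
    /* 2^64 - 1 truncates to all ones when converted back to 64 bits */
    return (uint64_t)(((__uint128_t)1 << n) - 1);
#else
    /* hypothetical fallback for compilers without a 128-bit type */
    return n == 64 ? UINT64_MAX : ((uint64_t)1 << n) - 1;
#endif
}
```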
    

    A similar question for 32 bits: Set last `n` bits in unsigned int

  • 2021-01-18 01:14

    Try

    unsigned long long mask(const unsigned n)
    {
      assert(n <= 64);
      return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL :
         (1ULL << n) - 1ULL;
    }
    

    There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.

    Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.

    The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.

    If you need a macro (because you need something that works as a compile-time constant), that might be:

    #define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)
    

    As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case.

    You could replace 64U with CHAR_BIT*sizeof(unsigned long long) (and the all-ones literal with ULLONG_MAX) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.
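
    A width-independent version along those lines could be sketched as follows. The name mask_full is made up, and the sketch assumes unsigned long long has no padding bits, so that CHAR_BIT * sizeof gives its true width:

```c
#include <assert.h>
#include <limits.h>

/* Sketch of a width-independent mask.
   ULLONG_MAX stands in for the 64-bit literal so the all-ones case
   also tracks the real width of unsigned long long. */
unsigned long long mask_full(unsigned n)
{
    const unsigned width = (unsigned)(CHAR_BIT * sizeof(unsigned long long));
    assert(n <= width);
    return (n == width) ? ULLONG_MAX : (1ULL << n) - 1ULL;
}
```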

    You could similarly generate this from an unsigned right shift of an all-ones value, but you would still need a special case (n == 0, if you shift right by 64 - n), since right-shifting by the width of the type is undefined behavior.
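
    As a sketch of the right-shift form, assuming a 64-bit unsigned long long: if the mask is built as all-ones shifted right by 64 - n, the out-of-range shift count occurs at n == 0, so that is the input the conditional has to catch here. The name mask_shr is hypothetical:

```c
#include <assert.h>

/* Sketch: build the mask by shifting an all-ones value to the right.
   n == 0 would need a shift by 64, so it is handled separately. */
unsigned long long mask_shr(unsigned n)
{
    assert(n <= 64);
    return (n == 0) ? 0ULL : 0xFFFFFFFFFFFFFFFFULL >> (64 - n);
}
```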

    ETA:

    The relevant portion of the (N1570 draft) standard, §6.5.7 paragraph 3, says of both left and right bit shifts:

    If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

    This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.

  • 2021-01-18 01:14

    This is not an answer to the exact question. It only works if 0 isn't a required output, but is more efficient.

    This computes 2^(n+1) - 1 without overflow, i.e. an integer with the low n+1 bits set, for n = 0 .. all_bits - 1.

    Possibly using this inside a ternary for cmov could be a more efficient solution to the full problem in the question. Perhaps based on a left-rotate of a number with the MSB set, instead of a left-shift of 1, to take care of the difference in counting for this vs. the question for the pow2 calculation.

    // defined for n = 0 .. sizeof(unsigned long long)*CHAR_BIT - 1
    unsigned long long setbits_upto(unsigned n) {
        unsigned long long pow2 = 1ULL << n;
        return pow2*2 - 1;                  // one more shift, and subtract 1.
    }
    

    Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake in an extra shift count so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.

    unsigned long long setbits_upto2(unsigned n) {
        unsigned long long pow2 = 2ULL << n;      // bake in the extra shift count
        return pow2 - 1;
    }
    

    The table of inputs / outputs for a 32-bit version of this function is:

     n   ->  1<<n        ->    *2 - 1
    0    ->    1         ->   1        = 2 - 1
    1    ->    2         ->   3        = 4 - 1
    2    ->    4         ->   7        = 8 - 1
    3    ->    8         ->  15        = 16 - 1
    ...
    30   ->  0x40000000  ->  0x7FFFFFFF  = 0x80000000 - 1
    31   ->  0x80000000  ->  0xFFFFFFFF  = 0 - 1
    

    You could slap a cmov after it, or other way of handling an input that has to produce zero.
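
    One possible way to bolt the zero case onto the 2ULL << n idea is sketched below. It assumes a 64-bit unsigned long long, the name mask_from_upto is made up, and whether the select actually becomes a cmov (or csel) depends on the compiler:

```c
#include <assert.h>

/* Sketch: full 0..64 range built on the "bake in the extra shift" trick.
   For n >= 1 the shift count n - 1 is in 0..63, and n == 64 shifts the
   2 out entirely, leaving 0 - 1 = all ones. */
unsigned long long mask_from_upto(unsigned n)
{
    assert(n <= 64);
    /* & 63 keeps the count in range for n == 0, where n - 1 wraps around */
    unsigned long long bits = (2ULL << ((n - 1) & 63)) - 1;
    return n ? bits : 0;  /* simple select for the input that must give 0 */
}
```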


    On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).

    xor  eax, eax
    bts  rax, rdi               ; rax = 1<<(n&63)
    lea  rax, [rax + rax - 1]   ; one more left shift, and subtract
    

    (3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)


    In C this compiles nicely for all 64-bit ISAs except x86 Intel SnB-family

    C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).

    e.g. gcc and clang both do this (with dec or add -1), on Godbolt

    # gcc9.1 -O3 -mtune=haswell
    setbits_upto(unsigned int):
        mov     ecx, edi
        mov     eax, 2       ; bake in the extra shift by 1.
        sal     rax, cl
        dec     rax
        ret
    

    MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:

    # ICC19
    setbits_upto(unsigned int):
        mov       eax, 1                                        #3.21
        mov       ecx, edi                                      #2.39
        shl       rax, cl                                       #2.39
        lea       rax, QWORD PTR [-1+rax+rax]                   #3.21
        ret                                                     #3.21
    

    With BMI2 enabled (-march=haswell), gcc and clang emit code that is optimal for AMD as well:

        mov     eax, 2
        shlx    rax, rax, rdi
        add     rax, -1
    

    ICC still uses a 3-component LEA, so if you target MSVC or ICC, use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. That avoids the worst of both worlds: a slow LEA and a variable-count shift instead of BTS.


    On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.

    e.g. AArch64. And of course this can hoist the constant 2 for reuse with different n, like x86 can with BMI2 shlx.

    setbits_upto(unsigned int):
        mov     x1, 2
        lsl     x0, x1, x0
        sub     x0, x0, #1
        ret
    

    Basically the same on PowerPC, RISC-V, etc.

  • 2021-01-18 01:18
    #include <stdint.h>
    
    uint64_t mask_n_bits(const unsigned n){
      uint64_t ret = n < 64;
      ret <<= n&63; //the &63 is typically optimized away
      ret -= 1;
      return ret;
    }
    

    Results:

    mask_n_bits:
        xor     eax, eax
        cmp     edi, 63
        setbe   al
        shlx    rax, rax, rdi
        dec     rax
        ret
    

    This returns the expected results, and when passed a constant it is folded to a constant mask by clang, gcc and icc at -O2 (but not at -Os).

    Explanation:

    The &63 gets optimized away, but ensures the shift count is <= 63.

    For values less than 64 it just sets the low n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent to pow(2,n)), and subtracting 1 from a power of 2 sets all bits below that one.

    By using the conditional to set the initial 1 to be shifted, no branch is created, yet the shifted value is 0 for all n >= 64, because left-shifting a 0 always yields 0. Subtracting 1 then gives all bits set for values of 64 and larger (unsigned subtraction wraps modulo 2^64, which is the same bit pattern as -1 in 2s complement).

    Caveats:

    • 1s complement systems must die - requires special casing if you have one
    • some compilers may not optimize the &63 away
  • 2021-01-18 01:22

    When the input N is between 1 and 64, we can use -uint64_t(1) >> (64-N & 63).
    The constant -1 has 64 set bits and we shift 64-N of them away, so we're left with N set bits.

    When N=0, we can make the constant zero before shifting:

    #include <stdint.h>

    // C++: uint64_t(...) is a functional cast
    uint64_t mask(unsigned N)
    {
        return -uint64_t(N != 0) >> ((64 - N) & 63);
    }
    

    This compiles to five instructions in x64 clang. The neg instruction sets the carry flag to N != 0 and the sbb instruction turns the carry flag into 0 or -1. The shift length 64-N & 63 was optimized to -N: the shr instruction already has an implicit shift_length & 63.

    mov rcx,rdi
    neg rcx
    sbb rax,rax
    shr rax,cl
    ret
    

    With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):

    neg edi
    sbb rax,rax
    shrx rax,rax,rdi
    ret
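
    The function above uses a C++ functional cast. For a C translation unit, the same expression might be spelled as in this sketch (mask_c is a made-up name, and the assert is an added precondition, not part of the original):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: C spelling of the right-shift expression.
   -(uint64_t)(n != 0) is 0 for n == 0 and all ones otherwise;
   (64 - n) & 63 gives a shift of 0 for both n == 64 and n == 0. */
uint64_t mask_c(unsigned n)
{
    assert(n <= 64);
    return -(uint64_t)(n != 0) >> ((64 - n) & 63);
}
```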
    