Creating a mask with N least significant bits set


Question


I would like to create a macro or function1 mask(n) which, given a number n, returns an unsigned integer with its n least significant bits set. Although this seems like it should be a basic primitive with heavily discussed implementations that compile efficiently, this doesn't seem to be the case.

Of course, various implementations may have different sizes for the primitive integral types like unsigned int, so let's assume for the sake of concreteness that we are talking about returning a uint64_t specifically, although of course an acceptable solution would work (with different definitions) for any unsigned integral type. In particular, the solution should be efficient when the type returned is equal to or smaller than the platform's native width.

Critically, this must work for all n in [0, 64]. In particular mask(0) == 0 and mask(64) == (uint64_t)-1. Many "obvious" solutions don't work for one of these two cases.
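
For example, the most obvious formulation handles every n except one of these endpoints (naive_mask is just an illustrative name):

#include <stdint.h>

// Correct for n in [0, 63], but undefined behavior for n == 64,
// because 1ULL << 64 shifts by the full width of the type.
uint64_t naive_mask(unsigned n)
{
    return (1ULL << n) - 1;   // UB when n == 64
}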

The most important criterion is correctness: only correct solutions which don't rely on undefined behavior are interesting.

The second most important criterion is performance: the idiom should ideally compile to approximately the most efficient platform-specific way to do this on common platforms.

A solution that sacrifices simplicity in the name of performance, e.g., that uses different implementations on different platforms, is fine.


1 The most general case is a function, but ideally it would also work as a macro, without re-evaluating any of its arguments more than once.


Answer 1:


Another solution without branching

unsigned long long mask(unsigned n)
{
    return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
}

n & 0x3F limits the shift amount to a maximum of 63 in order to avoid UB. In fact, most modern architectures will just grab the lower bits of the shift amount, so no and instruction is needed for this.

The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
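
For reference, a sketch of that saturating variant alongside the BMI2 intrinsic form (the function names are illustrative, not from the original answer):

#include <stdint.h>
#ifdef __BMI2__
#include <immintrin.h>
#endif

// Returns all ones for any n >= 64 instead of only handling n == 64 exactly.
static inline uint64_t mask_sat(unsigned n)
{
    return ((1ULL << (n & 0x3F)) & -(uint64_t)(n < 64)) - 1;
}

#ifdef __BMI2__
// Same behavior via BZHI: it clears everything from bit n upward,
// and an index of 64 or more leaves all bits intact.
static inline uint64_t mask_bzhi(unsigned n)
{
    return _bzhi_u64(~0ULL, (uint8_t)n);
}
#endif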

The output from Clang looks better than gcc's. As it happens, gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output.


The condition can also be simplified to n >> 6, utilizing the fact that it will be one when n = 64. And we can subtract that from the result instead of creating a mask as above:

return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or (n >= 64)
return (1ULL << (n & 0x3F)) - (n >> 6) - 1;

gcc compiles the latter to

mov     eax, 1
shlx    rax, rax, rdi
shr     edi, 6
dec     rax
sub     rax, rdi
ret

Some more alternatives

return ~((~0ULL << (n & 0x3F)) << (n == 64));
return ((1ULL << (n & 0x3F)) - 1) | (((uint64_t)n >> 6) << 63);

A similar question for 32 bits: Set last `n` bits in unsigned int




Answer 2:


Here's one that is portable and conditional-free:

#include <assert.h>
#include <limits.h>

unsigned long long mask(unsigned n)
{
    assert(n <= sizeof(unsigned long long) * CHAR_BIT);
    return (1ULL << (n/2) << (n-(n/2))) - 1;
}
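
Why this avoids undefined behavior: for n up to 64, both n/2 and n - n/2 are at most 32, so neither individual shift count ever reaches the width of the type. A minimal self-contained check of the endpoints the question cares about (the test harness is only for illustration):

#include <assert.h>
#include <limits.h>

unsigned long long mask(unsigned n)
{
    assert(n <= sizeof(unsigned long long) * CHAR_BIT);
    return (1ULL << (n / 2) << (n - n / 2)) - 1;
}

int main(void)
{
    assert(mask(0) == 0);
    assert(mask(1) == 1);
    assert(mask(64) == ~0ULL);
    return 0;
}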



Answer 3:


Try

#include <assert.h>

unsigned long long mask(const unsigned n)
{
  assert(n <= 64);
  return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL :
     (1ULL << n) - 1ULL;
}

There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.

Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.

The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.

If you need a macro (because you need something that works as a compile-time constant), that might be:

#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)
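
As an illustration (a hypothetical usage, not part of the original answer), such a macro can appear anywhere an integer constant expression is required, which a function call cannot provide in C:

#include <stdint.h>

#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)

static unsigned char lookup[mask(4) + 1];   // array size: 16 entries
enum { LOW_BITS_12 = (int)mask(12) };       // 0xFFF, fits in int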

As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case.

You could replace 64U with CHAR_BIT * sizeof(unsigned long long) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.

You could similarly generate this from an unsigned right shift (shifting the all-ones value right by 64 - n), but you would still need a special case, this time for n == 0, since right-shifting by the width of the type is undefined behavior.
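
A sketch of that right-shift formulation (mask_rshift is an illustrative name); note the special case moves to n == 0:

#include <assert.h>

unsigned long long mask_rshift(const unsigned n)
{
    assert(n <= 64);
    // Shifting the all-ones value right by 64 - n leaves n low bits set,
    // but n == 0 would mean a shift by 64, which is undefined.
    return (n == 0) ? 0ULL : 0xFFFFFFFFFFFFFFFFULL >> (64U - n);
}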

ETA:

The relevant portion of the (N1570 Draft) standard says, of both left and right bit shifts:

If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.




Answer 4:


This is not an answer to the exact question. It only works if 0 isn't a required output, but is more efficient.

2^(n+1) - 1 computed without overflow, i.e. an integer with the low n+1 bits set, for n = 0 .. all_bits-1.

Possibly using this inside a ternary (to get a cmov) could be a more efficient solution to the full problem in the question, perhaps based on a left-rotate of a number with the MSB set, instead of a left-shift of 1, to take care of the difference in counting between this function and the question's pow2 calculation.

// defined for n = 0 .. sizeof(unsigned long long)*CHAR_BIT - 1
unsigned long long setbits_upto(unsigned n) {
    unsigned long long pow2 = 1ULL << n;
    return pow2*2 - 1;                  // one more shift, and subtract 1.
}

Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake the extra shift into the shift count, so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.

unsigned long long setbits_upto2(unsigned n) {
    unsigned long long pow2 = 2ULL << n;      // bake in the extra shift count
    return pow2 - 1;
}

The table of inputs / outputs for a 32-bit version of this function is:

 n   ->  1<<n        ->    *2 - 1
0    ->    1         ->   1        = 2 - 1
1    ->    2         ->   3        = 4 - 1
2    ->    4         ->   7        = 8 - 1
3    ->    8         ->  15        = 16 - 1
...
30   ->  0x40000000  ->  0x7FFFFFFF  = 0x80000000 - 1
31   ->  0x80000000  ->  0xFFFFFFFF  = 0 - 1

You could slap a cmov after it, or some other way of handling an input that has to produce zero.
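
For instance (an illustrative sketch, not from the original answer), shifting 2 left by n - 1 and keeping a ternary only for the n == 0 input gives the question's exact mask(n) semantics, and the compiler is free to turn the ternary into a cmov:

#include <stdint.h>

// For n in [1, 64] the shift count n - 1 stays in [0, 63], so there is no UB;
// n == 0 is the one input that needs the conditional.
static inline uint64_t mask_via_2shift(unsigned n)
{
    return n ? (2ULL << (n - 1)) - 1 : 0;
}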


On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).

xor  eax, eax
bts  rax, rdi               ; rax = 1<<(n&63)
lea  rax, [rax + rax - 1]   ; one more left shift, and subtract

(3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)


In C this compiles nicely for all 64-bit ISAs except x86 Intel SnB-family.

C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).

e.g. gcc and clang both do this (with dec or add -1), on Godbolt:

# gcc9.1 -O3 -mtune=haswell
setbits_upto(unsigned int):
    mov     ecx, edi
    mov     eax, 2       ; bake in the extra shift by 1.
    sal     rax, cl
    dec     rax
    ret

MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:

# ICC19
setbits_upto(unsigned int):
    mov       eax, 1                                        #3.21
    mov       ecx, edi                                      #2.39
    shl       rax, cl                                       #2.39
    lea       rax, QWORD PTR [-1+rax+rax]                   #3.21
    ret                                                     #3.21

With BMI2, we get optimal-for-AMD code from gcc/clang with -march=haswell:

    mov     eax, 2
    shlx    rax, rax, rdi
    add     rax, -1

ICC still uses a 3-component LEA, so if you target MSVC or ICC, use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. This avoids the worst of both worlds: slow LEA and a variable-count shift instead of BTS.


On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.

e.g. AArch64. And of course the constant 2 can be hoisted for reuse with different n, like x86 can with BMI2 shlx:

setbits_upto(unsigned int):
    mov     x1, 2
    lsl     x0, x1, x0
    sub     x0, x0, #1
    ret

Basically the same on PowerPC, RISC-V, etc.




Answer 5:


#include <stdint.h>

uint64_t mask_n_bits(const unsigned n){
  uint64_t ret = n < 64;
  ret <<= n&63; //the &63 is typically optimized away
  ret -= 1;
  return ret;
}

Results:

mask_n_bits:
    xor     eax, eax
    cmp     edi, 63
    setbe   al
    shlx    rax, rax, rdi
    dec     rax
    ret

Returns expected results, and if passed a constant value it will be optimized to a constant mask in clang and gcc (as well as icc) at -O2 (but not -Os).

Explanation:

The &63 gets optimized away, but it guarantees the shift amount stays below 64.

For values less than 64 it just sets the first n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent to pow(2,n)), and subtracting 1 from a power of 2 sets all bits below it.

By using the conditional to set the initial 1 to be shifted, no branch is created, yet it gives you a 0 for all values >= 64, because left-shifting a 0 will always yield 0. Therefore when we subtract 1, we get all bits set for values of 64 and larger (because of the two's complement representation of -1).

Caveats:

  • 1s complement systems must die - requires special casing if you have one
  • some compilers may not optimize the &63 away



Answer 6:


When the input N is between 1 and 64, we can use -(uint64_t)1 >> ((64 - N) & 63): the constant -1 has 64 set bits, and we shift 64 - N of them away, so we're left with N set bits.

When N=0, we can make the constant zero before shifting:

#include <stdint.h>

uint64_t mask(unsigned N)
{
    return -(uint64_t)(N != 0) >> ((64 - N) & 63);
}

This compiles to five instructions in x64 clang. The neg instruction sets the carry flag to N != 0, and the sbb instruction turns the carry flag into 0 or -1. The shift length (64 - N) & 63 was optimized to -N: the shr instruction already has an implicit shift_length & 63.

mov rcx,rdi
neg rcx
sbb rax,rax
shr rax,cl
ret

With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):

neg edi
sbb rax,rax
shrx rax,rax,rdi
ret


Source: https://stackoverflow.com/questions/52573447/creating-a-mask-with-n-least-significant-bits-set
