I would like to create a macro or function mask(n) which, given a number n, returns an unsigned integer with its n least significant bits set.
Here's one that is portable and conditional-free:
#include <assert.h>
#include <limits.h>

unsigned long long mask(unsigned n)
{
    assert(n <= sizeof(unsigned long long) * CHAR_BIT);
    /* Split the shift in two: each part is at most half the type width,
       so neither shift is by the full width even when n == 64. */
    return (1ULL << (n/2) << (n - n/2)) - 1;
}
Another solution without branching:

unsigned long long mask(unsigned n)
{
    /* -(n != 64) is all ones when n != 64 and zero when n == 64,
       so the subtraction still yields all bits set for n == 64. */
    return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
}
n & 0x3F keeps the shift amount to a maximum of 63 in order to avoid UB. In fact, most modern architectures will just grab the lower bits of the shift amount, so no and instruction is needed for this.
The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
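For what it's worth, here is roughly what an intrinsic-based version might look like (my sketch; _bzhi_u64 is the BMI2 intrinsic from immintrin.h, and the fallback branch is my own):

#include <stdint.h>
#ifdef __BMI2__           /* GCC/Clang define this when BMI2 is enabled */
#include <immintrin.h>
#endif

uint64_t mask_bzhi(unsigned n)  /* hypothetical helper name */
{
#ifdef __BMI2__
    /* BZHI zeroes the bits above index n; for n in 64..255 no bits are
       zeroed, so the result is all ones, matching -(n < 64) above. */
    return _bzhi_u64(~0ULL, n);
#else
    return n < 64 ? (1ULL << n) - 1 : ~0ULL;
#endif
}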
The output from Clang looks better than gcc's. As it happens, gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output.
The condition can also be simplified to n >> 6, utilizing the fact that it'll be one if n = 64. And we can subtract that from the result instead of creating a mask like above:
return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or n >= 64
return (1ULL << (n & 0x3F)) - (n >> 6) - 1;
gcc compiles the latter to
mov eax, 1
shlx rax, rax, rdi
shr edi, 6
dec rax
sub rax, rdi
ret
Some more alternatives:

return ~((~0ULL << (n & 0x3F)) & -(uint64_t)(n != 64));
return ((1ULL << (n & 0x3F)) - 1) | -((uint64_t)n >> 6);
return (uint64_t)(((__uint128_t)1 << n) - 1); // if a 128-bit type is available
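A quick way to sanity-check any of these variants (a harness of my own, not part of the answer) is to compare against a naive loop-built reference for every n from 0 to 64:

#include <assert.h>
#include <stdio.h>

/* Naive reference: build the mask bit by bit. */
static unsigned long long mask_ref(unsigned n)
{
    unsigned long long m = 0;
    for (unsigned i = 0; i < n; i++)
        m |= 1ULL << i;
    return m;
}

int main(void)
{
    for (unsigned n = 0; n <= 64; n++)
        assert(mask(n) == mask_ref(n));  /* mask() = whichever variant is under test */
    puts("ok");
    return 0;
}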
A similar question for 32 bits: Set last `n` bits in unsigned int
Try

#include <assert.h>

unsigned long long mask(const unsigned n)
{
    assert(n <= 64);
    return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL
                     : (1ULL << n) - 1ULL;
}
There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.
Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.
The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.
If you need a macro (because you need something that works as a compile-time constant), that might be:
#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)
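For example (my illustration, assuming the mask() macro above and C11 for _Static_assert), it can then be used anywhere an integer constant expression is required:

_Static_assert(mask(3) == 0x7, "three low bits set");
enum { LOW_BYTE = mask(8) };      /* usable as an enumeration constant */
static char buffer[mask(4) + 1];  /* usable as an array size: 16 bytes */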
As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case. You could replace 64U with CHAR_BIT * sizeof(unsigned long long) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.
You could similarly generate this from an unsigned right shift (shifting an all-ones value right by 64 - n), but you would still need to check a special case (n == 0, in that formulation), since right-shifting by the width of the type is also undefined behavior.
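A sketch of what that right-shift variant might look like (my formulation, not the answer's):

/* Hypothetical helper: here the case that must be guarded is n == 0,
   because 64 - 0 would shift by the full width of the type. */
unsigned long long mask_rshift(unsigned n)
{
    return (n == 0) ? 0ULL : 0xFFFFFFFFFFFFFFFFULL >> (64U - n);
}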
The relevant portion of the (N1570 Draft) standard says, of both left and right bit shifts:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.
This assumes an all-zero mask isn't a required output, but is more efficient: it computes 2^(n+1) - 1 without overflow, i.e. an integer with the low bits set up through bit n, for n = 0 .. all_bits - 1.
Possibly using this inside a ternary for cmov could be a more efficient solution to the full problem in the question, perhaps based on a left-rotate of a number with the MSB set instead of a left-shift of 1, to take care of the difference in counting for this vs. the question for the pow2 calculation.
// defined for n = 0 .. sizeof(unsigned long long)*CHAR_BIT - 1 (a shift by the full width would be UB)
unsigned long long setbits_upto(unsigned n) {
    unsigned long long pow2 = 1ULL << n;
    return pow2 * 2 - 1;  // one more shift, and subtract 1.
}
Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake in an extra shift count so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.
unsigned long long setbits_upto2(unsigned n) {
    unsigned long long pow2 = 2ULL << n;  // bake in the extra shift count
    return pow2 - 1;
}
The table of inputs / outputs for a 32-bit version of this function is:
n -> 1<<n -> *2 - 1
0 -> 1 -> 1 = 2 - 1
1 -> 2 -> 3 = 4 - 1
2 -> 4 -> 7 = 8 - 1
3 -> 8 -> 15 = 16 - 1
...
30 -> 0x40000000 -> 0x7FFFFFFF = 0x80000000 - 1
31 -> 0x80000000 -> 0xFFFFFFFF = 0 - 1
You could slap a cmov after it, or use some other way of handling an input that has to produce zero.
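For instance (my sketch, not from the answer), a wrapper that maps the question's counting onto setbits_upto and hopefully compiles to cmov/csel rather than a branch:

unsigned long long mask(unsigned n) {
    return n ? setbits_upto(n - 1) : 0;  /* n == 0 handled explicitly */
}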
On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).
xor eax, eax
bts rax, rdi ; rax = 1<<(n&63)
lea rax, [rax + rax - 1] ; one more left shift, and subtract
(3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)
C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).
e.g. gcc and clang both do this (with dec or add -1), on Godbolt
# gcc9.1 -O3 -mtune=haswell
setbits_upto(unsigned int):
mov ecx, edi
mov eax, 2 ; bake in the extra shift by 1.
sal rax, cl
dec rax
ret
MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:
# ICC19
setbits_upto(unsigned int):
mov eax, 1 #3.21
mov ecx, edi #2.39
shl rax, cl #2.39
lea rax, QWORD PTR [-1+rax+rax] #3.21
ret #3.21
With BMI2 (-march=haswell), we get optimal-for-AMD code from gcc/clang:
mov eax, 2
shlx rax, rax, rdi
add rax, -1
ICC still uses a 3-component LEA, so if you target MSVC or ICC, use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. And this avoids the worst of both worlds: slow LEA and a variable-count shift instead of BTS.
On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.
e.g. AArch64. And of course this can hoist the constant 2 for reuse with different n, like x86 can with BMI2 shlx:
setbits_upto(unsigned int):
mov x1, 2
lsl x0, x1, x0
sub x0, x0, #1
ret
Basically the same on PowerPC, RISC-V, etc.
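As an illustration of that hoisting (my own example, not from the answer), in a loop the constant 2 can stay in a register and only the variable-count shift runs per element:

#include <stddef.h>

/* Assumes every counts[i] is at most 63 so the shift stays in range. */
unsigned long long sum_masks(const unsigned *counts, size_t len) {
    unsigned long long total = 0;
    for (size_t i = 0; i < len; i++)
        total += (2ULL << counts[i]) - 1;  /* the 2 is loop-invariant */
    return total;
}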
#include <stdint.h>

uint64_t mask_n_bits(const unsigned n){
    uint64_t ret = n < 64;
    ret <<= n & 63;  // the &63 is typically optimized away
    ret -= 1;
    return ret;
}
Results:
mask_n_bits:
xor eax, eax
cmp edi, 63
setbe al
shlx rax, rax, rdi
dec rax
ret
It returns the expected results, and if passed a constant value it will be optimized to a constant mask in clang and gcc, as well as icc, at -O2 (but not -Os).
Explanation:
The &63 gets optimized away, but ensures the shift amount stays at most 63, avoiding UB.
For values less than 64 it just sets the first n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent to pow(2,n)), and subtracting 1 from a power of 2 sets all bits below it.
By using the conditional to set the initial 1 to be shifted, no branch is created, yet it gives you a 0 for all values >= 64 because left-shifting a 0 always yields 0. Therefore, when we subtract 1, we get all bits set for values of 64 and larger (unsigned subtraction wraps, so 0 - 1 is the two's-complement pattern of -1, i.e. all ones).
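A worked trace (my illustration) of mask_n_bits for a few inputs:

/* n = 5   : ret = 1; 1 << 5  = 0x20;  0x20 - 1 = 0x1F
   n = 63  : ret = 1; 1 << 63;         - 1      = 0x7FFFFFFFFFFFFFFF
   n = 64  : ret = 0; 0 << 0  = 0;     0 - 1    = 0xFFFFFFFFFFFFFFFF (wraps)
   n = 100 : ret = 0; 0 << 36 = 0;     0 - 1    = 0xFFFFFFFFFFFFFFFF (wraps) */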
When the input N is between 1 and 64, we can use -uint64_t(1) >> (64-N & 63). The constant -1 has 64 set bits and we shift 64-N of them away, so we're left with N set bits.
When N=0, we can make the constant zero before shifting:

#include <cstdint>

uint64_t mask(unsigned N)
{
    return -uint64_t(N != 0) >> (64-N & 63);
}
This compiles to five instructions in x64 clang. The neg instruction sets the carry flag to N != 0, and the sbb instruction turns the carry flag into 0 or -1. The shift length 64-N & 63 was optimized to -N: the shr instruction already has an implicit shift_length & 63.
mov rcx,rdi
neg rcx
sbb rax,rax
shr rax,cl
ret
With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):
neg edi
sbb rax,rax
shrx rax,rax,rdi
ret