Fastest way to produce a mask with n ones starting at position i

Question

What is the fastest way (in terms of CPU cycles on common modern architectures) to produce a mask with len bits set to 1 starting at position pos?

template <class UIntType>
constexpr UIntType make_mask(std::size_t pos, std::size_t len)
{
    // Body of the function
}

// Call of the function
auto mask = make_mask<uint32_t>(4, 10);
// mask = 00000000 00000000 00111111 11110000 
// (in binary with MSB on the left and LSB on the right)

Also, are there any compiler intrinsics or BMI instructions that can help?

Answer 1:

If by "starting at pos", you mean that the lowest-order bit of the mask is at the position corresponding with 2^pos (as in your example):

((UIntType(1) << len) - UIntType(1)) << pos

If it is possible that len is ≥ the number of bits in UIntType, avoid Undefined Behaviour with a test:

(((len < std::numeric_limits<UIntType>::digits)
     ? UIntType(1)<<len
     : 0) - UIntType(1)) << pos

(If it is also possible that pos is ≥ std::numeric_limits<UIntType>::digits, you'll need another ternary op test.)
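As a rough sketch of what both tests look like together, assuming an unsigned UIntType: the helper names low_ones and make_mask_checked are made up for illustration, and the all-ones case is spelled out explicitly instead of relying on 0 - 1 wrapping, so that types narrower than int also behave.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: the low `len` bits set, with len >= width giving all ones.
template <class UIntType>
constexpr UIntType low_ones(std::size_t len)
{
    return (len < std::numeric_limits<UIntType>::digits)
        ? static_cast<UIntType>((UIntType(1) << len) - 1)
        : static_cast<UIntType>(~UIntType(0));
}

// Hypothetical helper: guards pos as well, so neither shift count
// can reach the bit width of UIntType. pos >= width yields 0.
template <class UIntType>
constexpr UIntType make_mask_checked(std::size_t pos, std::size_t len)
{
    return (pos < std::numeric_limits<UIntType>::digits)
        ? static_cast<UIntType>(low_ones<UIntType>(len) << pos)
        : UIntType(0);
}
```

Bits shifted past the top of the type are simply dropped, so make_mask_checked<std::uint8_t>(4, 8) keeps only the upper four bits.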

You could also use:

((UIntType(1)<<(len>>1)<<((len+1)>>1)) - UIntType(1)) << pos

which avoids the ternary op at the cost of three extra shift operators; I doubt whether it would be faster but careful benchmarking would be necessary to know for sure.
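Wrapped in a function, the split-shift idea might look like the sketch below. The name make_mask_split is made up, and I'm assuming UIntType is unsigned and at least as wide as unsigned int, with len in 0..width and pos in 0..width-1. Note the parentheses around the shifted 1: binary - binds tighter than <<, so the subtraction has to be grouped explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: splits the shift by len into two shifts of about
// len/2 each, so no single shift count reaches the type width.
template <class UIntType>
constexpr UIntType make_mask_split(std::size_t pos, std::size_t len)
{
    static_assert(std::is_unsigned<UIntType>::value &&
                  sizeof(UIntType) >= sizeof(unsigned),
                  "sketch assumes an unsigned type at least as wide as unsigned int");
    // For len == width the two shifts push the 1 out of the type, leaving 0;
    // 0 - 1 then wraps to all ones.
    return ((UIntType(1) << (len >> 1) << ((len + 1) >> 1)) - UIntType(1)) << pos;
}
```

The two half-shifts add up to len for both even and odd values, since (len >> 1) + ((len + 1) >> 1) == len.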

Answer 2:

Fastest way? I'd use something like this:

template <class T>
constexpr T make_mask(std::size_t pos, std::size_t len)
{
  return ((static_cast<T>(1) << len)-1) << pos;
}

Answer 3:

Maybe using a table? For type uint32_t you can write:

static uint32_t masks[] = { 0x0, 0x1, 0x3, 0x7, 0xf, 0x1f, 0x3f...}; // only 32 such masks
return masks[len] << pos;

Whatever the integer type, the number of masks is small, and the table can easily be generated with templates.
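One possible way to generate such a table at compile time is sketched here, assuming C++14 constexpr. LowOnesTable, make_low_ones_table and make_mask_table are made-up names. I'm giving the table one entry per length from 0 up to the full width, and the lookup assumes len <= width and pos < width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical table type: value[len] holds the low `len` bits set.
template <class T>
struct LowOnesTable {
    T value[std::numeric_limits<T>::digits + 1];
};

// Builds {0x0, 0x1, 0x3, 0x7, ...} up to all ones, at compile time.
template <class T>
constexpr LowOnesTable<T> make_low_ones_table()
{
    LowOnesTable<T> t{};
    T ones = 0;
    for (std::size_t len = 0; len <= std::numeric_limits<T>::digits; ++len) {
        t.value[len] = ones;
        ones = static_cast<T>((ones << 1) | 1);  // append one more set bit
    }
    return t;
}

// Table lookup followed by the shift into position.
template <class T>
T make_mask_table(std::size_t pos, std::size_t len)
{
    static constexpr LowOnesTable<T> table = make_low_ones_table<T>();
    return static_cast<T>(table.value[len] << pos);
}
```

Whether a load from the table beats a couple of ALU instructions would need measuring; the table only removes the special case for len equal to the type width.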

For BMI, maybe use BZHI (a BMI2 instruction)? Starting from all bits set, BZHI with index len zeroes every bit from position len upward, leaving the low len bits set; then shift left by pos.
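A sketch of the BZHI route using the _bzhi_u32 intrinsic from <immintrin.h>. BZHI clears the bits at and above its index operand, so the index passed is len itself, and an index of 32 leaves the source unchanged. The wrapper name make_mask_bzhi is made up; I'm assuming gcc or clang, where __BMI2__ is defined when BMI2 is enabled (e.g. with -mbmi2), and falling back to plain C++ otherwise. Assumes len <= 32 and pos <= 31.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

// Hypothetical wrapper around BZHI with a portable fallback.
inline std::uint32_t make_mask_bzhi(unsigned pos, unsigned len)
{
#if defined(__BMI2__)
    // Keep the low `len` bits of an all-ones value.
    std::uint32_t low = _bzhi_u32(~0u, len);
#else
    // Fallback when BMI2 is not enabled at compile time.
    std::uint32_t low = (len < 32) ? ((std::uint32_t(1) << len) - 1)
                                   : ~std::uint32_t(0);
#endif
    return low << pos;
}
```

A binary built with BMI2 enabled will fault on CPUs without it, so this only makes sense when the target hardware is known.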

Answer 4:

Speed is irrelevant when the arguments are compile-time constants, as in the example call: the optimizer precomputes the expression and in all likelihood uses the result as an immediate operand. Whatever you use, it will cost you 0 cycles.
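To illustrate the constant case, here is a small sketch using the simple shift-and-subtract form (valid for len and pos below the type width). kMask and extract_field are made-up names. The static_assert shows the mask is already known at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

template <class T>
constexpr T make_mask(std::size_t pos, std::size_t len)
{
    return ((static_cast<T>(1) << len) - 1) << pos;
}

// Evaluated entirely by the compiler; no shift happens at run time.
constexpr std::uint32_t kMask = make_mask<std::uint32_t>(4, 10);
static_assert(kMask == 0x3FF0u, "mask is a compile-time constant");

// Hypothetical use: the mask typically ends up as an immediate of the AND.
inline std::uint32_t extract_field(std::uint32_t x)
{
    return x & kMask;
}
```

None of this applies when pos or len are only known at run time, which is the case the other answers address.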

Answer 5:

The biggest issue here is the range of possible inputs. In C and C++, a shift by a count greater than or equal to the type width is Undefined Behaviour. However, it looks like len can meaningfully range from 0 to the type width, e.g. 33 different lengths for uint32_t. With pos=0, we get masks from 0 to 0xFFFFFFFF. (I'm just going to assume 32-bit in English and asm for clarity, but use generic C++.)

If we can exclude either end of that range as possible inputs, then there are only 32 possible lengths, and we can use a left or right shift as a building block. (Use an assert() to verify the input range in debug builds.)
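For example, a left-shift version with the range checks spelled out could look like this sketch. The name make_mask_asserted is made up, and this variant excludes len equal to the type width. The asserts vanish when NDEBUG is defined, so release builds pay nothing for them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: left-shift building block, valid for
// len = 0..bits-1 and pos = 0..bits-1, checked in debug builds only.
template <class T>
T make_mask_asserted(std::size_t pos, std::size_t len)
{
    constexpr std::size_t bits = std::numeric_limits<T>::digits;
    assert(len < bits && "len must be in 0..bits-1");
    assert(pos < bits && "pos must be in 0..bits-1");
    return static_cast<T>(((T(1) << len) - 1) << pos);
}
```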

I put several versions (from other answers) of the function on the Godbolt compiler explorer with some macros to compile them with constant len, constant pos, or both inputs variable. Some do better than others. KIIV's looks good for the range it's valid for (len=0..31, pos=0..31).

This version works for len=1..32, and pos=0..31. It generates slightly worse x86-64 asm than KIIV's, so use KIIV's if it works without extra checks.

// right-shift a register of all-ones, then shift it into position.
// works for len=1..32 and pos=0..31
template <class T>
constexpr T make_mask_PJC(std::size_t pos, std::size_t len)
{
//  T all_ones = -1LL;
//  unsigned typebits = sizeof(T)*CHAR_BIT;  // std::numeric_limits<T>::digits
//  T len_ones = all_ones >> (typebits - len);
//  return len_ones << pos;

  static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");
  return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
}

// Same idea, but mask the shift count the same way x86 shift instructions do, so the compiler can do it for free.
// Doesn't always compile to ideal code with SHRX (BMI2), maybe gcc only knows about letting the shift instruction do the masking for the older SHR / SHL instructions
uint32_t make_mask_PJC_noUB(std::size_t pos, std::size_t len)
{
  using T=uint32_t;

  static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");

  T all_ones = -1LL;
  unsigned typebits = std::numeric_limits<T>::digits;
  T len_ones = all_ones >> ( (typebits - len) & (typebits-1));     // the AND optimizes away
  return len_ones << (pos & (typebits-1));

//  return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
}

If len can be anything in [0..32], I don't have any great ideas for efficient branchless code. Perhaps branching is the way to go.

uint32_t make_mask_fullrange(std::size_t pos, std::size_t len)
{
  using T=uint32_t;

  static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");

  T all_ones = -1LL;
  unsigned typebits = std::numeric_limits<T>::digits;
  //T len_ones = all_ones >> ( (typebits - len) & (typebits-1));
  T len_ones = len==0 ? 0 : all_ones >> ( (typebits - len) & (typebits-1));
  return len_ones << (pos & (typebits-1));

//  return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
}
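One more option that is not part of the versions above, offered only as a sketch: for a 32-bit result, the shift can be done in a 64-bit type, where a count of 32 is well defined, and the result truncated afterwards. The name make_mask_widen is made up. It assumes len <= 32 and pos <= 31, and it does not extend to uint64_t without a wider integer type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: branchless for len = 0..32 by widening to 64 bits.
inline std::uint32_t make_mask_widen(std::size_t pos, std::size_t len)
{
    // In 64 bits, 1 << 32 is well defined, so len == 32 gives 0xFFFFFFFF.
    std::uint64_t low = (std::uint64_t(1) << len) - 1;
    // Bits shifted above bit 31 are discarded by the narrowing cast.
    return static_cast<std::uint32_t>(low << pos);
}
```

Whether this beats the branch would need to be checked in the compiler output for the target in question.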

Source: https://stackoverflow.com/questions/39321580/fastest-way-to-produce-a-mask-with-n-ones-starting-at-position-i

Tags

c++

optimization

bit-manipulation

bitmask