Is there any way to write “mod 31” without modulus/division operators?

Question

Getting the modulus of a number can easily be done without the modulus operator or division if the divisor is a power of 2. In that case, the following formula holds for non-negative x: x % y == (x & (y - 1)). This is often more performant on many architectures. Can the same be done for mod 31?

int mod31(int a){ return a % 31; }
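For reference, the power-of-two identity from the question can be sketched as follows. The name mod_pow2 is made up for illustration, and the sketch assumes unsigned operands with y a nonzero power of two.

```c
#include <assert.h>

/* Hypothetical helper: x mod y for unsigned x, valid only when y is a
   nonzero power of two, because y - 1 is then a mask of the low bits. */
unsigned mod_pow2(unsigned x, unsigned y) {
    assert(y != 0 && (y & (y - 1u)) == 0); /* y must be a power of two */
    return x & (y - 1u);
}
```

For example, mod_pow2(37, 8) gives 5, the same as 37 % 8. The trick does not carry over to 31 directly, because 31 - 1 = 30 is not a mask of contiguous low bits.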

Answer 1:

Here are two ways to approach this problem. The first uses a common bit-twiddling technique and, if carefully optimized, can beat hardware division. The other substitutes a multiply for the divide, similar to the optimization performed by gcc, and is far and away the fastest. The bottom line is that there's not much point trying to avoid the % operator if the second argument is constant, because gcc's got it covered. (And probably other compilers, too.)

The following function is based on the fact that x is the same (mod 31) as the sum of the base-32 digits of x. That's true because 32 is 1 mod 31, and consequently any power of 32 is 1 mod 31. So each "digit" position in a base-32 number contributes the digit * 1 to the mod 31 sum. And it's easy to get the base-32 representation: we just take the bits five at a time.
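To make the digit-sum argument concrete, here is the congruence written out, followed by a worked example for x = 1000. The symbols d_i for the base-32 digits are notation introduced here, not part of the original answer.

```latex
x = \sum_{i \ge 0} d_i \, 32^i, \quad 0 \le d_i \le 31,
\qquad 32 \equiv 1 \pmod{31}
\;\Longrightarrow\;
x \equiv \sum_{i \ge 0} d_i \pmod{31}

1000 = 31 \cdot 32 + 8 \;\Rightarrow\; 31 + 8 = 39, \qquad
39 = 1 \cdot 32 + 7 \;\Rightarrow\; 1 + 7 = 8, \qquad
1000 \bmod 31 = 8
```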

(Like the rest of the functions in this answer, it will only work for non-negative x).

unsigned mod31(unsigned x) {
  unsigned tmp;
  for (tmp = 0; x; x >>= 5) {
    tmp += x & 31;
  }
  // Here we assume that there are at most 160 bits in x
  tmp = (tmp >> 5) + (tmp & 31);
  return tmp >= 31 ? tmp - 31 : tmp;
}

For a specific integer size, you could unroll the loop and quite possibly beat division. (And see @chux's answer for a way to convert the loop into O(log bits) operations instead of O(bits).) It's more difficult to beat gcc, which avoids division when the divisor is a constant known at compile-time.

In a very quick benchmark using unsigned 32 bit integers, the naive unrolled loop took 19 seconds and a version based on @chux's answer took only 13 seconds, but gcc's x%31 took 9.7 seconds. Forcing gcc to use a hardware divide (by making the division non-constant) took 23.4 seconds, and the code as shown above took 25.6 seconds. Those figures should be taken with several grains of salt. The times are for computing i%31 for all possible values of i, on my laptop using -O3 -march=native.

gcc avoids 32-bit division by a constant by replacing it with what is essentially a 64-bit multiplication by the inverse of the constant followed by a right shift. (The actual algorithm does a bit more work to avoid overflows.) The procedure was implemented more than 20 years ago in gcc v2.6, and the paper which describes the algorithm is available on the gmp site. (GMP also uses this trick.)

Here's a simplified version: Say we want to compute n // 31 for some unsigned 32-bit integer n (using the pythonic // to indicate truncated integer division). We use the "magic constant" m = 2³² // 31, which is 138547332. Now it's clear that for any n:

m * n <= 2³² * n/31 < m * n + n ⇒ m * n // 2³² <= n//31 <= (m * n + n) // 2³²

(Here we make use of the fact that if a < b then floor(a) <= floor(b).)

Furthermore, since n < 2³², m * n // 2³² and (m * n + n) // 2³² are either the same integer or two consecutive integers. Consequently, one (or both) of those two is the actual value of n//31.

Now, we really want to compute n%31. So we need to multiply the (presumed) quotient by 31, and subtract that from n. If we use the smaller of the two possible quotients, it may turn out that the computed modulo value is too big, but it can only be too big by 31.

Or, to put it in code:

static unsigned long long magic = 138547332;
unsigned mod31g(unsigned x) {
  unsigned q = (x * magic) >> 32;
  // To multiply by 31, we multiply by 32 and subtract
  unsigned mod = x - ((q << 5) - q);
  return mod < 31 ? mod : mod - 31;
}

The actual algorithm used by gcc avoids the test at the end by using a slightly more accurate computation based on multiplying by 2³⁷//31 + 1. That always produces the correct quotient, but at the cost of some extra shifts and adds to avoid integer overflow. As it turns out, the version above is slightly faster -- in the same benchmark as above, it took only 6.3 seconds.
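As a rough sketch of the "more accurate" variant: the multiplier 2³⁷//31 + 1 = 4433514629 does not fit in 32 bits, so the code below splits off its 2³² part and performs the addition in 64 bits, rather than using the shift-and-add sequence a compiler would emit. The function name mod31_exact and this particular way of avoiding overflow are assumptions made for illustration, not gcc's actual output.

```c
#include <stdint.h>

/* Sketch: exact n % 31 for 32-bit n, using M = 2^37 / 31 + 1 = 4433514629.
   Since M = 2^32 + 138547333, floor(n * M / 2^32) = n + floor(n * 138547333 / 2^32). */
uint32_t mod31_exact(uint32_t n) {
    uint64_t t = ((uint64_t)n * UINT64_C(138547333)) >> 32; /* high part of the product */
    uint32_t q = (uint32_t)(((uint64_t)n + t) >> 5);        /* q = n / 31 */
    return n - 31u * q;                                     /* remainder, no final test */
}
```

Unlike mod31g above, this version needs no comparison at the end, at the cost of a 64-bit addition.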

Other benchmarked functions, for completeness:

Naive unrolled loop

unsigned mod31b(unsigned x) {
  unsigned tmp = x & 31; x >>= 5;
  tmp += x & 31; x >>= 5;
  tmp += x & 31; x >>= 5;
  tmp += x & 31; x >>= 5;
  tmp += x & 31; x >>= 5;
  tmp += x & 31; x >>= 5;
  tmp += x & 31;

  tmp = (tmp >> 5) + (tmp & 31);
  return tmp >= 31 ? tmp - 31 : tmp;
}

@chux's improvement, slightly optimized

static const unsigned mask1 = (31U << 0) | (31U << 10) | (31U << 20) | (31U << 30);
static const unsigned mask2 = (31U << 5) | (31U << 15) | (31U << 25);
unsigned mod31c(unsigned x) {
  x = (x & mask1) + ((x & mask2) >> 5);
  x += x >> 20;
  x += x >> 10;

  x = (x & 31) + ((x >> 5) & 31);
  return x >= 31 ? x - 31: x;
}

Answer 2:

See [Edit2] below for performance notes.

An attempt with only 1 if condition.

This approach is O(log2(sizeof unsigned)). Should the code use uint64_t, run time would increase by one more set of ands/shifts/adds, rather than doubling as it would with a loop approach.

#include <stdint.h>

unsigned mod31(uint32_t x) {
  #define m31 (31lu)
  #define m3131 ((m31 << 5) | m31)
  #define m31313131 ((m3131 << 10) | m3131)

  static const uint32_t mask1 = (m31 << 0) | (m31 << 10) | (m31 << 20) | (m31 << 30);
  static const uint32_t mask2 = (m31 << 5) | (m31 << 15) | (m31 << 25);
  uint32_t a = x & mask1;
  uint32_t b = x & mask2;
  x = a + (b >> 5);
  // x = xx 0000x xxxxx 0000x xxxxx 0000x xxxxx

  a = x & m31313131;
  b = x & (m31313131 << 20);
  x = a + (b >> 20);
  // x = 00 00000 00000 000xx xxxxx 000xx xxxxx

  a = x & m3131;
  b = x & (m3131 << 10);
  x = a + (b >> 10);
  // x = 00 00000 00000 00000 00000 00xxx xxxxx

  a = x & m31;
  b = x & (m31 << 5);
  x = a + (b >> 5);
  // x = 00 00000 00000 00000 00000 0000x xxxxx

  return x >= 31 ? x-31 : x;
}

[Edit]

The first addition sums adjacent pairs among the 7 groups of five bits in parallel, bringing the 7 groups down to 4. Subsequent additions bring the 4 groups down to 2, then 1. This final sum, which can be up to 8 bits wide, then has its upper part (3 bits) added to its lower part (5 bits). The code then uses one test to perform the final "mod".

This method scales to wider unsigned types, up to at least a hypothetical uint165_t, since log2(31+1)*(31+2) = 165. Past that, a little more code is needed.
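As an illustration of that scaling, a 64-bit variant might look like the following. The function name mod31_u64 and the mask names are made up here. The sketch follows the same pairwise-sum idea but is not taken from the original answer.

```c
#include <stdint.h>

/* Sketch: x mod 31 for a 64-bit x, which has 13 base-32 digits
   (twelve 5-bit digits plus one 4-bit digit at the top). */
unsigned mod31_u64(uint64_t x) {
    const uint64_t d = 31;    /* mask for one 5-bit digit */
    const uint64_t w = 1023;  /* mask for one 10-bit field */
    const uint64_t m1 = d | (d << 10) | (d << 20) | (d << 30)
                          | (d << 40) | (d << 50) | (d << 60);
    const uint64_t m2 = w | (w << 20) | (w << 40) | (w << 60);
    const uint64_t m3 = w | (w << 40);

    x = (x & m1) + ((x >> 5) & m1);   /* 13 digits -> 7 sums, each <= 62 */
    x = (x & m2) + ((x >> 10) & m2);  /* 7 sums -> 4 sums, each <= 124   */
    x = (x & m3) + ((x >> 20) & m3);  /* 4 sums -> 2 sums, each <= 248   */
    x = (x & w) + (x >> 40);          /* 2 sums -> 1 sum, at most 387    */
    x = (x & d) + (x >> 5);           /* fold once more, at most 42      */
    return (unsigned)(x >= 31 ? x - 31 : x);
}
```

Compared with the 32-bit function, this needs one extra folding step (13 digits to 7) before the same 7 to 4 to 2 to 1 sequence.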

See @rici's answer for some good optimizations. I still recommend using uint32_t rather than unsigned, and 31UL rather than 31U in shifts like 31U << 15, as an unsigned may be only 16 bits wide. (A 16-bit int was popular in the embedded world in 2014.)

[Edit2]

Besides letting the compiler use its optimizer, two additional techniques improved performance. These are minor parlor tricks that yielded a modest improvement. Keep in mind that YMMV and that this is for a 32-bit unsigned.

Using a table look-up for the last modulo improved speed by 10-20%. Using an unsigned table t rather than an unsigned char table helped a bit too. It turned out that the table length, which I first expected to need 2*31 entries, only needed 31+5.
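The 31+5 figure can be checked by brute force. The sketch below (the function name max_folded_sum is hypothetical) assumes a 32-bit input, whose base-32 digit sum is at most 6 * 31 + 3 = 189, and finds the largest value the final fold can produce, which is the largest index the table ever sees.

```c
/* Largest value of (s & 31) + (s >> 5) over every digit sum s that a
   32-bit input can produce: six 5-bit digits plus one 2-bit digit,
   so s is at most 6 * 31 + 3 = 189. */
unsigned max_folded_sum(void) {
    unsigned max = 0;
    for (unsigned s = 0; s <= 6 * 31 + 3; ++s) {
        unsigned folded = (s & 31u) + (s >> 5);
        if (folded > max) max = folded;
    }
    return max;
}
```

The function returns 35 (reached at s = 159), so valid indices run from 0 to 35 and 36 = 31 + 5 entries suffice.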

Using a local variable rather than always using the function parameter surprisingly helped. This is likely a weakness in my gcc compiler.

I found non-branching solutions (not shown) to replace x >= 31 ? x-31 : x, but their coding complexity was greater and their performance was slower.
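The answer does not show its non-branching variants. Purely as an illustration of what such a replacement can look like (this is not the author's code, and the name reduce31_branchless is made up), one possibility for inputs in the range 0 to 61 is:

```c
#include <assert.h>

/* One possible branch-free replacement for x >= 31 ? x - 31 : x,
   valid only for 0 <= x <= 61. (x + 1) >> 5 is 1 exactly when x >= 31,
   and adding it before masking with 31 has the effect of subtracting 31. */
unsigned reduce31_branchless(unsigned x) {
    assert(x <= 61u); /* precondition of this sketch */
    return (x + ((x + 1u) >> 5)) & 31u;
}
```

The range restriction is enough here because the value reaching the final step is at most 35 for a 32-bit input.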

All-in-all, a fun exercise.

unsigned mod31quik(unsigned xx) {
  #define mask (31u | (31u << 10) | (31u << 20) | (31u << 30))
  unsigned x = (xx & mask) + ((xx >> 5) & mask);
  x += x >> 20;
  x += x >> 10;
  x = (x & 31u) + ((x >> 5) & 31u);

  static const unsigned char t[31 * 2 /* 36 */] = { 0, 1, 2, 3, 4, 5, 6,
      7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
      25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
  return t[x];
}

Answer 3:

int mod31(int a){
    while(a >= 31) {
        a -= 31;
    }
    return a;
}

It works if a >= 0, but I doubt it will be faster than the % operator.

Answer 4:

If you want the remainder of dividing by a denominator d such that d = (1 << e) - 1, where e is some exponent, you can use the fact that the binary expansion of 1/d is a repeating fraction with a bit set every e digits. For example, for e = 5, d = 31, and in binary 1/d = 0.0000100001....

Similar to rici’s answer, this algorithm effectively computes the sum of the base-(1 << e) digits of a:

#include <stdint.h>

uint16_t mod31(uint16_t a) {
    uint16_t b;
    for (b = a; a > 31; a = b)
        for (b = 0; a != 0; a >>= 5)
            b += a & 31;
    return b == 31 ? 0 : b;
}

You can unroll this loop, because the denominator and the number of bits in the numerator are both constant, but it’s probably better to let the compiler do that. And of course you can change 5 to an input parameter and 31 to a variable computed from that.
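A parameterized version along those lines might look like this. The name mod_mersenne, the use of uint32_t, and the restriction to 1 <= e <= 31 are choices made for this sketch, not part of the original answer.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: a mod d, where d = 2^e - 1, by repeatedly summing base-2^e digits. */
uint32_t mod_mersenne(uint32_t a, unsigned e) {
    assert(e >= 1 && e <= 31);              /* shifting by 32 would be undefined */
    const uint32_t d = (UINT32_C(1) << e) - 1;
    uint32_t b = a;
    while (a > d) {
        for (b = 0; a != 0; a >>= e)        /* sum the e-bit digits of a */
            b += a & d;
        a = b;                              /* the digit sum is smaller than a */
    }
    return b == d ? 0 : b;                  /* d itself is congruent to 0 */
}
```

With e = 5 this behaves like the mod31 function above, for example mod_mersenne(1000, 5) gives 8, and with e = 3 it computes the remainder modulo 7.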

Answer 5:

You could use successive addition / subtraction. There is no other trick: since 31 is a prime number, to see what a number N is mod 31 you will have to divide and find the remainder.

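int mode(int number, int modulus) {
    int result = number;

    if (number >= 0) {
        // Subtract until the result drops into the range [0, modulus)
        while (result >= modulus) { result = result - modulus; }
    } else {
        // Add until the result is non-negative (unlike C's %, which keeps the sign)
        while (result < 0) { result = result + modulus; }
    }
    return result;
}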

Source: https://stackoverflow.com/questions/26047196/is-there-any-way-to-write-mod-31-without-modulus-division-operators

Tags

bit-manipulation

bitwise-operators

modulus