Fast divisibility tests (by 2,3,4,5,.., 16)?

轻奢々 2020-12-05 00:14

What are the fastest divisibility tests? Say, given a little-endian architecture and a 32-bit signed integer: how can one determine very quickly whether a number is divisible by 2,3,4,

16 answers
  • 2020-12-05 00:46

    In a previous question, I showed a fast algorithm to check in base N for divisors that are factors of N-1. Base transformations between different powers of 2 are trivial; that's just bit grouping.

    Therefore, checking for 3 is easy in base 4 (4 - 1 = 3); checking for 5 is easy in base 16 (15 = 3 * 5), and checking for 7 (and 9) is easy in base 64 (63 = 7 * 9).

    Composite divisors follow from the tests for their prime-power factors, so only 11 and 13 are hard cases. For 11, you could use base 1024 (1023 = 3 * 11 * 31), but at that point it's not really efficient for small integers.
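
    A minimal sketch of that idea in C, assuming an unsigned 32-bit input. The helper names digit_sum_pow2 and divisible_via_base are made up for illustration. It sums the base-2^k digits until a single digit remains and then finishes with a few subtractions, so d must divide 2^k - 1:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sum the base-2^k digits of n, repeating until one digit is left.
   Since 2^k == 1 (mod 2^k - 1), every pass preserves n modulo 2^k - 1,
   and therefore modulo every divisor of 2^k - 1. Assumes 1 <= k <= 16. */
static uint32_t digit_sum_pow2(uint32_t n, unsigned k)
{
    const uint32_t mask = (1u << k) - 1u;   /* largest base-2^k digit */
    while (n > mask) {
        uint32_t sum = 0;
        for (; n != 0; n >>= k)
            sum += n & mask;                /* add one digit */
        n = sum;                            /* strictly smaller than before */
    }
    return n;
}

/* True if d divides n. Only valid when d divides 2^k - 1
   (hypothetical helper, not from the answer above). */
static bool divisible_via_base(uint32_t n, unsigned k, uint32_t d)
{
    uint32_t r = digit_sum_pow2(n, k);      /* r <= 2^k - 1 */
    while (r >= d)
        r -= d;                             /* a handful of subtractions */
    return r == 0;
}
```

    For example, divisible_via_base(n, 2, 3) tests 3 in base 4, divisible_via_base(n, 4, 5) tests 5 in base 16, and divisible_via_base(n, 6, 7) or divisible_via_base(n, 6, 9) tests 7 or 9 in base 64.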

  • 2020-12-05 00:51

    First, recall that a number of the form bn...b2b1b0 in binary has the value:

    number = bn*2^n+...+b2*4+b1*2+b0
    

    Now, when you say number%3, you have:

    number%3 =3= bn*(2^n % 3)+...+b2*1+b1*2+b0
    

    (I used =3= to indicate congruence modulo 3). Note also that b1*2 =3= -b1*1

    Now I will write all 16 divisibility tests using + and - and possibly multiplication. (A multiplication by a constant can be written as a shift, or as a sum of the same value shifted by different amounts. For example, 5*x is x+(x<<2), in which x is computed only once.)

    Let's call the number n and let's say Divisible_by_i is a boolean value. As an intermediate value, imagine Congruence_by_i is a value congruent to n modulo i.

    Also, let's say n0 means bit zero of n, n1 means bit 1, and so on; that is:

    ni = (n >> i) & 1;
    
    Congruence_by_1 = 0
    Congruence_by_2 = n&0x1
    Congruence_by_3 = n0-n1+n2-n3+n4-n5+n6-n7+n8-n9+n10-n11+n12-n13+n14-n15+n16-n17+n18-n19+n20-n21+n22-n23+n24-n25+n26-n27+n28-n29+n30-n31
    Congruence_by_4 = n&0x3
    Congruence_by_5 = n0+2*n1-n2-2*n3+n4+2*n5-n6-2*n7+n8+2*n9-n10-2*n11+n12+2*n13-n14-2*n15+n16+2*n17-n18-2*n19+n20+2*n21-n22-2*n23+n24+2*n25-n26-2*n27+n28+2*n29-n30-2*n31
    Congruence_by_7 = n0+2*n1+4*n2+n3+2*n4+4*n5+n6+2*n7+4*n8+n9+2*n10+4*n11+n12+2*n13+4*n14+n15+2*n16+4*n17+n18+2*n19+4*n20+n21+2*n22+4*n23+n24+2*n25+4*n26+n27+2*n28+4*n29+n30+2*n31
    Congruence_by_8 = n&0x7
    Congruence_by_9 = n0+2*n1+4*n2-n3-2*n4-4*n5+n6+2*n7+4*n8-n9-2*n10-4*n11+n12+2*n13+4*n14-n15-2*n16-4*n17+n18+2*n19+4*n20-n21-2*n22-4*n23+n24+2*n25+4*n26-n27-2*n28-4*n29+n30+2*n31
    Congruence_by_11 = n0+2*n1+4*n2+8*n3+5*n4-n5-2*n6-4*n7-8*n8-5*n9+n10+2*n11+4*n12+8*n13+5*n14-n15-2*n16-4*n17-8*n18-5*n19+n20+2*n21+4*n22+8*n23+5*n24-n25-2*n26-4*n27-8*n28-5*n29+n30+2*n31
    Congruence_by_13 = n0+2*n1+4*n2+8*n3+3*n4+6*n5-n6-2*n7-4*n8-8*n9-3*n10-6*n11+n12+2*n13+4*n14+8*n15+3*n16+6*n17-n18-2*n19-4*n20-8*n21-3*n22-6*n23+n24+2*n25+4*n26+8*n27+3*n28+6*n29-n30-2*n31
    Congruence_by_16 = n&0xF
    

    Or when factorized:

    Congruence_by_1 = 0
    Congruence_by_2 = n&0x1
    Congruence_by_3 = (n0+n2+n4+n6+n8+n10+n12+n14+n16+n18+n20+n22+n24+n26+n28+n30)-(n1+n3+n5+n7+n9+n11+n13+n15+n17+n19+n21+n23+n25+n27+n29+n31)
    Congruence_by_4 = n&0x3
    Congruence_by_5 = n0+n4+n8+n12+n16+n20+n24+n28-(n2+n6+n10+n14+n18+n22+n26+n30)+2*(n1+n5+n9+n13+n17+n21+n25+n29-(n3+n7+n11+n15+n19+n23+n27+n31))
    Congruence_by_7 = n0+n3+n6+n9+n12+n15+n18+n21+n24+n27+n30+2*(n1+n4+n7+n10+n13+n16+n19+n22+n25+n28+n31)+4*(n2+n5+n8+n11+n14+n17+n20+n23+n26+n29)
    Congruence_by_8 = n&0x7
    Congruence_by_9 = n0+n6+n12+n18+n24+n30-(n3+n9+n15+n21+n27)+2*(n1+n7+n13+n19+n25+n31-(n4+n10+n16+n22+n28))+4*(n2+n8+n14+n20+n26-(n5+n11+n17+n23+n29))
    // and so on
    

    If any of these values ends up negative, add i to it until it becomes non-negative.

    Now what you should do is recursively feed these values through the same process until Congruence_by_i becomes less than i (and obviously >= 0). If a pass no longer shrinks the value (for example, 7 maps to itself for i = 7), finish by subtracting i. This is similar to what we do when we want to find the remainder of a decimal number modulo 3 or 9, remember? Sum up the digits; if the result has more than one digit, sum up its digits again until only one digit is left.
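
    The procedure can be sketched in C roughly as follows. This assumes an unsigned 32-bit n and an odd i between 3 and 15; the names weighted_bit_sum and congruence_by are hypothetical. The weights are generated instead of written out, and the sketch does a single pass and then finishes with additions and subtractions of i rather than further passes, which is a simplification of mine:

```c
#include <stdint.h>

/* One pass of the bit-weight sum. The weight of bit k is 2^k mod m,
   written with a minus sign once the residue reaches m - 1 (== -1 mod m),
   which reproduces the +/- pattern of the formulas above.
   Assumes m is odd and 3 <= m <= 15. */
static int32_t weighted_bit_sum(uint32_t n, int32_t m)
{
    int32_t acc = 0;
    int32_t mag = 1;     /* magnitude of the current weight */
    int32_t sign = 1;    /* sign of the current weight */
    for (unsigned k = 0; k < 32; ++k) {
        if ((n >> k) & 1u)
            acc += sign * mag;
        mag *= 2;                    /* weight of the next bit */
        if (mag >= m)
            mag -= m;
        if (mag == m - 1) {          /* m - 1 is -1 modulo m */
            mag = 1;
            sign = -sign;
        }
    }
    return acc;                      /* congruent to n modulo m */
}

/* Reduce the weighted sum to the range [0, m). */
static uint32_t congruence_by(uint32_t n, int32_t m)
{
    int32_t r = weighted_bit_sum(n, m);
    while (r < 0)
        r += m;                      /* negative: add m until non-negative */
    while (r >= m)
        r -= m;                      /* small by now, so only a few steps */
    return (uint32_t)r;
}
```

    With these, Divisible_by_7 would be congruence_by(n, 7) == 0.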

    Now for i = 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 16:

    Divisible_by_i = (Congruence_by_i == 0);
    

    And for the rest:

    Divisible_by_6 = Divisible_by_3 && Divisible_by_2;
    Divisible_by_10 = Divisible_by_5 && Divisible_by_2;
    Divisible_by_12 = Divisible_by_4 && Divisible_by_3;
    Divisible_by_14 = Divisible_by_7 && Divisible_by_2;
    Divisible_by_15 = Divisible_by_5 && Divisible_by_3;
    

    Edit: Note that some of the additions can be avoided from the start. For example, n0+2*n1+4*n2 is the same as n&0x7, and n3+2*n4+4*n5 is (n>>3)&0x7, so you don't have to extract each bit individually; I wrote the formulas that way for clarity and uniformity. To optimize a given formula, group the operands and factor out the common operations.
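
    As a sketch of that grouping, here is one way to handle 7 and 9 three bits at a time in C. Since 8 is 1 modulo 7 and -1 modulo 9, the 3-bit groups are simply added for 7 and added with alternating signs for 9. The function names are hypothetical and an unsigned 32-bit n is assumed:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sum the 3-bit groups of n: n&7, (n>>3)&7, (n>>6)&7, ...
   With alternate != 0 the groups get signs +, -, +, - ... */
static int32_t group3_sum(uint32_t n, int alternate)
{
    int32_t acc = 0;
    int32_t sign = 1;
    for (; n != 0; n >>= 3) {
        acc += sign * (int32_t)(n & 0x7u);
        if (alternate)
            sign = -sign;
    }
    return acc;
}

static bool divisible_by_7(uint32_t n)
{
    int32_t r = group3_sum(n, 0);    /* 8 == 1 (mod 7): plain sum */
    while (r >= 7)
        r -= 7;
    return r == 0;
}

static bool divisible_by_9(uint32_t n)
{
    int32_t r = group3_sum(n, 1);    /* 8 == -1 (mod 9): alternating sum */
    while (r < 0)
        r += 9;
    while (r >= 9)
        r -= 9;
    return r == 0;
}
```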

  • 2020-12-05 00:53

    It is not a bad idea AT ALL to figure out alternatives to division instructions (which includes modulo on x86/x64) because they are very slow. Slower (or even much slower) than most people realize. Those suggesting "% n" where n is a variable are giving foolish advice because it will invariably lead to the use of the division instruction. On the other hand "% c" (where c is a constant) will allow the compiler to determine the best algorithm available in its repertoire. Sometimes it will be the division instruction but a lot of the time it won't.

    In this document Torbjörn Granlund shows that the ratio of clock cycles required for unsigned 32-bit mults:divs is 4:26 (6.5x) on Sandybridge and 3:45 (15x) on K10. For 64-bit the respective ratios are 4:92 (23x) and 5:77 (15.4x).

    The "L" columns denote latency. "T" columns denote throughput. This has to do with the processor's ability to handle multiple instructions in parallel. Sandybridge can issue one 32-bit multiplication every other cycle or one 64-bit every cycle. For K10 the corresponding throughput is reversed. For divisions the K10 needs to complete the entire sequence before it may begin another. I suspect it is the same for Sandybridge.

    Using the K10 as an example it means that during the cycles required for a 32-bit division (45) the same number (45) of multiplications can be issued and the next-to-last and last one of these will complete one and two clock cycles after the division has completed. A LOT of work can be performed in 45 multiplications.

    It is also interesting to note that divs have become less efficient with the evolution from K8-K9 to K10: from 39 to 45 and 71 to 77 clock cycles for 32- and 64-bit.

    Granlund's pages at gmplib.org and at the Royal Institute of Technology in Stockholm contain more goodies, some of which have been incorporated into the GCC compiler.

  • 2020-12-05 00:53

    As @James mentioned, let the compiler simplify it for you. If n is a constant, any decent compiler is able to recognize the pattern and change it to a more efficient equivalent.

    For example, the code

    #include <stdio.h>
    
    int main() {
        size_t x;
        if (scanf("%zu", &x) != 1) return 1;   // %zu matches size_t; bail out on bad input
        __asm__ volatile ("nop;nop;nop;nop;nop;");
        const char* volatile foo = (x%3 == 0) ? "yes" : "no";
        __asm__ volatile ("nop;nop;nop;nop;nop;");
        printf("%s\n", foo);
        return 0;
    }
    

    When compiled with g++-4.5 -O3, the relevant part (x%3 == 0) becomes

    mov    rcx,QWORD PTR [rbp-0x8]   # rbp-0x8 = &x
    mov    rdx,0xaaaaaaaaaaaaaaab
    mov    rax,rcx
    mul    rdx
    lea    rax,"yes"
    shr    rdx,1
    lea    rdx,[rdx+rdx*2]
    cmp    rcx,rdx
    lea    rdx,"no"
    cmovne rax,rdx
    mov    QWORD PTR [rbp-0x10],rax
    

    which, translated back to C code, means

    (hi64bit(x * 0xaaaaaaaaaaaaaaab) / 2) * 3 == x ? "yes" : "no"
    // equivalent to:                   x % 3 == 0 ? "yes" : "no"
    

    No division is involved here. (Note that 0xaaaaaaaaaaaaaaab == 0x20000000000000001L/3)
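
    The same trick can be written out by hand for 32-bit values. This is a sketch under the assumption of an unsigned 32-bit x, with hypothetical function names; the constant 0xAAAAAAAB is (2^33 + 1)/3, the 32-bit counterpart of the 64-bit constant above:

```c
#include <stdbool.h>
#include <stdint.h>

/* floor(x / 3) without a division instruction: multiply by
   (2^33 + 1) / 3 in 64-bit arithmetic and keep the bits above bit 32. */
static uint32_t div3_by_multiply(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

/* x is divisible by 3 exactly when floor(x / 3) * 3 gives x back. */
static bool divisible_by_3_mul(uint32_t x)
{
    return div3_by_multiply(x) * 3u == x;
}
```

    In practice the compiler already does this for x % 3 with a constant divisor, so writing it by hand mostly helps to see what the generated code means.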


    Edit:

    • The magic constant 0xaaaaaaaaaaaaaaab can be computed at http://www.hackersdelight.org/magic.htm
    • For divisors of the form 2^n - 1, check http://graphics.stanford.edu/~seander/bithacks.html#ModulusDivision
  • 2020-12-05 00:55

    Here are some tips I haven't seen anyone else suggest yet:

    One idea is to use a switch statement, or precompute some array. Then, any decent optimizer can simply index each case directly. For example:

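    // tests for (2,3,4,6): n % 12 determines n modulo every divisor of 12
    // (n is assumed non-negative; handle() is a placeholder for your own code)
    switch (n % 12)
    {
    case 0: handle(2); handle(3); handle(4); handle(6); break;
    case 2: case 10: handle(2); break;
    case 3: case 9: handle(3); break;
    case 4: case 8: handle(2); handle(4); break;
    case 6: handle(2); handle(3); handle(6); break;
    default: break;  // 1, 5, 7, 11: none of 2, 3, 4, 6 divides n
    }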
    

    Your use case is a bit unclear, but you may only need to check the primes below 16, plus the prime powers 4, 8, 9 and 16. This is because every other number up to 16 is a product of coprime numbers from that set. So for n=16, you might be able to get away with only checking 2, 3, 5, 7, 11, 13 and those prime powers somehow. Just a thought.

  • 2020-12-05 00:56

    One thing to consider: since you only care about divisibility up to 16, you really only need to check divisibility by the primes up to 16. These are 2, 3, 5, 7, 11, and 13.

    Divide your number by each of the primes, keeping track with a boolean (such as div2 = true). The numbers two and three are special cases. If div3 is true, try dividing by 3 again, setting div9. Two and its powers are very simple (note: '&' is one of the fastest things a processor can do):

    if (n & 1) == 0:
        div2 = True
        if (n & 3) == 0:
            div4 = True
            if (n & 7) == 0:
                div8 = True
                if (n & 15) == 0:
                    div16 = True
    

    You now have the booleans div2, div3, div4, div5, div7, div8, div9, div11, div13, and div16. All other numbers are combinations; for instance div6 is the same as (div2 && div3)

    So, you only need to do either 5 or 6 actual divisions (6 only if your number is divisible by 3).

    For myself, I would probably use bits in a single register for my booleans; for instance, bit 0 means div2. I can then use masks:

    if (flags & (div2+div3)) == (div2 + div3): do_6()

    Note that div2+div3 can be a precomputed constant. If div2 is bit 0 and div3 is bit 1, then div2+div3 == 3. This makes the above 'if' optimize to:

    if (flags & 3) == 3: do_6()
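
    Here is a rough C sketch of the flag register, assuming an unsigned 32-bit n. The bit assignments and the names DIV2 ... DIV16, divisibility_flags and has_all are my own choices for illustration. It uses % with constant divisors, which a compiler normally turns into multiplications:

```c
#include <stdint.h>

/* One bit per basic test; composite tests are masks of these bits. */
enum {
    DIV2  = 1 << 0, DIV3  = 1 << 1, DIV4  = 1 << 2, DIV5  = 1 << 3,
    DIV7  = 1 << 4, DIV8  = 1 << 5, DIV9  = 1 << 6, DIV11 = 1 << 7,
    DIV13 = 1 << 8, DIV16 = 1 << 9
};

static uint32_t divisibility_flags(uint32_t n)
{
    uint32_t flags = 0;
    /* Powers of two only need a mask. */
    if ((n & 1u) == 0)  flags |= DIV2;
    if ((n & 3u) == 0)  flags |= DIV4;
    if ((n & 7u) == 0)  flags |= DIV8;
    if ((n & 15u) == 0) flags |= DIV16;
    /* 3 is special: only try 9 when 3 already divides n. */
    if (n % 3u == 0) {
        flags |= DIV3;
        if (n % 9u == 0) flags |= DIV9;
    }
    if (n % 5u == 0)  flags |= DIV5;
    if (n % 7u == 0)  flags |= DIV7;
    if (n % 11u == 0) flags |= DIV11;
    if (n % 13u == 0) flags |= DIV13;
    return flags;
}

/* Nonzero when every bit of mask is set in flags. */
static int has_all(uint32_t flags, uint32_t mask)
{
    return (flags & mask) == mask;
}
```

    Divisibility by 6 is then has_all(flags, DIV2 | DIV3), and by 12 it is has_all(flags, DIV3 | DIV4).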

    So now... mod without a divide:

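    def mod(n, m):
        # shift-and-subtract remainder; assumes n >= 0 and m > 0
        i = 0
        while m <= n:        # scale m up until it exceeds n
            m <<= 1
            i += 1
        while i > 0:         # undo the shifts, subtracting wherever m fits
            m >>= 1
            if m <= n: n -= m
            i -= 1
        return n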
    
    div3 = mod(n,3) == 0
    ...
    

    By the way, the worst case for the above code is 31 times through each loop for a 32-bit number.

    FYI: Just looked at Msalter's post, above. His technique can be used instead of mod(...) for some of the primes.
