What are the fastest divisibility tests? Say, given a little-endian architecture and a 32-bit signed integer: how to calculate very fast that a number is divisible by 2,3,4,
In a previous question, I showed a fast algorithm to check in base N for divisors that are factors of N-1. Base transformations between different powers of 2 are trivial; that's just bit grouping.
Therefore, checking for 3 is easy in base 4; checking for 5 is easy in base 16, and checking for 7 (and 9) is easy in base 64.
Non-prime divisors are trivial, so only 11 and 13 are hard cases. For 11, you could use base 1024, but at that point it's not really efficient for small integers.
First of all, I remind you that a number in the form bn...b2b1b0 in binary has value:
number = bn*2^n+...+b2*4+b1*2+b0
Now, when you say number%3, you have:
number%3 =3= bn*(2^n % 3)+...+b2*1+b1*2+b0
(I used =3= to indicate congruence modulo 3). Note also that b1*2 =3= -b1*1
Now I will write all the 16 divisions using + and - and possibly multiplication (note that multiplication could be written as shift or sum of same value shifted to different locations. For example 5*x
means x+(x<<2)
in which you compute x
once only)
Let's call the number n
and let's say Divisible_by_i
is a boolean value. As an intermediate value, imagine Congruence_by_i
is a value congruent to n
modulo i
.
Also, lets say n0
means bit zero of n, n1
means bit 1 etc, that is
ni = (n >> i) & 1;
Congruence_by_1 = 0
Congruence_by_2 = n&0x1
Congruence_by_3 = n0-n1+n2-n3+n4-n5+n6-n7+n8-n9+n10-n11+n12-n13+n14-n15+n16-n17+n18-n19+n20-n21+n22-n23+n24-n25+n26-n27+n28-n29+n30-n31
Congruence_by_4 = n&0x3
Congruence_by_5 = n0+2*n1-n2-2*n3+n4+2*n5-n6-2*n7+n8+2*n9-n10-2*n11+n12+2*n13-n14-2*n15+n16+2*n17-n18-2*n19+n20+2*n21-n22-2*n23+n24+2*n25-n26-2*n27+n28+2*n29-n30-2*n31
Congruence_by_7 = n0+2*n1+4*n2+n3+2*n4+4*n5+n6+2*n7+4*n8+n9+2*n10+4*n11+n12+2*n13+4*n14+n15+2*n16+4*n17+n18+2*n19+4*n20+n21+2*n22+4*n23+n24+2*n25+4*n26+n27+2*n28+4*n29+n30+2*n31
Congruence_by_8 = n&0x7
Congruence_by_9 = n0+2*n1+4*n2-n3-2*n4-4*n5+n6+2*n7+4*n8-n9-2*n10-4*n11+n12+2*n13+4*n14-n15-2*n16-4*n17+n18+2*n19+4*n20-n21-2*n22-4*n23+n24+2*n25+4*n26-n27-2*n28-4*n29+n30+2*n31
Congruence_by_11 = n0+2*n1+4*n2+8*n3+5*n4-n5-2*n6-4*n7-8*n8-5*n9+n10+2*n11+4*n12+8*n13+5*n14-n15-2*n16-4*n17-8*n18-5*n19+n20+2*n21+4*n22+8*n23+5*n24-n25-2*n26-4*n27-8*n28-5*n29+n30+2*n31
Congruence_by_13 = n0+2*n1+4*n2+8*n3+3*n4+6*n5-n6-2*n7-4*n8-8*n9-3*n10-6*n11+n12+2*n13+4*n14+8*n15+3*n16+6*n17-n18-2*n19-4*n20-8*n21-3*n22-6*n3+n24+2*n25+4*n26+8*n27+3*n28+6*n29-n30-2*n31
Congruence_by_16 = n&0xF
Or when factorized:
Congruence_by_1 = 0
Congruence_by_2 = n&0x1
Congruence_by_3 = (n0+n2+n4+n6+n8+n10+n12+n14+n16+n18+n20+n22+n24+n26+n28+n30)-(n1+n3+n5+n7+n9+n11+n13+n15+n17+n19+n21+n23+n25+n27+n29+n31)
Congruence_by_4 = n&0x3
Congruence_by_5 = n0+n4+n8+n12+n16+n20+n24+n28-(n2+n6+n10+n14+n18+n22+n26+n30)+2*(n1+n5+n9+n13+n17+n21+n25+n29-(n3+n7+n11+n15+n19+n23+n27+n31))
Congruence_by_7 = n0+n3+n6+n9+n12+n15+n18+n21+n24+n27+n30+2*(n1+n4+n7+n10+n13+n16+n19+n22+n25+n28+n31)+4*(n2+n5+n8+n11+n14+n17+n20+n23+n26+n29)
Congruence_by_8 = n&0x7
Congruence_by_9 = n0+n6+n12+n18+n24+n30-(n3+n9+n15+n21+n27)+2*(n1+n7+n13+n19+n25+n31-(n4+n10+n16+n22+n28))+4*(n2+n8+n14+n20+n26-(n5+n11+n17+n23+n29))
// and so on
If these values end up being negative, add it with i
until they become positive.
Now what you should do is recursively feed these values through the same process we just did until Congruence_by_i
becomes less than i
(and obviously >= 0
). This is similar to what we do when we want to find remainder of a number by 3 or 9, remember? Sum up the digits, if it had more than one digit, some up the digits of the result again until you get only one digit.
Now for i = 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 16
:
Divisible_by_i = (Congruence_by_i == 0);
And for the rest:
Divisible_by_6 = Divisible_by_3 && Divisible_by_2;
Divisible_by_10 = Divisible_by_5 && Divisible_by_2;
Divisible_by_12 = Divisible_by_4 && Divisible_by_3;
Divisible_by_14 = Divisible_by_7 && Divisible_by_2;
Divisible_by_15 = Divisible_by_5 && Divisible_by_3;
Edit: Note that some of the additions could be avoided from the very beginning. For example n0+2*n1+4*n2
is the same as n&0x7
, similarly n3+2*n4+4*n5
is (n>>3)&0x7
and thus with each formula, you don't have to get each bit individually, I wrote it like that for the sake of clarity and similarity in operation. To optimize each of the formulas, you should work on it yourself; group operands and factorize operation.
It is not a bad idea AT ALL to figure out alternatives to division instructions (which includes modulo on x86/x64) because they are very slow. Slower (or even much slower) than most people realize. Those suggesting "% n" where n is a variable are giving foolish advice because it will invariably lead to the use of the division instruction. On the other hand "% c" (where c is a constant) will allow the compiler to determine the best algorithm available in its repertoire. Sometimes it will be the division instruction but a lot of the time it won't.
In this document Torbjörn Granlund shows that the ratio of clock cycles required for unsigned 32-bit mults:divs is 4:26 (6.5x) on Sandybridge and 3:45 (15x) on K10. for 64-bit the respective ratios are 4:92 (23x) and 5:77 (14.4x).
The "L" columns denote latency. "T" columns denote throughput. This has to do with the processor's ability to handle multiple instructions in parallell. Sandybridge can issue one 32-bit multiplication every other cycle or one 64-bit every cycle. For K10 the corresponding throughput is reversed. For divisions the K10 needs to complete the entire sequence before it may begin another. I suspect it is the same for Sandybridge.
Using the K10 as an example it means that during the cycles required for a 32-bit division (45) the same number (45) of multiplications can be issued and the next-to-last and last one of these will complete one and two clock cycles after the division has completed. A LOT of work can be performed in 45 multiplications.
It is also interesting to note that divs have become less efficient with the evolution from K8-K9 to K10: from 39 to 45 and 71 to 77 clock cycles for 32- and 64-bit.
Granlund's page at gmplib.org and at the Royal Institute of Technology in Stockholm contain more goodies, some of which have been incorporated into the gcc compiler.
As @James mentioned, let the compiler simplify it for you. If n
is a constant, any descent compiler is able to recognize the pattern and change it to a more efficient equivalent.
For example, the code
#include <stdio.h>
int main() {
size_t x;
scanf("%u\n", &x);
__asm__ volatile ("nop;nop;nop;nop;nop;");
const char* volatile foo = (x%3 == 0) ? "yes" : "no";
__asm__ volatile ("nop;nop;nop;nop;nop;");
printf("%s\n", foo);
return 0;
}
compiled with g++-4.5 -O3, the relevant part of x%3 == 0
will become
mov rcx,QWORD PTR [rbp-0x8] # rbp-0x8 = &x
mov rdx,0xaaaaaaaaaaaaaaab
mov rax,rcx
mul rdx
lea rax,"yes"
shr rdx,1
lea rdx,[rdx+rdx*2]
cmp rcx,rdx
lea rdx,"no"
cmovne rax,rdx
mov QWORD PTR [rbp-0x10],rax
which, translated back to C code, means
(hi64bit(x * 0xaaaaaaaaaaaaaaab) / 2) * 3 == x ? "yes" : "no"
// equivalatent to: x % 3 == 0 ? "yes" : "no"
no division involved here. (Note that 0xaaaaaaaaaaaaaaab == 0x20000000000000001L/3
)
Edit:
Here are some tips I haven't see anyone else suggest yet:
One idea is to use a switch
statement, or precompute some array. Then, any decent optimizer can simply index each case directly. For example:
// tests for (2,3,4,5,6,7)
switch (n % 8)
{
case 0: break;
case 1: break;
case 2: do(2); break;
case 3: do(3); break;
case 4: do(2); do(4) break;
case 5: do(5); break;
case 6: do(2); do(3); do(4); break;
case 7: do(7); break;
}
Your application is a bit ambiguous, but you may only need to check prime numbers less than n=16. This is because all numbers are factors of the current or previous prime numbers. So for n=16, you might be able to get away with only checking 2, 3, 5, 7, 11, 13
somehow. Just a thought.
One thing to consider: since you only care about divisibility up to 16, you really only need to check divisibility by the primes up to 16. These are 2, 3, 5, 7, 11, and 13.
Divide your number by each of the primes, keeping track with a boolean (such as div2 = true). The numbers two and three are special cases. If div3 is true, try dividing by 3 again, setting div9. Two and its powers are very simple (note: '&' is one of the fastest things a processor can do):
if n & 1 == 0:
div2 = true
if n & 3 == 0:
div4 = true
if n & 7 == 0:
div8 = true
if n & 15 == 0:
div16 = true
You now have the booleans div2, div3, div4, div5, div7, div8, div9, div11, div13, and div16. All other numbers are combinations; for instance div6 is the same as (div2 && div3)
So, you only need to do either 5 or 6 actual divisions (6 only if your number is divisible by 3).
For myself, i would probably use bits in a single register for my booleans; for instance bit_0 means div2. I can then use masks:
if (flags & (div2+div3)) == (div2 + div3): do_6()
note that div2+div3 can be a precomputed constant. If div2 is bit0, and div3 is bit1, then div2+div3 == 3. This makes the above 'if' optimize to:
if (flags & 3) == 3: do_6()
So now... mod without a divide:
def mod(n,m):
i = 0
while m < n:
m <<= 1
i += 1
while i > 0:
m >>= 1
if m <= n: n -= m
i -= 1
return n
div3 = mod(n,3) == 0
...
FYI: Just looked at Msalter's post, above. His technique can be used instead of mod(...) for some of the primes.