问题
I have been trying to learn algorithms for generating prime numbers and I came across Sieve of Atkin on Wikipedia. I understand almost all parts of the algorithm, except a few. Here are the questions:
- How are the three quadratic equations below formed? 4x^2+y^2, 3x^2+y^2 and 3x^2-y2
- The algorithm in wikipedia talks about modulo sixty but I dont understand how/where that is used in the psudocode below.
- How are these reminders 1,5,7 and 11 found?
Below is the pseudocode from Wikipedia for reference:
// arbitrary search limit
limit ← 1000000
// initialize the sieve
for i in [5, limit]: is_prime(i) ← false
// put in candidate primes:
// integers which have an odd number of
// representations by certain quadratic forms
for (x, y) in [1, √limit] × [1, √limit]:
n ← 4x²+y²
if (n ≤ limit) and (n mod 12 = 1 or n mod 12 = 5):
is_prime(n) ← ¬is_prime(n)
n ← 3x²+y²
if (n ≤ limit) and (n mod 12 = 7):
is_prime(n) ← ¬is_prime(n)
n ← 3x²-y²
if (x > y) and (n ≤ limit) and (n mod 12 = 11):
is_prime(n) ← ¬is_prime(n)
// eliminate composites by sieving
for n in [5, √limit]:
if is_prime(n):
// n is prime, omit multiples of its square; this is
// sufficient because composites which managed to get
// on the list cannot be square-free
is_prime(k) ← false, k ∈ {n², 2n², 3n², ..., limit}
print 2, 3
for n in [5, limit]:
if is_prime(n): print n
回答1:
The Sieve of Atkin pseudo code from the Wikipedia article you've quoted contains the answers to your questions or the discussion about the article for which Will Ness has provided a link does, although you may not be able to put the information together. Short answers are as follows:
The three equations come from Atkin's mathematical proof that all primes will occur as the solutions of one or more of those three equations with the appropriate modulo conditions for all valid values of the variables 'x' and 'y', and in fact he further proved that true primes will be those numbers that have an odd number of solutions to those three equations (left as True when toggled an odd number of times from initial conditions of False) with the modulo conditions for each excluding those numbers which are divisible by squares of found primes less than or equal to the square root of the tested number.
The true Sieve of Atkin is based on a set of modulo 60 conditions; this pseudo code represents a simplification where there are less ranges of conditions for each equation using a set of modulo 12 conditions (5 × 12 = 60) - however, this results in there being 20% extra work being done because of introduced redundancy in this new set of conditions. It is also the reason that this simplified pseudo code needs to start its prime scan at 5 rather than the normal 7 and to do the base primes squares free eliminations starting at a base prime of 5 at an added cost in execution time as the factors of 5 are not otherwise handled properly. The reason for this simplification is perhaps to sacrifice some extra code overhead in order to ease the code complexity in having to check values, although this can be done with a single table look up whereas the extra over 30% in work due to using modulo 12 can not be reduced.
The "reminders" (should be spelled remainders) are found by the 'mod' operators in the pseudo code you quoted which stand for the modulo operator in whatever computer language one might use, often represented by the "%" symbol in computer languages such as C, Java, C#, F#, et cetera. This operator yields the integer remainder after an integer division of the dividend (the first of the numbers, before the 'mod') by the divisor (the second of the numbers, after the 'mod' symbol). The reason that the remainders are just four rather than the 16 as used in the full modulo 60 algorithm is due to the simplifications of reducing to a modulo 12 algorithm. You'll notice that with the modulo 12 conditions, the "4x" condition passes 25 which would normally be eliminated by the modulo 60 conditions, thus many composites containing 25 as a factor need to be eliminated by the additional prime of 5 square free check. Similarly, 55 passes the "3x+" check and 35 passes the "3x-" check where they wouldn't for the full modulo 60 algorithm.
As discussed in the Wikipedia article "Talk" section and hinted above, this quoted pseudo code is never much faster than even basic odds-only Sieve of Eratosthenes (SoE) implementations let alone one that uses the same degree of wheel factorization due to its inefficiencies: the variables 'x' and 'y' do not need to range over the full range to the square root of the range sieved for many cases (mentioned in the article), the proper use of the modulo 60 wheel recovers the redundancies in the use of the modulo 12 simplification (as I mention above), and there are modulo lattice patterns in the solutions to the quadratic equations such that the conditions using the (computationally slow) modulo operations need not be tested by the use of an algorithm that automatically skips those solutions that would not satisfy those modulo conditions according to the lattice patterns (only mentioned very obscurely in the full Atkin and Bernstein paper).
To answer a question you didn't ask but should have: "Why use the Sieve of Atkin?".
The main reason that the Sieve of Atkin (SoA) is implemented rather than the Sieve of Eratosthenes (SoE) is that it is "Internet Knowledge" that the SOA is faster. There are two reasons for this belief, as follows:
The SoA is assumed to be faster because the asymptotic computational complexity is less for it than for the SoE by a factor of log(log n) where n is the range of primes sieved. Practically speaking, going from a range of two to the power 32 (four billion plus) to two to the power 64 (about 2 followed by 19 zeros), this factor is six over five equals 1.2. That means that the true SoE should take 1.2 times as long as expected by just a linear ratio when sieving to the 64-bit number range as compared to the 32-bit number range, whereas the SoA will have a linear relationship if all were ideal. You can appreciate that, first, this is a very small factor for such a huge difference in number ranges and, second, that this benefit only holds if the two algorithms had the same or close to the same performance at some reasonable range of primes.
What isn't clearly understood by that "Internet knowledge" is that these figures only apply when one is comparing the ratio of performance over a given range as compared to another given range for the same algorithm, not as a comparison between different algorithms. Thus it is useless as a proof that the SoA will be faster than the SoE since the SoA could start with a larger overhead for a given sieve range of a particular SoE algorithm as in the following optimized SoE example.
The SoA is believed to be faster because of the computational comparison made and published by Atkin and Bernstein as per their paper linked in the Wikipedia article. While the work is accurate, it only applies to the artificial limitations that they made on the SoE algorithm they compared: Since the SoA algorithm is based on modulo 60 factorization which represents two 2,3,5 factorization wheel rotations, they also limited the SoE algorithm to that same wheel factorization. Doing this, the SoE does about 424,000 composite number cull operations over the one billion range tested, whereas a fully optimized SoA as tested has about 326,000 combined toggle and square free cull operations, which each take about the same time as each composite number cull operation in the SoE due to being written in the same style. This means that the SoA is about 30% faster than the SoE for this particular set of wheel factorization conditions, and that is just about exactly what the Atkins and Bernstein comparison test showed.
However, the SoA does not respond to further levels of wheel factorization as the 2,3,5 level is "baked-in" to the algorithm, whereas the SoE does respond to further levels: using a 2,3,5,7 wheel factorization results in about the same number of operations meaning about the same performance for each. One can use even a partial higher level of wheel factorization than the 2,3,5,7 level to get the number of operations for SoE about 16.7% less than SoA, for a proportional better performance. The optimizations required to implement these extra levels of wheel factorization are actually simpler than the code complexity to implement the original optimized SoA. The memory footprint for implementing these comparable page segmented implementations is about the same, being the size of the page buffers plus the base primes array.
In addition, both would benefit from being written in a "machine state look up" style which would help in better optimization using inlined code and extreme loop unrolling but the SoE befits more from this style than the SoA due to being a simpler algorithm.
So for sieve ranges up to about the 32-bit number range, the maximally optimized SoE is about 20% faster (or more with further wheel factorization) than the SoA; however, the SoA has this asymptotic computational complexity advantage, so there will be some point at which it catches up. That point will be at about the range where the ratio of log (log n) to log (log (2^32)) or 5 is 1.2 or a range of about 2 times ten to the nineteenth power - an exceedingly large number. If the optimized SoA run on a modern desktop computer were to take about a third of a second to compute the primes in the 32-bit number range, and if the implementation were ideal as in taking linear time increase with increasing range, it would take about 45 years to compute the primes to this range. However, analysis of prime numbers in these high ranges is often done in small chunks, for which the use of the SoA would be theoretically practical as compared to the SoE for very large number sieves, but by a very small factor.
EDIT_ADD: In actual fact, neither the page segmented SoE nor the SoA continue to run in linear time for these huge ranges as compared to the lower ranges as both run into problems with the "toggling" and "culling" operations losing efficiency due to having to skip large numbers of pages per jump; this means that both will require modified algorithms to handle this "page jumping" and the very slight advantage to the SoA may be completely erased if there are any slight differences in the way that this is done. The SoA has many more terms in the "jump tables" than the SoE by about the inverse ratio between of the number of primes found up to the square root of the range to that range, and this will likely add a O(log n) term to both in processing but a constant factor larger increase for the SoA which has a higher number of entries in the "jump table". This extra fact will pretty much completely cancel out any advantage of the SoA over the SoE, even for extremely large ranges. Further, the SoA has a constant overhead of more individual loops required for the many more cases in implementing the conditions for the three separate quadratic equations plus the "prime squares free" loop instead of just a primes culling loop so can never have as low an average computational time per operation as the SoE when fully optimized. END_EDIT_ADD
EDIT_ADD2: It is my opinion that much of the confusion about the Sieve of Atkin is due to the deficiencies of the pseudo code from the Wikipedia article as quoted in the question, so have come up with a somewhat better version of the pseudo code that addresses at least some of the deficiencies to do with the range of the 'x' and 'y' variables and the modulo 12 versus modulo 60 confusion as follows:
limit ← 1000000000 // arbitrary search limit
// Initialize the sieve
for i in {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61,...}:
is_prime(i) ← false
// Put in candidate primes:
// integers which have an odd number of
// representations by certain quadratic forms.
while n ≤ limit, n ← 4x²+y² where x ∈ {1,2,...} and y ∈ {1,3,...} // odd y's
if n mod 60 ∈ {1,13,17,29,37,41,49,53}:
is_prime(n) ← ¬is_prime(n)
while n ≤ limit, n ← 3x²+y² where x ∈ {1,3,...} and y ∈ {2,4,...} // only odd
if n mod 60 ∈ {7,19,31,43}: // x's and even y's
is_prime(n) ← ¬is_prime(n)
while n ≤ limit, n ← 3x²-y² where x ∈ {2,3,...} and y ∈ {x-1,x-3,...,1} //all
if n mod 60 ∈ {11,23,47,59}: // even/odd odd/even combos
is_prime(n) ← ¬is_prime(n)
// Eliminate composites by sieving, only for those occurrences on the wheel
for n² ≤ limit where n ∈ {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61,...}:
if is_prime(n):
// n is prime, omit multiples of its square; this is
// sufficient because composites which managed to get
// on the list cannot be square-free
while c ≤ limit, c ← n² × k where
k ∈ {1,7,11,13,17,19,23,29, 31,37,41,43,47,49,53,59,...}:
is_prime(c) ← false
output 2, 3, 5
for n ≤ limit when n ← 60 × k + x where
k ∈ {0..} and
x ∈ {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61}:
if is_prime(n): output n
The above seems quite simple and is quite a good algorithm except that it still isn't faster than a basic Sieve of Eratosthenes that uses the same 2;3;5 factorization wheel because it wastes almost half of its inner loop toggle operations that fail the modulo tests. To demonstrate:
Following is the repeating 4x²+y² pattern of qualifying modulo 60's across a range of 15 values of 'x' (every value) vertically and 15 odd values of 'y' horizontally; both starting at one. Notice that they are symmetrical about a vertical axis but only 128 out of 225 or 256 out of 450 for a full 30 number range of 'x' values are valid toggle operations:
0 13 29 53 0 0 53 49 53 0 0 53 29 13 0
17 0 41 0 37 17 0 1 0 17 37 0 41 0 17
37 0 1 0 0 37 0 0 0 37 0 0 1 0 37
0 13 29 53 0 0 53 49 53 0 0 53 29 13 0
41 49 0 29 1 41 29 0 29 41 1 29 0 49 41
0 0 49 13 0 0 13 0 13 0 0 13 49 0 0
17 0 41 0 37 17 0 1 0 17 37 0 41 0 17
17 0 41 0 37 17 0 1 0 17 37 0 41 0 17
0 0 49 13 0 0 13 0 13 0 0 13 49 0 0
41 49 0 29 1 41 29 0 29 41 1 29 0 49 41
0 13 29 53 0 0 53 49 53 0 0 53 29 13 0
37 0 1 0 0 37 0 0 0 37 0 0 1 0 37
17 0 41 0 37 17 0 1 0 17 37 0 41 0 17
0 13 29 53 0 0 53 49 53 0 0 53 29 13 0
1 0 0 49 0 1 49 0 49 1 0 49 0 0 1
Following is the repeating 3x²+y² pattern of qualifying modulo 60's across a range of 5 odd values of 'x' vertically and 15 even values of 'y' horizontally. Notice that they are symmetrical about a horizontal axis but only 48 out of 75 or 144 out of 450 for a full 30 number range of 'x' values are valid toggle operations as all 144 out of 450 invalid operations with even 'x' and odd 'y' have already been eliminated:
7 19 0 7 43 0 19 19 0 43 7 0 19 7 0
31 43 0 31 7 0 43 43 0 7 31 0 43 31 0
19 31 0 19 0 0 31 31 0 0 19 0 31 19 0
31 43 0 31 7 0 43 43 0 7 31 0 43 31 0
7 19 0 7 43 0 19 19 0 43 7 0 19 7 0
Following is the repeating 3x²-y² pattern of qualifying modulo 60's across a range of 5 odd values of 'x' vertically and 15 even values of 'y' horizontally. Notice that they are symmetrical about a horizontal axis but only 48 out of 75 or 224 out of 450 for a full 30 number range of 'x' values are valid toggle operations:
59 47 0 59 23 0 47 47 0 23 59 0 47 59 0
23 11 0 23 47 0 11 11 0 47 23 0 11 23 0
11 59 0 11 0 0 59 59 0 0 11 0 59 11 0
23 11 0 23 47 0 11 11 0 47 23 0 11 23 0
59 47 0 59 23 0 47 47 0 23 59 0 47 59 0
Following is the repeating 3x²-y² pattern of qualifying modulo 60's across a range of 5 even values of 'x' vertically and 15 odd values of 'y' horizontally. Notice that they are symmetrical about a vertical axis but only 48 out of 75 or 224 out of 450 for a full 30 number range of 'x' values are valid toggle operations:
11 0 47 23 0 11 23 0 23 11 0 23 47 0 11
47 0 23 59 0 47 59 0 59 47 0 59 23 0 47
47 0 23 59 0 47 59 0 59 47 0 59 23 0 47
11 0 47 23 0 11 23 0 23 11 0 23 47 0 11
59 0 0 11 0 59 11 0 11 59 0 11 0 0 59
As one can determine by inspection of the above tables, for the pseudo code as above there is an overall repeating range of 30 values of x (including both evens and odds) which only has 688 valid operations out of a total of 1125 combined operations; thus it wastes almost half of its processing in just conditionally skipping over values due to the non-productive skipping operations being part of the innermost loops.There are two possible methods to eliminate that "hit" redundancy inefficiently, as follows:
The method as outlined in the Atkin and Bernstein paper, which uses the recognized patterns in the combined modulos of 'x' and 'y' to process only the modulo "hits" using the fact that once one locates a given modulo on the pattern, then there are an infinite sequence of values at that some modulo (equals wheel bit position) where each pattern is separated by an easily calculated (variable) offset which has the form "current position plus A times the current value of 'x' plus B" and "current position plus C times the current value of 'y' plus D" where A, B, C, and D, are simple constants (simple meaning small easily manipulated). Bernstein used that method with the additional help of many Look Up Tables (LUTs).
The method of creating overall pattern state Look Up Tables (LUTs) (one for each of the four tables above plus one for the minor prime square free loop) indexed by the modulo current values of 'x' combined with the modulo of 'y' to find the A, B, C, and D parameters to skip, not to the next pattern, but just to the next position on the horizontal sequence. This is likely the more highly performance option as it more easily allows for extreme clock cycle per operation reduction using in-lining of the unrolled loop code and will produce more efficient code for large ranges when page segmenting as the jumps per loop are smaller on average. This will likely make the clock cycles per loop close to that of a highly optimized Sieve of Eratosthenes as in primesieve, but will not likely get to quite that low due to having to compute the variable step sizes rather than being able to use fixed prime offsets for SoE.
So there are three governing objectives in reducing the running time for a prime sieve, as follows:
A successful sieve reduces the number of operations, which even the "hit optimized" SoA fails as compared to a highly wheel factorized SoE by about 16.7% for ranges of about a few billion.
A successful sieve reduces the CPU clock cycles per operation, which the SoA fails as compared to a highly optimized SoE such as primesieve because its operations are more complex due to the variable increments, again likely by 20% to 60%. Atkin and Bernstein's primegen (SoA) takes about 4.5 as compared to about 2.5 clock cycles per operation for primesieve (SoE) for a range of one billion for each, but perhaps the SoA could be sped up somewhat by borrowing some of the optimization techniques from primesieve.
A successful sieve has reasonably low code complexity so that it can be more easily extended to larger ranges using such techniques as "bucket sieving" and other page segmentation optimizations. For this the Sieve of Atkin fails miserably as it gets exponentially more complex for page segmenting large number ranges. It is extremely complex to write a SoA program that would compete with even Bernstein's primegen (SoA) let alone with primesieve, whereas it is quite easy to write code that comes close to the same performance as primesieve.
A successful sieve has a lower empirical complexity, which the SoA does have above the SoE by a factor of (log (log n) where n is the upper range to be sieved, but that extra small ratio is not likely ever enough to make up for the above top two combined loss ratios, as this factor gets smaller with increasing range. END_EDIT_ADD2
So the answer to the question "Why use the Sieve of Atkin?" is "There is no reason to use it at all if the SoE is implemented with maximum wheel factorization optimizations until the numbers sieved are extremely large (64-bit number range and higher) and then the SoA advantage is very small and perhaps not realizable at all depending on very minor tweaks in relative optimizations.".
As an answer to another similar Sieve of Atkin question, I have posted a page segmented C# version of the SoE optimized as per the above at: https://stackoverflow.com/a/20559687/549617.
回答2:
In my original answer to this question, in order to assist in better understanding of the Sieve of Atkin alogithm I extended the Wikipedia Sieve of Atkin (SoA) pseudo code to correct some of the deficiencies in the simplified "modulo 12" algorithm given, which many find confusing as compared to the full "modulo 60" algorithm. As well, the original Wikipedia pseudo code algorithm leads to inefficient implementations; specifically, the ranges of the 'x' and 'y' variables are each wider than necessary and the modulo 12 simplification results in in an additional 20% redundant work. However, as discussed in that answer, that pseudo code is still not efficient as the modulo tests are still in the innermost quadratic solution loops and thus almost half of those solution loops are unproductive for those solutions that do not pass the modulo tests; this means that a program implementing that pseudo code would likely be almost twice as slow as necessary even when there is no memory access speed bottleneck.
The algorithm as used by Atkin and Bernstein in their "primegen" program to avoid that loop redundancy can be better understood by a progression through the following observations based on the "hit" tables in my previous answer:
The symmetry of the "hit" tables is dependent on whether 'y' is odd (vertical symmetry) or even (horizontal symmetry).
Thus, if we wanted the advance of the patterns to proceed in the horizontal direction as shown there, we can flip the order of the 'x' and 'y' loops for the "3x+" and "3x-;odd x -> even y" cases meaning odd 'x' cases.
Now it's easy to eliminate the zero cases for the two flipped "3x" tables that only have five repetitions horizontally by a condition in an outer loop as those cases where "y mod 3 equals 0" plus the three extra cases only for the middle "axis" column where "y mod 5 equals 0".
For the last "3x-;even x -> odd y" case, it is easy to eliminate most of the zero's by just filtering those cases where "(y + 2) mod 3 equals 0" plus the extra two cases only for the last line where the "(x mod 60) equals 59" where "(y + 2) mod 5 equals 0" (duplicate eliminations for the symmetrical vertical axis column, which is always eliminated). Although these "y tests" would appear to have to happen in an inner loop, as there are only five "hits" per side of the vertical symmetry axis, these can be inline coded by the 'x' loop.
The most difficult eliminations are for the "4x" table where the pattern is more complex: there are five distinct horizontal patterns where each can be recognized because the "(x mod 60) for the first column" is distinct, with one of the leading zero row patterns in the first table above having a modulo of 5 and the other a modulo of 25; now the five different patterns can be determined by "inlining" the cases as offsets on either side of the vertical symmetrical axis for each pattern.
The spacing to the next row is exactly 4 × ((x + 1)² - x²) or 8x + 4 for the first table, (y + 2)² - y² or 4y + 4 for the second two (swapped) new tables, and 3 × ((x + 2)² - x²) or 12x + 12 for the last table.
For all (now) vertically symmetrical tables, the centre axis offset can be determined from the offset of the first column as either (y + 16)² - y² or 32y + 256 for the long rows and 3 × ((x + 4)² - x²) or 24x + 48 for the short rows.
In a similar way, the symmetrical offsets on either side of the vertical axis can easily be calculated from the offset of the vertical axis, for instance by the pair (y + 2)² - y² or 4 + 4x with (y - 2)² - y² or 4 - 4y where y is the centre axis offset for the long row cases and plus and minus two positions; the case for the short row 'x' cases is similar but just multiplied by three for the overall scale of the "3x" 'x' values.
The above filters determined by horizontal variable conditions would seem to break our objective of eliminating conditional code in inner loops, but that rule does not apply when determining the pattern is not the inner loop but rather starts a whole series of new inner loops across all of the patterns horizontally in the array for each valid element of the pattern. This works because we can show mathematically that each of the same equivalent position in the next pattern (which has the same modulo) are separated as follows: For horizontal pattern advances by 30 for the long row cases, the patterns are separated by (x + 30)² - x² or 60x + 900 so 60 × (x + 15); for the horizontal pattern advances by 10 for the short row cases, the patterns are separated by 3 × ((x + 10)² - x²) so 60 * (x + 5). Since both have a modulo 60 of zero, this is the exact explanation of why there are repeating patterns of the same modulo in every position. Thus, once we find the offset for each top left corner position, we can determine the modulo and offset for the starting locations for the non-zero pattern positions and just roll an inner loop using the pattern relationships from that point to the sieve array limit.
All of the above column and row increments can be done without using a "divide" operation using modulo addition where the algorithm keeps track of the pairs of quotients and modulo's of 60 with just an "overflow check and adjust" operation on the modulo/quotient pair after every operation, but the conditional code to do the "check and adjust" is likely more computationally expensive than the way as written, for which most modern optimizing compilers will generate a multiply operation using overflow truncation characteristics to replace the computationally expensive division operation for division by small integers, which usually takes less time than the combination of operations plus the conditional code required for the "check and adjust" method or using a direct divide operation.
The above sets of eliminations work very well for a packed bit representation of the candidate primes in that each of the 16 valid modulo values can be represented as one bit in a 16-bit word and the word index presents the the indexes of each modulo 60 wheel; thus we can advance by even increments evenly divisible by 60 by just advancing an integer number of 16-bit words.
For use as a non-page segmented algorithm as for the pseudo code here, it is sufficient to just move the modulo checks up out of the innermost loop without eliminating all of the "hit misses" for the resulting middle loop, as although they still are not maximally efficient in that the middle loop still processes "hit misses", that doesn't add more than about one percent to the run time; however, for paged implementations, "missed hits" even in the outer loops can add greatly to the computational overhead as these loops are run once per page over the range of 'x' values for that page so there is a factor of about the logarithm of the square root of the sieved range more of the outer loops for the SoA as compared to the Sieve of Eratosthenes (SoE). A version of more complex pseudo code that represents a (optionally bit packed) non-page-segmented version of the "hit optimized" Atkin and Bernstein method can then be written as follows:
// arbitrary search limit
limit ← 1000000000
// the set of base wheel prime candidates
s ∈ {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61}
// Initialize the sieve as an array of wheels with
// enough wheels to include the representation for limit
for n ← 60 * w + x where w ∈ {0,...(limit - 7) ÷ 60}, x in s:
sieve(n) ← false // start with no primes.
// Put in candidate primes:
// integers which have an odd number of
// representations by certain quadratic forms.
while n ≤ limit when n ← 4x²+y² // all x's and only odd y's
where x ∈ {1,...}, y ∈ {1,31,...}: // skip by pattern width in y
for cb ≤ limit when midy ← y + i, cb ← n - y² + midy²
where i ∈ {0,2,...28}: // middle level loop to "hit" test modulos
cbd ← (cb - 7) ÷ 60, cm ← cb mod 60
if cm in {1,13,17,29,37,41,49,53}:
while c ≤ limit when c ← 60 × (cbd - midy² + ((midy + 15 × j) × j)²) + cm
where j {0,...}:
sieve(c) ← ¬sieve(c) // toggles the affected bit
while n ≤ limit when n ← 3x²+y² // only odd x's and even y's
where x ∈ {1,3,...}, y ∈ {2,32,...}: // skip by pattern width in y
for cb ≤ limit when midy ← y + i, cb ← n - y² + midy²
where i ∈ {0,2,...28}: // middle level loop to "hit" test modulos
cbd ← (cb - 7) ÷ 60, cm ← cb mod 60
if cm in {7,19,31,43}:
while c ≤ limit when c ← 60 × (cbd - midy² + ((midy + 15 × j) × j)²) + cm
where j {0,...}:
sieve(c) ← ¬sieve(c) // toggles the affected bit
while n ≤ limit when n ← 3x²-y² //even/odd & odd/even combos
where x ∈ {2,3,...}, y ∈ {x-1,x-31,...,1}: // skip by pattern width in y
for midy ≥ 1 and cb ≤ limit when midy ← y - i, cb ← n + y² - midy²
where i ∈ {0,2,...28}: // middle level loop to "hit" test modulos
cbd ← (cb - 7) ÷ 60, cm ← cb mod 60
if cm in {11,23,47,59}:
while yn ≥ 1 and c ≤ limit when yn = midy - 30 * j and
c ← 60 × (cbd + midy² - ((midy - 15 × j) × j)²) + cm
where j {0,...}:
sieve(c) ← ¬sieve(c) // toggles the affected bit
// Eliminate prime squares by sieving, only for those occurrences on the wheel
for sqr ≤ limit where sqr ← n², n in {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61,...}:
if is_prime(n):
// n is prime, omit multiples of its square; this is
// sufficient because composites which managed to get
// on the list cannot be square-free
if c0 ≤ limit where c0 ← sqr × (m mod 60) where m in s:
c0d ← (c0 - 7) ÷ 60, cm ← (c0 - 7) mod 60 + 7 // means cm in s
while c ≤ limit for c ← 60 × (c0d + sqr × j) + cm
where j ∈ {0,...}:
sieve(c) ← false // cull composites of this prime square
output 2, 3, 5
for n ≤ limit when n ← 60 × k + x where k ∈ {0..}, x in s:
if sieve(n): output n
In particular, notice how the innermost quadratic loops are very simple but increment by a linearly increasing variable amount per loop, whereas the prime squares free elimination, which uses a type of Eratosthenes progression, increments by a fixed amount "sqr". These simple loops will execute quite quickly on a modern CPU, taking about 4 CPU clock cycles per loop if memory access is fast enough as for small sieve ranges or as in using a page segmented version where the pages fit within the L2/L1 CPU caches that have memory access times of about this range.
When the array is large as in many Megabytes, then the average memory access time is likely to average something between 20 and over a 100 CPU clock cycles (depending on the speed of RAM, efficiency of memory access by the CPU, and efficiency of use of intermediate caches) and this will be the main limiting factor in execution speed rather than the tightness of the loops other than the number of toggle/cull operations that the algorithm requires. These particular algorithms, both SoA and SoE, put a particular strain on the memory interface for large buffers in that they skip through the buffer incrementing by whole wheel circumferences times a multiplier at a time and repeat for a different offsets rather than handling all the "hits" of the wheel in sequence; this is due to moving the "hit" tests into a middle loop, which saves "hit" testing but has this larger buffer "jump" as a consequence, with the average "jump" for the SoA higher than for the SoE. Thus, an estimate of the execution time for this algorithm over the range of numbers to a billion is about 60 CPU clock cycles per operation (which will be less by a factor for the SoE due to a smaller average "jump") multiplied by the number of toggle/cull operations (about 260,000 for the Sieve of Atkin) divided by the CPU clock speed. Therefore, a 3.5 GHz CPU will execute this algorithm over a range to a billion in about four seconds and the main difference between execution speed for non-paged implementations of the SoA and the SoE in this style is the difference in memory access efficiency for large buffers, where a highly wheel factorized SoE is more efficient by about a factor of two.
Given that for large buffers the memory access time is the limiting speed factor, it becomes very important to find an algorithm with a minimum number of operations. The SoA as implemented by Atkin and Bernstein in their "primegen" sample program takes about a constant factor of "hit" operations as a ratio to the range sieved, and by implementing even a simple version of a program from my first answer we can determine that this factor is about 0.26 of the prime range sieved: meaning that there are about 260,000 operations for a sieve range of a billion.
For a SoE implemented with a high wheel factorization of a 210 element wheel with 48 prime candidate "hits" per rotation (small wheel for primes of 2;3;5;7) in combination with a pre-cull by the additional primes of 11;13;17;19, the number of cull operations for a sieve range n is approximately n times (ln((ln n) / 2) + M - 1/(ln n) - 1/2 - 1/3 - 1/5 - 1/7 - 1/11 - 1/13 - 1/17 - 1/19) * 48 / 210 where '*' means multiplied by, '/' means divided by, M is the Meissel-Mertens constant M ≈ 0.2614972128... correcting the number of operations arising from the (ln ln n) estimate, the reciprocal "ln n" term is the adjustment for the culling starting at the square of the base primes (less than the square root of the range n); for n of one billion, the factor compared to n is about 0.250485568 or about 250 million cull operations for a range to a billion.
This is about the same number of operations as for the SoA and contrasts with the simple 2;3;5 wheel used in the Atkin and Bernstein comparison test, which has a factor of 0.404805008 or about 400,000 operations for a sieve range of one billion. This better ratio for the more highly factorized SoE means that the constant factors for this quite realizable implementation are about the same as for the SoA except for any differences in constant factors due to loop complexity.
EDIT_ADD1: For comparison, the following is equivalent pseudo code for a highly factorized and pre-culled SoE written in the same style as the above SoA algorithm:
// arbitrary search limit
limit ← 1000000000
// the set of base wheel primes
r ∈ {23,29,31,37,41,43,47,53, 59,61,67,71,73,79,83, //positions + 19
89,97,101,103,107,109,113,121,127, 131,137,139,143,149,151,157,163,
167,169,173,179,181,187,191,193, 197,199,209,211,221,223,227,229}
// an array of length 11 times 13 times 17 times 19 = 46189 wheels initialized
// so that it doesn't contain multiples of the large wheel primes
for n where n ← 210 × w + x where w ∈ {0,...46189}, x in r: // already
if (n mod cp) not equal to 0 where cp ∈ {11,13,17,19}: // no 2,3,5,7
mstr(n) ← true else mstr(n) ← false // factors
// Initialize the sieve as an array of the smaller wheels with
// enough wheels to include the representation for limit
for n where n ← 210 × w + x, w ∈ {0,...(limit - 19) ÷ 210}, x in r:
sieve(n) ← mstr(n mod (210 × 46189)) // init pre-culled primes.
// Eliminate composites by sieving, only for those occurrences on the
// wheel using wheel factorization version of the Sieve of Eratosthenes
for n² ≤ limit when n ← 210 × k + x where k ∈ {0..}, x in r
if sieve(n):
// n is prime, cull its multiples
s ← n² - n × (x - 23) // zero'th modulo cull start position
while c0 ≤ limit when c0 ← s + n × m where m in r:
c0d ← (c0 - 23) ÷ 210, cm ← (c0 - 23) mod 210 + 23 //means cm in r
while c ≤ limit for c ← 210 × (c0d + n × j) + cm
where j ∈ {0,...}:
sieve(c) ← false // cull composites of this prime
output 2, 3, 5, 7, 11, 13, 17, 19,
for n ≤ limit when n ← 210 × k + x where k ∈ {0..}, x in r:
if sieve(n): output n
Notice how much less complex this code is in spite of it being more highly wheel factorized so that there is less of a constant overhead of operations than the SoA code while the innermost culling loop is actually slightly simpler and therefore (if anything) likely slightly faster. Also notice how many more outer loops must be run for the SoA code in there are a total of very approximately the same number of outer loops as the square root of the sieving range for the SoA where there are only about the sieve range divided by the natural logarithm of the square root of that range outer loops for the SoE - a factor of about ten times less for the SoE compared to the SoA for sieve ranges of one billion. This is the trade-off that the SoA makes: more outer loops for less operations per inner loop with bigger buffer "jumps" between operations; however, this is also what makes the SoA less efficient per operation, especially for larger sieve ranges. END_EDIT_ADD1
As to the code complexity for larger ranges, these are always implemented as page segmented sieves in order to avoid the necessity of having the whole sieve array in RAM memory (memory constraints would limit the range) and in order to take advantage of CPU caching locality to reduce average memory access times as discussed above. For the SoE, this can easily be implemented based on a saved copy of the a base primes representation where there are approximately the square root of the range divided by log of that square root number of that range; for the SoA one needs to refer to that same base primes representation for the prime squares free elimination but as well needs to use values of sequence offsets plus saved values of 'x' or 'y' (or save the current values of both 'x' and 'y') in order to resume at the next page offset with a total number of saved sequences as about the square root of the range (not divided by natural logarithm of root n), or alternatively new 'x'/'y' pair offsets can be calculated for the start of each new page at a considerable computational overhead. The first method adds greatly to the memory use overhead, especially for multi-threaded implementations where a copy needs to be kept for each thread; the second doesn't use more memory but uses more computational power as compared to the SoE for these larger ranges.
Page segmented algorithms are also important in order to take full advantage of modern multi-core CPU's which can reduce the clock time required for a given task by the ratio of the number of cores used, especially for page divided algorithms where each page processing can be assigned to a separate thread.
The way that the "primesieve" SoE implementation manages average operation times of about 2.5 CPU clock cycles as compared to the about 4.5 CPU clock cycles for Atkin and Bernstein's "primegen" sample implementation of the SoA is by extreme in-lining of loop unrolling with page segmentation; this technique can also be applied to the SoA but this particular pseudo code is unsuitable for doing that and an alternate technique of "hit rate" optimization that handles all of the modulos in a single loop (per quadratic equation loop) needs to be used; however, that is a complex algorithm that would seem to be beyond the scope of this question. Suffice it to say that the jumps in complexity so far are nothing as compared to what is required to add this extra optimization to the SoA, and even with that the ratio of outer loops still makes the SoE much more efficient.
Thus, while it is fairly easy to implement a naive SoA algorithm as in the original Wikipedia pseudo code or even using the more complex SoA pseudo code as in these two answers, it is extremely complex to write something that would compete with even Bernstein's "primegen" (SoA) let alone with the faster "primesieve" (SoE), whereas it is quite easy to write code that comes close to the same performance as "primegen" and just a little more work to write something that competes with "primesieve" using the SoE algorithm with the slight modifications to support page segmentation.
EDIT_ADD2: To put this theory into practical terms, I would normally write an example program to demonstrate these principles; however, in this case we have all we really need in the "primegen" source code, which is the reference implementation by the inventors of the SoA. On my machine (i7-2600K, 3.5 GHz) with the buffer size adjusted to 32768 from 8192 to reflect my L1 cache size, this "primespeed" sieves the primes to one billion in about 0.365 seconds. In fact, Atkin and Bernstein cheated a little in their comparison to their Eratosthenes sieve algorithm in that they only assigned about four Kilobyte buffers to it; adjusting that buffer size to about the same size (the "#define B32 1001" to "#define B32 8008") results in an execution time of about 0.42 seconds for "eratspeed" to about one billion, leaving it still slightly slower. However, as discussed above, it is doing more work in that it is doing about 0.4048 billion cull operations whereas the SoA is only doing about 0.26 billion combined toggle/cull operations. Using the algorithm from the above SoE pseudo code to change this is have maximum wheel factorization means that the SoE would actually do slightly less operation than the SoA at about 0.2505 billion, meaning the the execution time would be reduced to about two thirds or more or about 0.265 to 0.3 seconds making it faster than "primegen".
Meanwhile, if one compares the "primespeed" to the "eratspeed" code one sees that the paged SoE is much less complex than the SoA code, just as indicated by the above pseudo code.
Of course, one could counter that the SoA maintains the ratio of 0.26 times the sieving range operations to as high a sieving range as one wants whereas the SoE ratio climbs as per the equations given above - by about an extra 20% for a sieving range even to one hundred billion. However, given a slight tweak to both algorithms to increase buffer size and "sub slice" it to process it in L1 caches sized slices, the SoE will perform as predicted whereas the overhead of the SoA continues to rise due to the extra ratio of middle/inner loops for the SoA as compared to the SoE. While the SoE has an increase in culling operations by this about 20%, the SoA has an increase in quadratic processing loops of about this same 20% for a very small net gain if any. If these optimizations were made to the SoA, then it might just catch up to the maximally wheel factorized SoE at some very large sieving range such as ten to the nineteenth power or so, but we are talking hundreds of years of processing on a desktop computer even using multi-processing.
The SoA might be practical for sieving a sub-range (ie. a single page segment) from a very large range offset (say ten raised to the twenty-second power), but there is still that cost of extra memory or computation in dealing with all those values of 'x' and 'y' that are not necessary with the SoE. END_EDIT_ADD2
Just as for my first answer, the comparison says that SoA loses to the highly wheel factorized SoE on almost all fronts, as follows:
The SoA code is much more complex and it is harder to write and maintain, thus much harder to implement page segmentation which is so important for larger ranges in order to take advantage of maximum CPU capabilities as to memory access speeds.
The SoA likely loses on CPU clock cycles by at least 20% due to slightly more complex inner loop code and higher numbers of outer loops.
The SoA has slightly more operations as compared to a highly wheel factorized SoE (higher wheel factorization than as per the above discussion) for a reasonably large range of one to four billion.
The SoA wins for very large ranges by having an improvement factor of log (log n) as ranges get bigger, but that factor would only allow the SoA to catch up to the highly wheel factorized SoE at a very high number range (perhaps at about 10 raised to the nineteenth power) if ever, and then only if the relative complexity stays the same, which it isn't likely to do.
So I say again: "Why use the Sieve of Atkin when the Sieve of Eratosthenes is better in almost every respect and especially is less complex while having less of a constant computational factor overhead?" Based on my conclusions here that I feel the SoA is an inferior sieve for any kind of reasonable sieve range and will likely never write a page segmented version of the Sieve of Atkin but have written many of them in various computer languages for the Sieve of Eratosthenes. While the Sieve of Atkin is interesting intellectually and mathematically, it isn't really practical and the Atkin and Bernstein comparison was flawed in overly restricting the wheel factorization of their implementation of the Sieve of Eratosthenes used in their comparison to show an advantage of about 30%, which should have actually been about 60% other than for the constant implementation overheads of the Sieve of Atkin.
来源:https://stackoverflow.com/questions/19388106/the-sieve-of-atkin