Implementing the Page Segmented Sieve of Eratosthenes in Javascript


Although you have gathered from your reading that a Page Segmented Sieve of Eratosthenes is a fast way of finding the primes over a large range, your question's code (even when corrected) neither implements a Page Segmented SoE nor tests the code over a large range, nor is it particularly fast as SoE implementations go. The following discussion will show how to progress toward using a true Page Segmented SoE over a large range.

Synopsis

Following is a staged progression of increasingly fast implementations leading to your intention, with commentary explaining the reasons and the implementation details of every step. It includes runnable snippets in JavaScript, but these techniques are not limited to JavaScript; other languages don't place limits on some further refinements, such as multi-threading of the resulting pages (available in JavaScript only through Web Workers, which are difficult to control as to order of processing) and some further optimizations such as extreme loop unrolling. These limits are related to JavaScript code having to be Just In Time (JIT) compiled to native code by the JavaScript engine in your browser, as compared to languages that compile directly to very efficient native code such as C/C++, Nim, Rust, Free Pascal, Haskell, Julia, etc.

Chapter 1 - Setting a Baseline

First, let's start with a working version of your current code's algorithm used over a reasonably large range, with timing information to establish a baseline. The following code starts culling per prime at the square of the culling prime, which avoids the problem of culling the given prime values as well as some redundant starting culls, and there is no reason to generate an output array of resulting primes for the large range, as we can produce the primes directly from the culling array. As well, the determination of the answer is outside the timing because we will be developing better techniques to find an "answer", which for large ranges is usually the count of the number of primes found, the sum of the primes, the first occurrences of prime gaps, etc., none of which need to actually view the found primes:

"use strict";

function soePrimesTo(limit) {
  var sz = limit - 1;
  var cmpsts = new Uint8Array(sz); // index 0 represents 2; sz-1 is limit
  // no need to zero above composites array; zeroed on creation...
  for (var p = 2; ; ++p) {
    var sqr = p * p;
    if (sqr > limit) break; // stop culling when p is past the square root of the limit...
    if (cmpsts[p - 2] == 0 >>> 0) // 0/1 is false/true; false means prime...
      for (var c = sqr - 2; c < sz; c += p) // set true for composites...
          cmpsts[c] = 1 >>> 0; // use asm.js for some extra efficiency...
  }
  var bi = 0;
  return function () { // return successive primes per call...
    while (bi < sz && cmpsts[bi] != 0 >>> 0) ++bi;
    if (bi >= sz) return null;
    return bi++ + 2;
  };
}

// show it works...
var gen = soePrimesTo(100);
var p = gen();
var output = [];
while (p != null) { output.push(p); p = gen(); }
console.log("Primes to 100 are:  " + output + ".");

var n = 1000000000; // set the range here...
var elpsd = -new Date();
gen = soePrimesTo(n);
elpsd += +new Date();
var count = 0;
while (gen() != null) { count++; }
console.log("Found " + count + " primes up to " + n + " in " + elpsd + " milliseconds.");

Now, there is little purpose in quoting my run times for the above code sieving to a billion, as your times will almost certainly be faster; I am using a very low end tablet CPU, an Intel x5-Z8350 running at 1.92 Gigahertz (the CPU clock speed for a single active thread). I will quote my execution time exactly once, as an example of how to calculate the average CPU clock cycles per cull: I take the execution time for the above code of about 43350 milliseconds, which is 43.35 seconds, times 1.92 billion clocks per second, divided by about 2.514 billion culling operations to sieve to a billion (which can be calculated from the formula, or from the table for no wheel factorization, on the SoE Wikipedia page), to get about 33.1 CPU clock cycles per cull to a billion. From now on, we will only use CPU clock cycles per cull to compare performance.
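That arithmetic is simply elapsed time multiplied by the clock rate, divided by the operation count; a quick check using the figures quoted above:

// average CPU clock cycles per cull = seconds * clock rate / culling operations
var seconds = 43.35;  // measured execution time from above
var clockHz = 1.92e9; // single-thread clock rate of the test CPU
var culls = 2.514e9;  // culling operations to a billion with no wheel factorization
console.log((seconds * clockHz / culls).toFixed(1) + " clocks per cull"); // ~ 33.1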

Further note that these performance scores are very dependent on the browser JavaScript engine used, with the above score as run on Google Chrome (version 72); Microsoft Edge (version 44) is about seven times slower for the version of Page Segmented SoE towards which we are moving, and Firefox and Safari are likely to be close to Chrome in performance.

This performance is likely better than that of the code in previous answers due to using a Uint8Array TypedArray and more asm.js-style coercions, but the timing for such "one huge array" sieves (a GigaByte of memory used here) is subject to bottlenecks from main RAM memory access speed outside the CPU cache limits. This is why we are working toward a Page Segmented Sieve, but first let's do something about reducing the amount of memory used and the number of required culling cycles.

Chapter 2 - Bit Packing and Odds Only Wheel Factorization

The following code does bit packing, which requires slightly more complexity in the tight inner culling loop, but as it uses just one bit per composite number considered, it reduces the memory use by a factor of eight; as well, since two is the only even prime, it uses basic wheel factorization to sieve odds only for a further reduction of memory use by a factor of two and a reduction in number of culling operations by a factor of about 2.5.
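For reference, the bit packing used from here on stores the flag for bit index i in bit (i & 7) of byte (i >> 3) of a Uint8Array; a minimal sketch of the two operations used throughout the remaining code:

var flags = new Uint8Array(1024); // room for 8192 packed flags, zeroed on creation
function isComposite(i) { return (flags[i >> 3] & (1 << (i & 7))) != 0; } // test a bit
function markComposite(i) { flags[i >> 3] |= 1 << (i & 7); } // set a bit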

This minimal wheel factorization of odds-only works as follows:

  1. We make a "wheel" with two positions on it, one half that "hits" on the even numbers and the other half that hits on the odd numbers; thus this "wheel" has a circumferential span of two numbers as we "roll" it over all the potential prime numbers, but it only "windows" one out of two, or half, of the numbers over which it "rolls".
  2. We then divide all of the numbers over which we "roll" into two bit planes of successive values: in one bit plane we place all of the evens starting at four, and in the other bit plane we place all of the odds starting at three.
  3. Now we discard the even plane, as none of the numbers represented there can be prime.
  4. The starting cull position of p * p for the considered odd base primes is always an odd number, because multiplying an odd number by an odd number always produces an odd number.
  5. When we increment the index into the odd bit plane by the prime value p, we are actually advancing by two times p in terms of represented numbers, since adding an odd to an odd produces an even, which would fall in the bit plane we discarded; stepping by p again gets us back to the odd bit plane.
  6. The odd bit plane index positions automatically account for this, since we have thrown away all the even values that previously sat between each odd bit position index.
  7. That is why we can cull by just repeatedly adding the prime to the index in the following code, as the short illustration below shows.
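The illustration of points 4 through 7: in the odd bit plane, index i represents the value i + i + 3 (as in the code below), so stepping the index by p steps the represented value by 2 * p:

// for p = 3, the start index is (p * p - 3) >> 1 = 3, representing the value 9;
// each index step of p then advances the represented value by 2 * p = 6...
var p = 3;
for (var c = (p * p - 3) >> 1; c <= 12; c += p)
  console.log("index " + c + " -> value " + (c + c + 3)); // values 9, 15, 21, 27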

"use strict";

function soeOddPrimesTo(limit) {
  var lmti = (limit - 3) >> 1; // bit index for limit value
  var sz = (lmti >> 3) + 1; // size in bytes
  var cmpsts = new Uint8Array(sz); // index 0 represents 3
  // no need to zero above composites array; zeroed on creation...
  for (var i = 0; ; ++i) {
    var p = i + i + 3; // the square index is (p * p - 3) / 2 but we
    var sqri = (i << 1) * (i + 3) + 3; // calculate start index directly 
    if (sqri > lmti) break; // stop culling when p is past the square root of the limit...
    // following does bit unpacking to test the prime bit...
    // 0/1 is false/true; false means prime...
    // use asm.js with the shifts to make uint8's for some extra efficiency...
    if ((cmpsts[i >> 3] & ((1 >>> 0) << (i & 7))) == 0 >>> 0)
      for (var c = sqri; c <= lmti; c += p) // set true for composites...
          cmpsts[c >> 3] |= (1 >>> 0) << (c & 7); // masking in the bit
  }
  var bi = -1;
  return function () { // return function to return successive primes per call...
    if (bi < 0) { ++bi; return 2 } // the only even prime is a special case
    while (bi <= lmti && (cmpsts[bi >> 3] & ((1 >>> 0) << (bi & 7))) != 0 >>> 0) ++bi;
    if (bi > lmti) return null; // return null following the last prime
    return (bi++ << 1) + 3; // else calculate the prime from the index
  };
}

// show it works...
var gen = soeOddPrimesTo(100);
var p = gen();
var output = [];
while (p != null) { output.push(p); p = gen(); }
console.log("Primes to 100 are:  " + output + ".");

var n = 1000000000; // set the range here...
var elpsd = -new Date();
gen = soeOddPrimesTo(n);
elpsd += +new Date();
var count = 0;
while (gen() != null) { count++; }
console.log("Found " + count + " primes up to " + n + " in " + elpsd + " milliseconds.");

The performance is now about 34.75 CPU clock cycles per culling operation for 1.0257 billion culling operations over a range of a billion for odds only (from Wikipedia), meaning that the reduction in time is almost entirely due to the reduction in the number of culling operations; the extra complexity of the "bit-twiddling" bit packing takes only about the same amount of extra time as is saved due to the reduction in memory use by a factor of sixteen. Thus, this version uses one sixteenth the memory and is about 2.5 times faster.

But we aren't done yet, page segmentation can speed us up even more just as your source told you.

Chapter 3 - Page Segmentation Including Faster Primes Counting

So what is Page Segmentation applied to the SoE and what does it do for us?

Page Segmentation is dividing the work of sieving from one huge array sieved at once to a series of smaller pages that are sieved successively. It then requires a little more complexity in that there must be a separate stream of base primes available, which can be obtained by sieving recursively with an inner sieve generating the list of memoized base primes to be used by the main sieve. As well, the generation of output results is generally a little more complex as it involves successive scanning and reduction for each of the generated sieved pages.

Page Segmentation has many benefits as follows:

  1. It further reduces memory requirements. Sieving to a billion with the previous odds-only code takes about 64 MegaBytes of RAM, and we couldn't use that algorithm to sieve beyond a range of about 16 to 32 billion (even if we wanted to wait quite a long time) due to using all the memory available to JavaScript. By sieving with a page size of (say) the square root of that amount, we could sieve to a range of that amount squared, or beyond anything for which we would have time to wait. We also have to store the base primes up to the square root of the desired range, but that is only a few tens of MegaBytes for any range we would care to consider.
  2. It increases memory access speed, which has a direct impact on the number of CPU clock cycles per culling operation. If culling operations mostly occur within the CPU caches, memory access changes from tens of CPU clock cycles per access for main RAM memory (depending on the CPU and the quality and type of RAM) to about one clock cycle for the CPU L1 cache (smaller, at about 16 KiloBytes) or about eight clock cycles for the CPU L2 cache (larger, at least about 128 KiloBytes), and we can work out culling algorithms so that we use all the caches to their best advantage: the small fast L1 cache for the majority of the operations with small base prime values, the larger slightly-slower L2 cache for middle sized base primes, and main RAM access only for the very few operations using large base primes over extremely huge ranges.
  3. It opens the possibility of multi-threading by assigning the work of sieving each largish page to a different CPU core for fairly coarse grained concurrency, although this doesn't apply to JavaScript other than through the use of Web Workers (messy).

Other than the added complexity, Page Segmentation has one other problem to work around: unlike the "one huge array" sieve, where start indexes are calculated easily once and then used for the entire array, segmented sieves require either that the start addresses be calculated by modulo (division) operations per prime per page (computationally expensive), or that additional memory be used to store the index reached per base prime per page so the start index doesn't have to be recalculated; this last technique precludes multi-threading, as these arrays get out of sync between threads. The best solution, which will be used in the Ultimate version, is a combination of these techniques: several page segments are grouped to form a reasonably large work unit for threading, so that the address calculations take an acceptably small portion of the total time, and the index storage tables are used for the base primes across these larger work units per thread, so that the complex calculations need only be done once per larger work unit. Thus we get both the possibility of multi-threading and reduced overhead. However, the following code doesn't reduce this overhead, which costs about 10% to 20% when sieving to a billion.
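As a minimal sketch (using the same variable meanings as the Chapter 3 code below), the per-page modulo start index calculation for one base prime looks as follows:

// first cull bit index for base prime p within a page starting at bit index lowi,
// where s0 = (p * p - 3) >> 1 is the index of the prime's square in the full range...
function pageStartIndex(p, s0, lowi) {
  if (s0 >= lowi) return s0 - lowi; // the prime's square falls inside or past this page
  var r = (lowi - s0) % p; // the modulo operation that costs us per prime per page
  return r == 0 ? 0 : p - r; // else the first multiple of p at or beyond the page start
}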

In addition to Page Segmentation, the following code adds an efficient counting of the found primes by using a Counting Look Up Table (CLUT) population counting algorithm that counts 32 bits at a time, so that the overhead of continuously finding the number of found primes is a small percentage of the sieving time. If this were not done, enumerating the individual found primes just to determine how many there are would take at least as long as it takes to sieve over a given range. Similar fast routines can easily be developed to do such things as sum the found primes, find prime gaps, etc.

START_EDIT:

The following code adds one other speed-up: for smaller primes (where this optimization is effective), the code does a form of loop separation by recognizing that the culling operations follow an eight-step pattern. This is because a byte has an even number of bits and we are culling by odd primes that will return to the same bit position in a byte every eight culls; this means that, for each of the bit positions, we can simplify the inner culling loop to mask a constant bit, greatly simplifying the inner loop and making culling up to about two times faster, since each culling loop in the pattern no longer needs to do the "bit-twiddling" bit packing operations (a sketch of the shape of this follows below). This change saves about 35% of the execution time sieving to a billion. It can be disabled by changing the 32 in the culling loop to 0. It also sets the stage for the native code extreme unrolling of the eight loops due to this pattern, which can increase the culling operation speed by another factor of about two when native code compilers are used.
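A sketch of that split loop for one small prime (the real code below embeds this in the page culling function):

// loop-separated culling of one small odd prime p across a bit-packed page `buf`:
// at most eight outer passes, one per bit position the prime visits, each with a
// constant mask; the inner loop then strides whole bytes (p bytes = 8 * p bits)...
function cullSmallPrime(buf, bitsz, p, s) { // s is the first cull bit index in the page
  var len = bitsz >> 3;
  var slmt = Math.min(bitsz, s + (p << 3)); // limits the outer loop to eight passes
  for (; s < slmt; s += p) {
    var msk = 1 << (s & 7); // fixed mask for this bit position
    for (var c = s >> 3; c < len; c += p) // no per-cull "bit-twiddling" needed
      buf[c] |= msk;
  }
}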

A further minor modification made the loop faster for the larger primes (greater than 8192) by using a Look Up Table (LUT) for the mask value rather than the left shift operation, saving about half a CPU clock cycle per culling operation on average when culling a range of a billion; that saving would increase slightly as ranges go up from a billion, but isn't that effective in JavaScript and has been removed.

END_EDIT

ANOTHER_EDIT:

As well as the above edits, we have removed the BITMASK LUT, now zero the Sieve Buffer by fast byte copying from a zero buffer of the same size, and have added a Counting LUT population count technique, for about a 10% overall gain in speed.

END_ANOTHER_EDIT

// JavaScript implementation of Page Segmented Sieve of Eratosthenes...
// This takes almost no memory as it is bit-packed and odds-only,
// and only uses memory proportional to the square root of the range;
// it is also quite fast for large ranges due to improved cache associativity...

"use strict";

const ZEROSPTRN = new Uint8Array(16384);

function soePages(bitsz) {
  let len = bitsz >> 3;
  let bpa = [];
  let buf =  new Uint8Array(len);
  let lowi = 0;
  let gen;
  return function () {
    let nxt = 3 + ((lowi + bitsz) << 1); // just beyond the current page
    buf.set(ZEROSPTRN.subarray(0,buf.length)); // clear sieve array!
    if (lowi <= 0) { // special culling for first page as no base primes yet:
      for (let i = 0, p = 3, sqr = 9; sqr < nxt; ++i, p += 2, sqr = p * p)
        if ((buf[i >> 3] & (1 << (i & 7))) === 0)
          for (let j = (sqr - 3) >> 1; j < bitsz; j += p)
            buf[j >> 3] |= 1 << (j & 7);
    } else { // other than the first "zeroth" page:
      if (!bpa.length) { // if this is the first page after the zero one:
        gen = basePrimes(); // initialize separate base primes stream:
        bpa.push(gen()); // keep the next prime (3 in this case)
      }
      // get enough base primes for the page range...
      for (let p = bpa[bpa.length - 1], sqr = p * p; sqr < nxt;
           p = gen(), bpa.push(p), sqr = p * p);
      for (let i = 0; i < bpa.length; ++i) { // for each base prime in the array
        let p = bpa[i] >>> 0;
        let s = (p * p - 3) >>> 1; // compute the start index of the prime squared
        if (s >= lowi) // adjust start index based on page lower limit...
          s -= lowi;
        else { // for the case where this isn't the first prime squared instance
          let r = (lowi - s) % p;
          s = (r != 0) ? p - r : 0;
        }
        if (p <= 32) {
          for (let slmt = Math.min(bitsz, s + (p << 3)); s < slmt; s += p) {
            let msk = ((1 >>> 0) << (s & 7)) >>> 0;
            for (let c = s >>> 3; c < len; c += p)
              buf[c] |= msk;
          }
        }
        else
          // inner tight composite culling loop for given prime number across page
          for (; s < bitsz; s += p)
            buf[s >> 3] |=  ((1 >>> 0) << (s & 7)) >>> 0;
      }
    }
    let olowi = lowi;
    lowi += bitsz;
    return [olowi, buf];
  };
}

function basePrimes() {
  var bi = 0;
  var lowi;
  var buf;
  var len;
  var gen = soePages(256);
  return function () {
    while (true) {
      if (bi < 1) {
        var pg = gen();
        lowi = pg[0];
        buf = pg[1];
        len = buf.length << 3;
      }
      //find next marker still with prime status
      while (bi < len && buf[bi >> 3] & ((1 >>> 0) << (bi & 7))) bi++;
      if (bi < len) // within buffer: output computed prime
        return 3 + ((lowi + bi++) << 1);
      // beyond buffer range: advance buffer
      bi = 0;
      lowi += len; // and recursively loop to make a new page buffer
    }
  };
}

const CLUT = function () {
  let arr = new Uint8Array(65536);
  for (let i = 0; i < 65536; ++i) {
    let nmbts = 0|0; let v = i;
    while (v > 0) { ++nmbts; v &= (v - 1)|0; }
    arr[i] = nmbts|0;
  }
  return arr;
}();

function countPage(bitlmt, sb) {
  let lst = bitlmt >> 5;
  let pg = new Uint32Array(sb.buffer);
  let cnt = (lst << 5) + 32;
  for (let i = 0 | 0; i < lst; ++i) {
    let v = pg[i]; cnt -= CLUT[v & 0xFFFF]; cnt -= CLUT[v >>> 16];
  }
  let n = pg[lst] | (0xFFFFFFFE << (bitlmt & 31));
  cnt -= CLUT[n & 0xFFFF]; cnt -= CLUT[n >>> 16];
  return cnt;
}

function countSoEPrimesTo(limit) {
  if (limit < 3) {
    if (limit < 2) return 0;
    return 1;
  }
  var cnt = 1;
  var lmti = (limit - 3) >>> 1;
  var lowi;
  var buf;
  var len;
  var nxti;
  var gen = soePages(131072);
  while (true) {
    var pg = gen();
    lowi = pg[0];
    buf = pg[1];
    len = buf.length << 3;
    nxti = lowi + len;
    if (nxti > lmti) {
      cnt += countPage(lmti - lowi, buf);
      break;
    }
    cnt += countPage(len - 1, buf);
  }
  return cnt;
}

var limit = 1000000000; // sieve to this limit...
var start = +new Date();
var answr = countSoEPrimesTo(limit);
var elpsd = +new Date() - start;
console.log("Found " + answr + " primes up to " + limit + " in " + elpsd + " milliseconds.");

As implemented here, that code takes about 13.8 CPU clock cycles per cull sieving to a billion. The remaining roughly 20% of overhead from the start address calculations is automatically reduced when using the following maximum wheel factorization algorithm, due to the effective page size increasing by a factor of 105, so that this overhead becomes only a few percent, comparable to the few percent used for array filling and result counting.

Chapter 4 - Further Advanced Work to add Maximal Wheel Factorization

Now we consider the more extensive changes needed to use maximal wheel factorization (not only by 2 for "odds-only", but also by 3, 5, and 7, for a wheel that covers a span of 210 potential primes instead of a span of just 2) and also to pre-cull on initialization of the small sieving arrays so that it is not necessary to cull by the following primes of 11, 13, 17, and 19. This reduces the number of composite number culling operations when using the page segmented sieve by a factor of about four for a range of a billion (as shown in the tables/calculated from the formulas in the Wikipedia article for the "combo wheel") and can be written so that it runs about four times as fast due to the reduced operations, with each culling operation at about the same speed as for the above code.

The way to do the 210-span wheel factorization efficiently follows on from the "odds-only" method: the current algorithm above can be thought of as sieving one bit-packed plane out of two, as explained in the previous chapter, where the other plane can be eliminated as it contains only the even numbers above two. For the 210-span, we can define 48 bit-packed arrays of this size representing the possible primes of 11 and above, where all of the other 162 planes contain numbers which have factors of two, three, five, or seven and thus don't need to be considered. Each bit plane can then be culled just by repeated indexing in increments of the base prime, exactly as the odd number plane was, with the multiplications automatically being taken care of by the structure just as for odds-only. In this way it is just as efficient to sieve with lower memory requirements (48/210 as compared to "odds-only" at 1/2) and with as much efficiency as for odds-only: one 48-plane "page" of 16 KiloBytes = 131072 bits per plane, times the wheel span of 210, represents a range of 27,525,120 numbers per sieve page segment, so only about 37 page segments are needed to sieve to a billion (instead of almost four thousand as above), and therefore there is less overhead in start address calculation per base prime per page segment, for a further gain in efficiency.
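In index terms, a bit at wheel index w in residue plane ri represents the number 210 * w + RESIDUES[ri], with RESIDUES as tabulated in the code of the following chapter; a small sketch assuming that mapping:

// the first few of the 48 residues modulo 210; see the full RESIDUES table below...
const RESIDUES_DEMO = [23, 29, 31, 37];
for (let w = 0; w < 2; ++w) // wheel index: which 210-number rotation we are on
  for (let ri = 0; ri < RESIDUES_DEMO.length; ++ri)
    console.log("plane " + ri + ", wheel " + w + " -> " + (210 * w + RESIDUES_DEMO[ri]));
// a 131072-bit page per plane thus spans 131072 * 210 = 27,525,120 numbers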

Although the extended code described above is a couple of hundred lines and too long to post here, it can count the number of primes to a billion in about two seconds on my low end Intel 1.92 Gigahertz CPU using the Google V8 JavaScript engine, which is about five times slower than the same algorithm run in native code.

Although the above code is quite efficient up to a range of about 16 billion, other improvements can help maintain the efficiency to even larger ranges of tens of thousands of billions, such as 1e14 or more. We accomplish this by adjusting the effective page sizes upward so that they are never smaller than the square root of the full range being sieved, yet sieving them incrementally in 16 KiloByte chunks for small primes and 128 KiloByte chunks for medium primes, and only sieving one huge array, as per our baseline implementation, for the very few culling operations used with the largest base prime sizes. In this way, our clocks per cull don't increase by more than a small factor of up to about two for the largest range we would likely consider.

As this answer is close to the limited size of 30,000 characters, further discussion on Maximal Wheel Factorization is continued on my followup Chapter 4.5a and (future) Chapter 4.5b answers for actual implementations of the techniques described above.

Chapter 5 - What JavaScript (and other VM Languages) Can't Do

For JavaScript and other Virtual Machine languages, the minimum culling loop time is on the order of ten CPU cycles per culling loop, and that is not likely to change much. This is about three to four times slower than the roughly three CPU clock cycles that can easily be achieved using the same algorithm with languages that compile directly to efficient machine code, such as C/C++, Nim, Rust, Free Pascal, Haskell, Julia, etc.

In addition, there are extreme loop unrolling techniques that can be used with at least some of those languages that can produce a reduction in average culling operation cycles by a further factor of about two that is denied to JavaScript.

Multi-threading can reduce the execution time by about the factor of effective CPU cores used, but with JavaScript the only way to get this is through the use of Web Workers, and that is messy to synchronize. On my machine, I have four cores but only gain a factor of three in speed due to the CPU clock rate being reduced to three quarters with all cores active; this factor of three is not easy to gain for JavaScript.

So this is about the state-of-the-art using JavaScript, and other current VM languages have about the same limitations, other than that they can easily use multi-threading; the combination of the above factors means native code compilers can be about twenty times faster than JavaScript (including multi-threading, and even more with new CPUs with huge numbers of cores).

However, I believe that the future of Web programming in three to five years will be Web Assembly, which has the potential to overcome all of these limitations. It is very near to supporting multi-threading now, and although currently only about 30% faster than JavaScript for this algorithm on Chrome, it is only a little slower than native code on some current browsers when compiled from some languages using some Web Assembly compilers. These are still early days in the development of efficient compilers to Web Assembly and of efficient browser compilation to native code, but as Web Assembly is closer to native code than most VMs, it could easily be improved to produce native code that is as fast, or nearly as fast, as code from those other native-code compiling languages.

However, other than for compilation of JavaScript libraries and Frameworks into Web Assembly, I don't believe the future of the Web will be JavaScript-to-Web-Assembly compilers, but rather compilation from some other language. My favourite picks for the future of Web programming are F# (with perhaps the Fable implementation converted to produce Web Assembly rather than JavaScript/asm.js) or Nim. There is even some possibility that Web Assembly could be produced that supports and shows the advantage of the extreme loop unrolling, for very close to the fastest of known Page Segmented SoE speeds.

Conclusion

We have built a Page Segmented Sieve of Eratosthenes in JavaScript that is suitable for sieving large ranges in the billions, and we have a means to further extend this work. The resulting code needs about a tenth the culling operations (when fully wheel factorized) and the culling operations are about three times faster, meaning the code over a given (large) range is about 30 times faster, while the reduced memory use means that one can sieve to ranges near the 53-bit mantissa limit of about 9e15 in something on the order of a year (just leave a Chrome browser tab open and back up the power supply).

Although there are some other minor tweaks possible, this is about the state of the art in sieving primes using JavaScript: while it isn't as fast as native code for the given reasons, it is fast enough to find the number of primes to 1e14 in a day or two (even on a mid range smartphone) by leaving a browser tab open for the required computation time. This is quite amazing, as the number of primes in this range wasn't known until 1985, and then only by using numerical analysis techniques rather than a sieve, since the computers of that day weren't fast enough, even using the fastest coding techniques, to do it in a reasonable and economic amount of time. We can now do this in just a few hours using these algorithms with the best of the native code compilers on modern desktop computers, but also in an acceptable time with JavaScript on a smart phone!

This expands on my previous answer by adding what was promised but for which there wasn't room within the 30,000 character per answer limit:

Chapter 4.5a - the Basics of Full Wheel Factorization

The non-Maximally-Wheel-Factorized Page Segmented Sieve of Eratosthenes version from Chapter 3 in the previous answer was written as a Prime Generator, with the output recursively fed back as an input for the base prime number feed; although that was very elegant and expandable, for the following work I have taken a step back to a more imperative style of code so that readers can more readily understand the core concepts of Maximal Wheel Factorization. In a future Chapter 4.5b I will combine the concepts developed in the following example back into a Prime Generator style and add some extra refinements that won't make it any faster for smaller ranges of a few billion, but will make the concept usable without much loss of speed up to trillions, or to hundreds or thousands of trillions; the Prime Generator format is then more useful in making the program adapt as the range gets larger.

The following example's main extra refinements are in the various Look Up Tables (LUT's) used for efficient addressing of the wheel modulo residues, the generation of the special start addresses LUT that quite simply allows one to calculate the culling start address for each modulo residue bit plane given the start address wheel index and first modulo residue bit plane index of the very first cull in the entire structure, and the Sieve Buffer composite number representation culling function that uses these.

This example is based on a 210-number-circumference wheel using the small primes of two, three, five, and seven, as that seems to hit the "sweet spot" of efficiency for the size of the arrays and the number of bit planes; experiments have shown that about another 5% in performance can be gained by adding the next prime of eleven to the wheel for a circumference of 2310 numbers. The reason that this wasn't done is that the initialization time goes up greatly, and it makes the code hard to time for smaller ranges of "only" a billion, as there are then only about four segments needed to reach that point and granularity becomes a problem.

Note that the first sieved number is 23 which is the first prime past the wheel primes and pre-cull primes; by using this, we avoid the problems of dealing with arrays that start from "one" and the problems of restoring the wheel primes that are eliminated and must be added back in by some algorithms.

Basically, for every Page Segment culling sweep, there is a starting loop that fills a start address array with the wheel index and modulo residue index of the first cull address within the segment for each of the base primes less than the square root of the maximum number represented in the Page Segment. This start address array is then used in sieving each modulo residue bit plane (48 of them) completely in turn, with all base primes scanned for each bit plane and the appropriate offset calculated from the segment start address per base prime by use of the multiplier and offset in the WHLSTARTS LUT: the wheel index of the base prime is multiplied by the looked-up multiplier, and the looked-up offset is added, to obtain the wheel index start position for the given modulo residue bit plane. From there on, the per-bit-plane culling is just as for the Chapter 3 odd number bit plane. This is done 48 times, once for each of the bit planes, but the effective range per page segment for a 16 KiloByte buffer (per bit plane) is 131072 bits times the 210 wheel span, or 27,525,120 numbers per Page Segment.
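To make that concrete: each 16-bit WHLSTARTS entry packs twice the residue multiplier above the low eight bits and a wheel index offset in the low eight bits, so the per-plane adjustment applied per base prime (as in the cullSieveBuffer function below) is just the following:

// adjust the plane-zero start wheel index `sd` for one residue bit plane, where
// `adj` is the packed WHLSTARTS entry and `pd` is the wheel index of the base prime...
function adjustStart(sd, adj, pd) {
  return sd + ((adj >> 8) * pd) + (adj & 0xFF); // doubled multiplier times wheel index, plus offset
}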

Use of this sieve reduces memory use, as a ratio of the effective range of the segment, by a factor of 48 over 105 as compared to the Chapter 3 odds-only sieve, which is to say to less than half; but because each segment has all 48 bit planes, the full Sieve Buffer is 16 KiloBytes times the 48 bit planes, or 768 KiloBytes (three quarters of a MegaByte). However, a Sieve Buffer of this size is only efficient up to about 16 billion, and our next example in the next chapter will adapt the size of the buffer for huge ranges so that it grows to about a hundred MegaBytes for the largest ranges. For multi-threaded languages (not JavaScript), this would be a requirement per thread.

Additional memory requirements are for the storage of the base primes array of 32-bit values, each representing the wheel index of the base prime together with its modulo residue index (necessary for the modulo address calculations as explained above); for a range of a billion there are about 3600 base primes at four bytes each, or about 14,400 bytes, with the additional start address array the same size. These arrays grow as the square root of the maximum value to be sieved, reaching about 5,761,455 values (at four bytes each) for the base primes less than a hundred million necessary to sieve to 1e16 (ten thousand trillion), or about 23 MegaBytes each. Although half of this memory is required per thread, it is much smaller than the memory required for the expanded Sieve Buffers themselves.
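Each of those 32-bit base prime values packs the wheel index above the low six bits and the modulo residue index (0 to 47) in the low six bits, so the prime itself is recovered as 210 times the wheel index plus the indexed residue; a small sketch of the pack and unpack:

// pack/unpack of one base prime entry: for example, the prime 233 = 210 * 1 + 23
// is wheel index 1 with residue index 0 (RESIDUES[0] == 23 in the tables below)...
const ndxdprm = (1 << 6) + 0; // packed form, as in the `bparr` construction below
const prmndx = ndxdprm & 0x3F; // recover the residue index -> 0
const pd = ndxdprm >> 6; // recover the wheel index -> 1 (sign-preserving shift)
console.log(210 * pd + 23); // -> 233, using RESIDUES[prmndx] == 23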

A further refinement applied in the following example is the use of a "combo" sieve, where the Sieve Buffer is pre-filled from a larger wheel pattern from which the factors of the primes eleven, thirteen, seventeen, and nineteen have been eliminated. Bigger pre-cull eliminations than this are impractical, as the saved pre-culled pattern grows from only about 64 KiloBytes per modulo bit plane to about twenty times that, or about one and a half MegaBytes for each of the 48 modulo residue planes (roughly seventy MegaBytes in total), just by adding the extra factor of the prime 23 - quite a large cost in memory and initialization for only a few percent in performance. Do note that this array can be shared for use across all threads.

As implemented, the WHLPTRN array is about 64 KiloBytes times 48 modulo bit planes, or about 3 MegaBytes, which isn't that large and is a fixed size that doesn't change with increasing sieving range; this is quite a workable size as to access speed and initialization time.

These "Maximal Wheel Factorization" refinements reduce the total number of composite number culling operation loops for sieving a range of a billion from about a billion operations for the Chapter 3 odds-only example to about a quarter billion for this "combo" sieve, with the goal to try to keep the number of CPU clock cycles the same per culling operation and thus a gain of a factor of four in speed.

The JavaScript example as described above is implemented as follows:

"use strict";

const LIMIT =  1000000000;

const WHLPRMS = new Uint32Array([2,3,5,7,11,13,17,19]);
const FRSTSVPRM = 23;
const WHLODDCRC = 105 | 0;
const WHLHITS = 48 | 0;
const WHLODDGAPS = new Uint8Array([
  3, 1, 3, 2, 1, 2, 3, 3, 1, 3, 2, 1, 3, 2, 3, 4,
  2, 1, 2, 1, 2, 4, 3, 2, 3, 1, 2, 3, 1, 3, 3, 2,
  1, 2, 3, 1, 3, 2, 1, 2, 1, 5, 1, 5, 1, 2, 1, 2 ]);
const RESIDUES = new Uint32Array([
  23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
  73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 121, 127,
  131, 137, 139, 143, 149, 151, 157, 163, 167, 169, 173, 179,
  181, 187, 191, 193, 197, 199, 209, 211, 221, 223, 227, 229, 233 ]);
const WHLNDXS = new Uint8Array([
  0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6,
  7, 7, 7, 8, 9, 9, 9, 10, 10, 11, 12, 12, 12, 13, 13,
  14, 14, 14, 15, 15, 15, 15, 16, 16, 17, 18, 18, 19, 20, 20,
  21, 21, 21, 21, 22, 22, 22, 23, 23, 24, 24, 24, 25, 26, 26,
  27, 27, 27, 28, 29, 29, 29, 30, 30, 30, 31, 31, 32, 33, 33,
  34, 34, 34, 35, 36, 36, 36, 37, 37, 38, 39, 39, 40, 41, 41,
  41, 41, 41, 42, 43, 43, 43, 43, 43, 44, 45, 45, 46, 47, 47, 48 ]);
const WHLRNDUPS = new Uint8Array( // two rounds to avoid overflow, used in start address calcs...
  [ 0, 3, 3, 3, 4, 7, 7, 7, 9, 9, 10, 12, 12, 15, 15,
    15, 18, 18, 18, 19, 22, 22, 22, 24, 24, 25, 28, 28, 28, 30,
    30, 33, 33, 33, 37, 37, 37, 37, 39, 39, 40, 42, 42, 43, 45,
    45, 49, 49, 49, 49, 52, 52, 52, 54, 54, 57, 57, 57, 58, 60,
    60, 63, 63, 63, 64, 67, 67, 67, 70, 70, 70, 72, 72, 73, 75,
    75, 78, 78, 78, 79, 82, 82, 82, 84, 84, 85, 87, 87, 88, 93,
    93, 93, 93, 93, 94, 99, 99, 99, 99, 99, 100, 102, 102, 103, 105,
    105, 108, 108, 108, 109, 112, 112, 112, 114, 114, 115, 117, 117, 120, 120,
    120, 123, 123, 123, 124, 127, 127, 127, 129, 129, 130, 133, 133, 133, 135,
    135, 138, 138, 138, 142, 142, 142, 142, 144, 144, 145, 147, 147, 148, 150,
    150, 154, 154, 154, 154, 157, 157, 157, 159, 159, 162, 162, 162, 163, 165,
    165, 168, 168, 168, 169, 172, 172, 172, 175, 175, 175, 177, 177, 178, 180,
    180, 183, 183, 183, 184, 187, 187, 187, 189, 189, 190, 192, 192, 193, 198,
    198, 198, 198, 198, 199, 204, 204, 204, 204, 204, 205, 207, 207, 208, 210, 210 ]);
const WHLSTARTS = function () {
  let arr = new Array(WHLHITS);
  for (let i = 0; i < WHLHITS; ++i) arr[i] = new Uint16Array(WHLHITS * WHLHITS);
  for (let pi = 0; pi < WHLHITS; ++pi) {
    let mltsarr = new Uint16Array(WHLHITS);
    let p = RESIDUES[pi]; let i = (p - FRSTSVPRM) >> 1;
    let s = ((i << 1) * (i + FRSTSVPRM) + (FRSTSVPRM * ((FRSTSVPRM - 1) >> 1))) | 0;
    // build array of relative mults and offsets to `s`...
    for (let ci = 0; ci < WHLHITS; ++ci) {
      let rmlt = (RESIDUES[((pi + ci) % WHLHITS) | 0] - RESIDUES[pi | 0]) >> 1;
      rmlt += rmlt < 0 ? WHLODDCRC : 0; let sn = s + p * rmlt;
      let snd = (sn / WHLODDCRC) | 0; let snm = (sn - snd * WHLODDCRC) | 0;
      mltsarr[WHLNDXS[snm]] = rmlt | 0; // new rmlts 0..209!
    }
    let ondx = (pi * WHLHITS) | 0
    for (let si = 0; si < WHLHITS; ++si) {
      let s0 = (RESIDUES[si] - FRSTSVPRM) >> 1; let sm0 = mltsarr[si];
      for (let ci = 0; ci < WHLHITS; ++ci) {
        let smr = mltsarr[ci];
        let rmlt = smr < sm0 ? smr + WHLODDCRC - sm0 : smr - sm0;
        let sn = s0 + p * rmlt; let rofs = (sn / WHLODDCRC) | 0;
        // we take the multiplier times 2 so it multiplies by the odd wheel index...
        arr[ci][ondx + si] = ((rmlt << 9) | (rofs | 0)) >>> 0;
      }
    }
  }
  return arr;
}();
const PTRNLEN = (11 * 13 * 17 * 19) | 0;
const PTRNNDXDPRMS = new Uint32Array([ // the wheel index plus the modulo index
  (-1 << 6) + 44, (-1 << 6) + 45, (-1 << 6) + 46, (-1 << 6) + 47 ]);

function makeSieveBuffer(szbits) { // round up to 32 bit boundary!
  let arr = new Array(WHLHITS); let sz = ((szbits + 31) >> 5) << 2;
  for (let ri = 0; ri < WHLHITS; ++ri) arr[ri] = new Uint8Array(sz);
  return arr;
}

function cullSieveBuffer(lwi, bps, prmstrts, sb) {
  let len = sb[0].length; let szbits = len << 3;
  let lowndx = lwi * WHLODDCRC; let nxti = (lwi + szbits) * WHLODDCRC;
  // set up prmstrts for use by each modulo residue bit plane...
  for (let pi = 0, bpslmt = bps.length; pi < bpslmt; ++pi) {
    let ndxdprm = bps[pi];
    let prmndx = ndxdprm & 0x3F; let pd = ndxdprm >> 6;
    let rsd = RESIDUES[prmndx]; let bp = pd * (WHLODDCRC << 1) + rsd;
    let i = (bp - FRSTSVPRM) >> 1;
    let s = ((i << 1) * (i + FRSTSVPRM) + (FRSTSVPRM * ((FRSTSVPRM - 1) >> 1))) | 0;
    if (s >= nxti) { prmstrts[pi] = 0xFFFFFFFF; break; } // enough base primes!
    if (s >= lowndx) s -= lowndx | 0;
    else {
      let wp = (rsd - FRSTSVPRM) >> 1; let r = (lowndx - s) % (WHLODDCRC * bp);
      s = r == 0
        ? 0 | 0
        : (bp * (WHLRNDUPS[(wp + ((r + bp - 1) / bp) | 0) | 0] - wp) - r) | 0;
    }
    let sd = (s / WHLODDCRC) | 0; let sn = WHLNDXS[(s - sd * WHLODDCRC) | 0];
    prmstrts[pi | 0] = (sn << 26) | sd;
  }
  for (let ri = 0; ri < WHLHITS; ++ri) {
    let pln = sb[ri]; let plnstrts = WHLSTARTS[ri];
    for (let pi = 0, bpslmt = bps.length; pi < bpslmt; ++pi) {
      let prmstrt = prmstrts[pi | 0]; if (prmstrt == 0xFFFFFFFF) break;
      let ndxdprm = bps[pi | 0];
      let prmndx = ndxdprm & 0x3F; let pd = ndxdprm >> 6;
      let bp = (((pd * (WHLODDCRC << 1)) | 0) + RESIDUES[prmndx]) | 0;
      let sd = prmstrt & 0x3FFFFFF; let sn = prmstrt >>> 26;
      let adji = (prmndx * WHLHITS + sn) | 0; let adj = plnstrts[adji];
      sd += ((((adj >> 8) * pd) | 0) + (adj & 0xFF)) | 0;
      if (bp < 64) {
        for (let slmt = Math.min(szbits, sd + (bp << 3)) | 0; sd < slmt; sd += bp) {
          let msk = (1 << (sd & 7)) >>> 0;
          for (let c = sd >> 3; c < len; c += bp) pln[c] |= msk;
        }
      }
      else for (; sd < szbits; sd += bp) pln[sd >> 3] |= (1 << (sd & 7)) >>> 0;
    }
  }
}

const WHLPTRN = function () {
  let sb = makeSieveBuffer((PTRNLEN + 16384) << 3); // avoid overflow when filling!
  cullSieveBuffer(0, PTRNNDXDPRMS, new Uint32Array(PTRNNDXDPRMS.length), sb);
  return sb;
}();

const CLUT = function () {
  let arr = new Uint8Array(65536);
  for (let i = 0; i < 65536; ++i) {
    let nmbts = 0|0; let v = i;
    while (v > 0) { ++nmbts; v &= (v - 1)|0; }
    arr[i] = nmbts|0;
  }
  return arr;
}();

function countSieveBuffer(bitlmt, sb) {
  let lstwi = (bitlmt / WHLODDCRC) | 0;
  let lstri = WHLNDXS[(bitlmt - lstwi * WHLODDCRC) | 0];
  let lst = lstwi >> 5; let lstm = lstwi & 31;
  let count = (lst * 32 + 32) * WHLHITS;
  for (let ri = 0; ri < WHLHITS; ++ri) {
    let pln = new Uint32Array(sb[ri].buffer);
    for (let i = 0; i < lst; ++i) {
      let v = pln[i]; count -= CLUT[v & 0xFFFF]; count -= CLUT[v >>> 16];
    }
    let msk = 0xFFFFFFFF << lstm; if (ri <= lstri) msk <<= 1;
    let v = pln[lst] | msk; count -= CLUT[v & 0xFFFF]; count -= CLUT[v >>> 16];
  }
  return count;
}

function fillSieveBuffer(lwi, sb) {
  let mod = (lwi >> 3) % PTRNLEN;
  for (let ri = 0; ri < WHLHITS; ++ri) {
    let pln = sb[ri]; let len = pln.length;
    pln.set(WHLPTRN[ri].subarray(mod, mod + len));
  }
}

let startx = +Date.now();
const cmpsts = makeSieveBuffer(16384 << 3);
const bparr = function () {
  let szbits = (((((((Math.sqrt(LIMIT) | 0) - 23) >> 1) + WHLODDCRC - 1) / WHLODDCRC)
                                                                     + 31) >> 5) << 5;
  let cmpsts = makeSieveBuffer(szbits); fillSieveBuffer(0, cmpsts);
  let ndxdrsds = new Uint32Array(RESIDUES.length - 1);
  for (let i = 0; i < ndxdrsds.length; ++i) ndxdrsds[i] = i >>> 0;
  cullSieveBuffer(0, ndxdrsds, new Uint32Array(ndxdrsds.length), cmpsts);
  let len = countSieveBuffer(szbits * WHLODDCRC - 1, cmpsts);
  let ndxdprms = new Uint32Array(len); let j = 0;
  for (let i = 0; i < szbits; ++i)
    for (let ri = 0; ri < WHLHITS; ++ri)
      if ((cmpsts[ri][i >> 3] & (1 << (i & 7))) == 0) {
        ndxdprms[j++] = ((i << 6) + ri) >>> 0;
      }
  return ndxdprms;
}();
let count = 8; let lwilmt = (LIMIT - 23) / (2 * WHLODDCRC);
let strts = new Uint32Array(bparr.length);
for (let lwi = 0; lwi <= lwilmt; lwi += 131072) {
  let nxti = lwi + 131072;
  fillSieveBuffer(lwi, cmpsts);
  cullSieveBuffer(lwi, bparr, strts, cmpsts);
  if (nxti <= lwilmt) count += countSieveBuffer(131072 * WHLODDCRC - 1, cmpsts);
  else count += countSieveBuffer((LIMIT - 23 - lwi * (WHLODDCRC << 1)) >> 1, cmpsts);
}
let elpsdx = +Date.now() - startx;
console.log("Found " + count + " primes up to " + LIMIT + " in " + elpsdx + " milliseconds.");

The above code only runs about three and a half times faster than the Chapter 3 "odds-only" Page Segmentation code, rather than the expected four times faster, for the following reasons:

  1. The speed-up technique using a special simplified culling loop with a fixed mask pattern is no longer as effective, as there are now hardly any small culling base primes; this increases the average number of clock cycles per composite number culling operation by about 20%, although this applies more to slower languages such as JavaScript than to the more efficient machine code compiling languages, since those can use further techniques such as extreme loop unrolling to reduce the number of CPU clock cycles per culling operation loop to as little as about 1.25 clock cycles.

  2. Although the overhead of counting the resulting primes is reduced by about a factor of two due to there being fewer modulo bit planes (by about that factor), that is not the required factor of four; this is made worse by JavaScript having no means of tapping into the CPU POP_COUNT machine instruction, which would be about ten times faster than the Counting LUT (CLUT) technique used here.

  3. While the LUT techniques used here reduce the start address calculation overhead by a factor of five or so from what it would be with the more complex modulo calculations otherwise needed, these calculations are still about two and a half to three times more complex than those required for the "odds-only" sieve in Chapter 3, so this isn't enough to give us a ratiometric reduction; we would need a technique to reduce this time by a further factor of two or so in order for these calculations not to contribute to the reduced ratio. Believe me, I tried, but haven't been able to make this any better. That said, these calculations are likely to be more efficient in a more efficient computer language than JavaScript and/or on a better CPU than my very low end Atom processor, but then the composite number culling speed is likely to be more efficient as well!

Still, a three and a half times speed-up with only about a 50% increase in the number of lines of code isn't too bad, is it? This JavaScript code, run on newer versions of Node/the Google Chrome browser (Chrome version 75 is still about 25% faster than Firefox version 68), is only three to five times slower (depending on the CPU; closer in performance on higher end CPUs) than Kim Walisch's "primesieve" written in "C" and compiled to x86_64 native code!

Coming in Chapter 4.5b, in about two and a half times more code, will be a Prime Generator capable of sieving extremely large ranges, limited partly by the fact that JavaScript can only efficiently represent integers up to the 64-bit floating point mantissa of 53 bits, or about 9e15, and also by the amount of time one wants to wait: sieving to 1e14 on a better CPU will take on the order of a day, but that isn't much of a problem - just keep a browser tab open!
