What's the ideal implementation for the Sieve of Eratosthenes between Lists, Arrays, and Mutable Arrays?

后端未结

关注

 3  373

In Haskell, I\'ve found three simple implementations of the Sieve of Eratosthenes on the Rosetta Code page.

Now my question is, which one should be used in w

相关标签:

3条回答

天命终不由人

2021-01-02 20:41

As has been said, using mutable arrays has the best performance. The following code is derived from this "TemplateHaskell" version converted back to something more in line with the straight mutable Array solution as the "TemplateHaskell" doesn't seem to make any difference, with some further optimizations. I believe it is faster than the usual mutable unboxed array versions due to the further optimizations and especially due to the use of the "unsafeRead" and "unsafeWrite" functions that avoid range checking of the arrays, probably also internally using pointers for array access:

{-# OPTIONS -O2 -optc-O3 #-} import Control.Monad import Control.Monad.ST import Data.Array.ST import Data.Array.Unboxed import Data.Array.Base primesToUA :: Word32-> [Word32] primesToUA top = do let sieveUA top = runSTUArray $ do let m = ((fromIntegral top) - 3) `div` 2 :: Int buf <- newArray (0,m) True -- :: ST s (STUArray s Int Bool) let cullUA i = do let p = i + i + 3 strt = p * (i + 1) + i let cull j = do if j > m then cullUA (i + 1) else do unsafeWrite buf j False cull (j + p) if strt > m then return () else do e <- unsafeRead buf i if e then cull strt else cullUA (i + 1) cullUA 0; return buf if top > 1 then 2 : [2 * (fromIntegral i) + 3 | (i,True) <- assocs $ sieveUA top] else [] main = do x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln print (last (primesToUA x)) -- 0.01 0.02 0.09 1.26 seconds

EDIT: The above code has been corrected and the times below edited to reflect the correction and as well the comments comparing the paged to the non-paged version have been edited.

The times to run this to the indicated top ranges are as shown in the comment table at the bottom of the above source code as measured by ideone.com and are about exactly five times faster than the answer posted by @WillNess as also measured at ideone.com. This code takes a trivial amount of time to cull the primes to two million and only 1.26 seconds to cull to a hundred million. These times are about 2.86 times faster when run on my i7 (3.5 GHz) at 0.44 seconds to a hundred million, and takes 6.81 seconds to run to one billion. The memory use is just over six megabytes for the former and sixty megabytes for the latter, which is the memory used by the huge (bit packed) array. This array also explains the non-linear performance in that as the array size exceeds the CPU cache sizes the average memory access times get worse per composite number representation cull.

EDIT_ADD: A page segmented sieve is more efficient in that it has better memory access efficiency when the buffer size is kept smaller than the L1 or L2 CPU caches, and as well has the advantage that it is unbounded so that the upper range does not have to be specified in advance and a much smaller memory footprint being just the base primes less than the square root of the range used plus the page buffer memory. The following code has been written as a page segmented implementation and is somewhat faster than the non-paged version; it also offers the advantage that one can change the output range specification at the top of the code to 'Word64' from 'Word32' so it is then not limited to the 32-bit number range, at only a slight cost in processing time (for 32-bit compiled code) for any range that is in common. The code is as follows:

-- from http://www.haskell.org/haskellwiki/Prime_numbers#Using_ST_Array {-# OPTIONS -O2 -optc-O3 #-} import Data.Word import Control.Monad import Control.Monad.ST import Data.Array.ST import Data.Array.Unboxed import Data.Array.Base primesUA :: () -> [Word32] primesUA () = do let pgSZBTS = 262144 * 8 let sieveUA low bps = runSTUArray $ do let nxt = (fromIntegral low) + (fromIntegral pgSZBTS) buf <- newArray (0,pgSZBTS - 1) True -- :: ST s (STUArray s Int Bool) let cullUAbase i = do let p = i + i + 3 strt = p * (i + 1) + i when (strt < pgSZBTS) $ do e <- unsafeRead buf i if e then do let loop j = do if j < pgSZBTS then do unsafeWrite buf j False loop (j + p) else cullUAbase (i + 1) loop strt else cullUAbase (i + 1) let cullUA ~(p:t) = do let bp = (fromIntegral p) i = (bp - 3) `div` 2 s = bp * (i + 1) + i when (s < nxt) $ do let strt = do if s >= low then fromIntegral (s - low) else do let b = (low - s) `rem` bp if b == 0 then 0 else fromIntegral (bp - b) let loop j = do if j < pgSZBTS then do unsafeWrite buf j False loop (j + (fromIntegral p)) else cullUA t loop strt if low <= 0 then cullUAbase 0 else cullUA bps return buf let sieveList low bps = do [2 * ((fromIntegral i) + low) + 3 | (i,True) <- assocs $ sieveUA low bps] let sieve low bps = do (sieveList low bps) ++ sieve (low + (fromIntegral pgSZBTS)) bps let primes' = ((sieveList 0 []) ++ sieve (fromIntegral pgSZBTS) primes') :: [Word32] 2 : sieve 0 primes' main = do x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln -- 0.02 0.03 0.13 1.13 seconds print (length (takeWhile ((>=) (fromIntegral x)) (primesUA ())))

The above code has quite a few more lines of code than the non-paged case due to the need to cull the composite number representations from the first page array differently than succeeding pages. The code also has the fixes so that there aren't memory leaks due to the base primes list and the output list now not being the same list (thus avoiding holding onto the whole list in memory).

Note that this code takes close to linear time (over the range sieved) as the range gets larger due to the culling buffer being of a constant size less than the L2 CPU cache. Memory use is a fraction of that used by the non-paged version at just under 600 kilobytes for a hundred million and just over 600 kilobytes for one billion, which slight increase is just the extra space required for the base primes less than the square root of the range list.

On ideone.com this code produces the number of primes to a hundred million in about 1.13 seconds and about 12 seconds to one billion (32-bit setting). Probably wheel factorization and definitely multi-core processing would make it even faster on a multi-core CPU. On my i7 (3.5 GHz), it takes 0.44 seconds to sieve to a hundred million and 4.7 seconds to one billion, with the roughly linear performance with increasing range as expected. It seems that there is some sort of non-linear overhead in the version of GHC run on ideone.com that has some performance penalty for larger ranges that is not present for the i7 and that is perhaps related to different garbage collection, as the page buffers are being created new for each new page. END_EDIT_ADD

EDIT_ADD2: It seems that much of the processing time for the above page segmented code is used in (lazy) list processing, so the code is accordingly reformulated with several improvements as follows:

Implemented a prime counting function that does not use list processing and uses "popCount" table look ups to count the number of 'one' bits in a 32 bit word at a time. In this way, the time to find the results is insignificant compared to the actual sieve culling time.

Stored the base primes as a list of bit packed page segments, which is much more space efficient than storing a list of primes, and the the time to convert the page segments to primes as required is not much of a computational overhead.

Tuned the prime segment make function so that for the initial zero page segment, it uses its own bit pattern as a source page, thus making the composite number cull code shorter and simpler.

The code then becomes as follows:

{-# OPTIONS -O3 -rtsopts #-} -- -fllvm ide.com doesn't support LLVM import Data.Word import Data.Bits import Control.Monad import Control.Monad.ST import Data.Array.ST (runSTUArray) import Data.Array.Unboxed import Data.Array.Base pgSZBTS = (2^18) * 8 :: Int -- size of L2 data cache type PrimeType = Word32 type Chunk = UArray PrimeType Bool -- makes a new page chunk and culls it -- if the base primes list provided is empty then -- it uses the current array as source (for zero page base primes) mkChnk :: Word32 -> [Chunk] -> Chunk mkChnk low bschnks = runSTUArray $ do let nxt = (fromIntegral low) + (fromIntegral pgSZBTS) buf <- nxt `seq` newArray (fromIntegral low, fromIntegral nxt - 1) True let cull ~(p:ps) = let bp = (fromIntegral p) i = (bp - 3) `shiftR` 1 s = bp * (i + 1) + i in let cullp j = do if j >= pgSZBTS then cull ps else do unsafeWrite buf j False cullp (j + (fromIntegral p)) in when (s < nxt) $ do let strt = do if s >= low then fromIntegral (s - low) else do let b = (low - s) `rem` bp if b == 0 then 0 else fromIntegral (bp - b) cullp strt case bschnks of [] -> do bsbf <- unsafeFreezeSTUArray buf cull (listChnkPrms [bsbf]) _ -> cull $ listChnkPrms bschnks return buf -- creates a page chunk list starting at the lw value chnksList :: Word32 -> [Chunk] chnksList lw = mkChnk lw basePrmChnks : chnksList (lw + fromIntegral pgSZBTS) -- converts a page chunk list to a list of primes listChnkPrms :: [Chunk] -> [PrimeType] listChnkPrms [] = [] listChnkPrms ~(hdchnk@(UArray lw _ rng _):tlchnks) = let nxtp i = if i >= rng then [] else if unsafeAt hdchnk i then (case ((lw + fromIntegral i) `shiftL` 1) + 3 of np -> np) : nxtp (i + 1) else nxtp (i + 1) in (hdchnk `seq` lw `seq` nxtp 0) ++ listChnkPrms tlchnks -- the base page chunk list used to cull the higher order pages, -- note that it has special treatment for the zero page. -- It is more space efficient to store this as chunks rather than -- as a list of primes or even a list of deltas (gaps), with the -- extra processing to convert as needed not too much. basePrmChnks :: [Chunk] basePrmChnks = mkChnk 0 [] : chnksList (fromIntegral pgSZBTS) -- the full list of primes could be accessed with the following function. primes :: () -> [PrimeType] primes () = 2 : (listChnkPrms $ chnksList 0) -- a quite fast prime counting up to the given limit using -- chunk processing to avoid innermost list processing loops. countPrimesTo :: PrimeType -> Int countPrimesTo limit = let lmtb = (limit - 3) `div` 2 in let sumChnks acc chnks@(chnk@(UArray lo hi rng _):chnks') = let cnt :: UArray PrimeType Word32 -> Int cnt bfw = case if lmtb < hi then fromIntegral (lmtb - lo) else rng of crng -> case crng `shiftR` 5 of rngw -> let cnt' i ac = ac `seq` if i >= rngw then if (i `shiftL` 5) >= rng then ac else case (-2) `shiftL` fromIntegral (lmtb .&. 31) of msk -> msk `seq` case (unsafeAt bfw rngw) .&. (complement msk) of bts -> bts `seq` case popCount bts of c -> c `seq` case ac + c of nac -> nac else case ac + (popCount $ unsafeAt bfw i) of nacc -> nacc `seq` cnt' (i + 1) (nacc) in cnt' 0 0 in acc `seq` case runST $ do -- make UArray _ Bool into a UArray _ Word32 stbuf <- unsafeThawSTUArray chnk stbufw <- castSTUArray stbuf bufw <- unsafeFreezeSTUArray stbufw return $ cnt bufw of c -> c `seq` case acc + c of nacc -> nacc `seq` if hi >= lmtb then nacc else sumChnks nacc chnks' in if limit < 2 then 0 else if limit < 3 then 1 else lmtb `seq` sumChnks 1 (chnksList 0) main = do x <- read `fmap` getLine -- 1mln 2mln 10mln 100mln 1000mln -- 0.02 0.03 0.06 0.45 4.60 seconds -- 7328 7328 8352 8352 9424 Kilobytes -- this takes 14.34 seconds and 9424 Kilobytes to 3 billion on ideone.com, -- and 9.12 seconds for 3 billion on an i7-2700K (3.5 GHz). -- The above ratio of about 1.6 is much better than usual due to -- the extremely low memory use of the page segmented algorithm. -- It seems thaat the Windows Native Code Generator (NCG) from GHC -- is particularly poor, as the Linux 32-bit version takes -- less than two thirds of the time for exactly the same program... print $ countPrimesTo x -- print $ length $ takeWhile ((>=) x) $ primes () -- the slow way to do this

The times and memory requirements given in the code are as observed when run on ideone.com, with 0.02, 0.03, 0.05, 0.30, 3.0, and 9.1 seconds required to run on my i7-2700K (3.5 GHz) for one, two, ten, a hundred, a thousand (one billion), and three thousand (three billion) million range, respectively, with a pretty much constant memory footprint slowly increasing with the number of base primes less than the square root of the range as required. When compiled with the LLVM compiler back end, these times become 0.01, 0.02, 0.02, 0.12, 1.35, and 4.15 seconds, respectively, due to its more efficient use of registers and machine instructions; this last is quite close to the same speed as if compiled with a 64-bit compiler rather than the 32-bit compiler used as the efficient use of registers means that availability of extra registers doesn't make much difference.

As commented in the code, the ratio between performance on my real machine and the ideone.com servers becomes much less than for much more memory wasteful algorithms due to not being throttled by memory access bottlenecks so the limit to speed is mostly just the ratio of CPU clock speeds as well as CPU processing efficiency per clock cycle. However, as commented there, there is some strange inefficiency with the GHC Native Code Generator (NCG) when run under Windows (32-bit compiler) in that the run times are over 50% slower than if run under Linux (as the ideone.com server uses). AFAIK they both have a common code base for the same Haskell GHC version and the only divergence is in the linker used (which is also used with the LLVM backend, which is not affected) as GHC NCG does not use GCC but only the mingw32 assembler, which should also be the same.

Note that this code when compiled with the LLVM compiler back end is about the same speed as the same algorithm written for highly optimized 'C/C++' implementations indicating that Haskell really has the ability to develop very tight loop coding. It might be said that the Haskell code is quite a bit more readable and secure than equivalent 'C/C++' code once one gets used to the Haskell paradigms of monadic and non-strict code. The further refinements in execution speed for the Sieve of Eratosthenes are purely a function of the tuning of the implementations used and not the choice of language between Haskell and 'C/C++'.

Summary: Of course, this isn't yet the ultimate in speed for a Haskell version of the Sieve of Eratosthenes in that we still haven't tuned the memory access to more efficiently use the fast CPU L1 cache, nor have we significantly reduced the total number of composite culling operations necessary using extreme wheel factorization other than to eliminate the odds processing. However, this is enough to answer the question in showing that mutable arrays are the most efficient way of addressing such tight loop type of problems, with potential speed gains of about 100 times over the use of lists or immutable arrays. END_EDIT_ADD2

0 讨论(0)

发布评论:

提交评论

加载中...

深忆病人

2021-01-02 20:55

This

primes = sieve [2..] where sieve (p:x) = p : sieve [ n | n <- x, n `mod` p > 0 ]

is not a sieve. It's very inefficient trial division. Don't use that!

I'm curious about how you got your times, there is no way that the Turner "sieve" could produce the primes not exceeding 2,000,000 in mere seconds. Letting it find the primes to 200,000 took

MUT time 6.38s ( 6.39s elapsed) GC time 9.19s ( 9.20s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 15.57s ( 15.59s elapsed)

on my box (64-bit Linux, ghc-7.6.1, compiled with -O2). The complexity of that algorithm is O(N² / log² N), almost quadratic. Letting it proceed to 2,000,000 would take about twenty minutes.

Your times for the array versions are suspicious too, though in the other direction. Did you measure interpreted code?

Sieving to 2,000,000, compiled with optimisations, the mutable array code took 0.35 seconds to run, and the immutable array code 0.12 seconds.

Now, that still has the mutable array about three times slower than the immutable array.

But, it's an unfair comparison. For the immutable array, you used Ints, and for the mutable array Integers. Changing the mutable array code to use Ints - as it should, since under the hood, arrays are Int-indexed, so using Integer is an unnecessary performance sacrifice that buys nothing - made the mutable array code run in 0.15 seconds. Close to the mutable array code, but not quite there. However, you let the mutable array do more work, since in the immutable array code you only eliminate odd multiples of the odd primes, but in the mutable array code, you mark all multiples of all primes. Changing the mutable array code to treat 2 specially, and only eliminate odd multiples of odd primes brings that down to 0.12 seconds too.

But, you're using range-checked array indexing, which is slow, and, since the validity of the indices is checked in the code itself, unnecessary. Changing that to using unsafeRead and unsafeWrite brings down the time for the immutable array to 0.09 seconds.

Then you have the problem that using

forM_ [x, y .. z]

uses boxed Ints (fortunately, GHC eliminates the list). Writing a loop yourself, so that only unboxed Int#s are used, the time goes down to 0.02 seconds.

{-# LANGUAGE MonoLocalBinds #-} import Control.Monad (forM_, when) import Control.Monad.ST import Data.Array.ST import Data.Array.Unboxed import Data.Array.Base primeSieve :: Int -> UArray Int Bool primeSieve top = runSTUArray $ do a <- newArray (0,top) True unsafeWrite a 0 False unsafeWrite a 1 False let r = ceiling . sqrt $ fromIntegral top mark step idx | top < idx = return () | otherwise = do unsafeWrite a idx False mark step (idx+step) sift p | r < p = return a | otherwise = do prim <- unsafeRead a p when prim $ mark (2*p) (p*p) sift (p+2) mark 2 4 sift 3 -- Return primes from sieve as list: primesTo :: Int -> [Int] primesTo top = [p | (p,True) <- assocs $ primeSieve top] main :: IO () main = print .last $ primesTo 2000000

So, wrapping up, for a Sieve of Eratosthenes, you should use an array - not surprising, its efficiency depends on being able to step from one multiple to the next in short constant time.

You get very simple and straightforward code with immutable arrays, and that code performs decently for not too high limits (it gets relatively worse for higher limits, since the arrays are still copied and garbage-collected, but that's not too bad).

When you need better performance, you need mutable arrays. Writing efficient mutable array code is not entirely trivial, one has to know how the compiler translates the various constructs to choose the right one, and some would consider such code unidiomatic. But you can also use a library (disclaimer: I wrote it) that provides a fairly efficient implementation rather than writing it yourself.

0 讨论(0)

发布评论:

提交评论

加载中...

天命终不由人

2021-01-02 20:55

Mutable array will always be the winner in terms of performance (and you really should've copied the version that works on odds only as a minimum; it should be the fastest of the three - also because it uses Int and not Integer).

For lists, tree-shaped merging incremental sieve should perform better than the one you show. You can always use it with takeWhile (< limit) if needed. I contend that it conveys the true nature of the sieve most clearly:

import Data.List (unfoldr) primes :: [Int] primes = 2 : _Y ((3 :) . gaps 5 . _U . map (\p -> [p*p, p*p+2*p..])) _Y g = g (_Y g) -- recursion _U ((x:xs):t) = (x :) . union xs . _U -- ~= nub . sort . concat . unfoldr (\(a:b:c) -> Just (union a b, c)) $ t gaps k s@(x:xs) | k < x = k : gaps (k+2) s -- ~= [k,k+2..]\\s, when | otherwise = gaps (k+2) xs -- k<=x && null(s\\[k,k+2..]) union a@(x:xs) b@(y:ys) = case compare x y of -- ~= nub . sort .: (++) LT -> x : union xs b EQ -> x : union xs ys GT -> y : union a ys

_U reimplements Data.List.Ordered.unionAll, and gaps 5 is (minus [5,7..]), fused for efficiency, with minus and union from the same package.

Of course nothing beats the brevity of Data.List.nubBy (((>1).).gcd) [2..] (but it's very slow).

To your 1st new question: not. It does find the multiples by counting up, as any true sieve should _{(although "minus" on lists is of course under-performant; the above improves on that by re-arranging a linear subtraction chain ((((xs-a)-b)-c)- ... ) into a subtraction of tree-folded additions, xs-(a+((b+c)+...))).}

0 讨论(0)

发布评论:

提交评论

加载中...

验证码

看不清?

提交回复