Although I have a good LSFR C implementation I thought I\'d try the same in Haskell - just to see how it goes. What I came up with, so far, is two orders of magnitude slower
Up Front Matters
For starters, I'm using GHC 8.0.1 on an Intel I5 ~2.5GHz, linux x86-64.
First Draft: Oh No! The slows!
Your starting code with parameter 25 runs:
% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25 7.25s user 0.50s system 99% cpu 7.748 total
So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Lets dive in.
Issue 1: Shifty Code
I found a couple of oddities with the code. First was the use of shift
in high performance code. Shift supports both left and right shift and to do so it requires a branch. Lets kill that with more readable powers of two and such (shift 1 x
~> 2^x
and shift x 1
~> 2*x
):
% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25 0.64s user 0.00s system 99% cpu 0.637 total
(As you noted in the comments: Yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing and, as a result, much worse code resulted)
Issue 2: Lists Of Bits? Int operations save the day!
One change, one order of magnitude. Yay. What else? Well you have this awkward list of bit locations you're tapping that just seems like its begging for inefficiency and/or leans on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lsfr 24 `xor` testBit lsfr 21
) but we want a more general fast solution.
I propose we compute the mask of all the tap locations then do a one-instruction pop count. To do this we only need a single Int
passed in to advance
instead of a whole list. The popcount instruction requires good assembly generation which requires llvm and probably -optlc-mcpu=native
or another instruction set selection that is non-pessimistic.
This step gives us pc
below. I've folded in the guard-removal of advance
that was mentioned in the comments:
let tp = sum $ map ((2^) . subtract 1) (tap !! len)
pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
mask = 2^len - 1
advance' :: Int -> Int
advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr
out :: Int
out = last $ take (2^len) $ iterate advance' 0
Our resulting performance is:
% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
OK
./so 25 0.06s user 0.00s system 96% cpu 0.067 total
That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings but this is often an educational exercise so I hope you had fun.
Edit: The now-available C++ code takes my system 0.10 (g++ -O3
) and 0.12 (clang++ -O3 -march=native
) seconds, so it seems we've beat our mark by a fair bit.