Efficient bit-fiddling in an LFSR implementation

梦谈多话 2021-01-05 11:00

Although I have a good LFSR implementation in C, I thought I'd try the same in Haskell, just to see how it goes. What I came up with so far is two orders of magnitude slower

3 answers
  •  花落未央
    2021-01-05 11:31

    Up Front Matters

    For starters, I'm using GHC 8.0.1 on an Intel i5 at ~2.5 GHz, Linux x86-64.

    First Draft: Oh No! The slows!

    Your starting code with parameter 25 runs:

    % ghc -O2 orig.hs && time ./orig 25
    [1 of 1] Compiling Main             ( orig.hs, orig.o )
    Linking orig ...
    OK
    ./orig 25  7.25s user 0.50s system 99% cpu 7.748 total
    

    So the time to beat is 77ms, two orders of magnitude better than this Haskell code. Let's dive in.

    Issue 1: Shifty Code

    I found a couple of oddities in the code. The first was the use of shift in high-performance code. shift supports both left and right shifts, and to do so it requires a branch on the sign of the shift amount. Let's kill that with more readable powers of two and such (shift 1 x ~> 2^x and shift x 1 ~> 2*x):

    % ghc -O2 noShift.hs && time ./noShift 25
    [1 of 1] Compiling Main             ( noShift.hs, noShift.o )
    Linking noShift ...
    OK
    ./noShift 25  0.64s user 0.00s system 99% cpu 0.637 total
    

    (As you noted in the comments: yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing, which resulted in much worse code.)
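
A minimal sketch of the rewrite described above, checking that the spellings agree on non-negative shift amounts. The function names are hypothetical. shiftL and unsafeShiftL are not used in the answer; they are shown only as left-only alternatives from Data.Bits, and whether they compile to better code than 2^x was not measured here.

```haskell
import Data.Bits (shift, shiftL, unsafeShiftL)

-- Different spellings of "1 shifted left by n" (names are hypothetical).
viaShift, viaShiftL, viaUnsafe, viaPower :: Int -> Int
viaShift  n = shift 1 n          -- generic: direction depends on the sign of n
viaShiftL n = 1 `shiftL` n       -- left shift only
viaUnsafe n = 1 `unsafeShiftL` n -- left shift only; n must be in [0, word size)
viaPower  n = 2 ^ n              -- the rewrite used in this answer

main :: IO ()
main = do
  let ns = [0 .. 30]
  -- All spellings agree for shift amounts well inside the word size.
  print (all (\n -> viaShift n == viaPower n) ns)
  print (all (\n -> viaShiftL n == viaPower n && viaUnsafe n == viaPower n) ns)
  -- Doubling is the same as shifting left by one.
  print (all (\x -> shift x 1 == 2 * x) [0 .. 1000 :: Int])
```

Each line should print True. The check says nothing about speed; it only confirms the rewrite preserves results for the inputs an LFSR of this size would use.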

    Issue 2: Lists Of Bits? Int operations save the day!

    One change, one order of magnitude. Yay. What else? Well, you have this awkward list of bit locations you're tapping that just seems like it's begging for inefficiency and/or leans on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lfsr 24 `xor` testBit lfsr 21), but we want a more general fast solution.

    I propose we compute a mask of all the tap locations, AND it with the register, and then do a one-instruction pop count. To do this we only need to pass a single Int in to advance instead of a whole list. Getting a real popcount instruction requires good assembly generation, which requires LLVM (-fllvm) and probably -optlc-mcpu=native or another instruction set selection that is non-pessimistic.
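
The equivalence this relies on can be sketched as follows: XOR-ing the tapped bits one at a time gives the same answer as testing whether the pop count of the masked register is odd. The tap list [25, 22] is an assumption chosen to match the bits 24 and 21 mentioned above, and the function names are hypothetical.

```haskell
import Data.Bits (popCount, testBit, xor, (.&.))

-- Assumed tap positions (1-based), matching bits 24 and 21 from the text.
taps :: [Int]
taps = [25, 22]

-- List-based feedback: XOR the tapped bits together one at a time.
parityViaList :: Int -> Bool
parityViaList lfsr = foldr xor False [testBit lfsr (t - 1) | t <- taps]

-- One mask with a bit set for every tap position.
tapMask :: Int
tapMask = sum (map ((2 ^) . subtract 1) taps)

-- Mask-based feedback: a single AND followed by a pop count.
parityViaPopCount :: Int -> Bool
parityViaPopCount lfsr = odd (popCount (lfsr .&. tapMask))

-- Sample registers covering every combination of bits 20..24.
samples :: [Int]
samples = map (* (2 ^ (20 :: Int))) [0 .. 31]

main :: IO ()
main = print (all (\x -> parityViaList x == parityViaPopCount x) samples)
```

Note that the pc function in this answer uses even rather than odd, so it yields the complement of this XOR.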

    This step gives us pc below. I've folded in the guard-removal of advance that was mentioned in the comments:

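    -- tp: bit mask with one bit set per tap position (tap positions are 1-based)
    let tp = sum $ map ((2^) . subtract 1) (tap !! len)
        -- pc: feedback bit, 1 when an even number of tapped bits are set
        pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
        -- mask: keeps the register at len bits after the left shift
        mask = 2^len - 1
        advance' :: Int -> Int
        advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr 
        -- out: the state after 2^len - 1 steps, starting from 0
        out :: Int
        out = last $ take (2^len) $ iterate advance' 0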
    

    Our resulting performance is:

    % ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25      
    [1 of 1] Compiling Main             ( so.hs, so.o )
    Linking so ...
    OK
    ./so 25  0.06s user 0.00s system 96% cpu 0.067 total
    

    That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings, but this is often an educational exercise, so I hope you had fun.

    Edit: The now-available C++ code takes 0.10 seconds (g++ -O3) and 0.12 seconds (clang++ -O3 -march=native) on my system, so it seems we've beaten our mark by a fair bit.
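
For reference, the mask-and-popcount step can be assembled into a small self-contained program. This is a sketch under stated assumptions, not the benchmarked code: the question's tap table is not shown in this post, so runLfsr takes the taps as an argument, and the tap lists [4, 3] and [5, 3] are small illustrative choices.

```haskell
import Data.Bits (popCount, (.&.), (.|.))

-- State after 2^len - 1 steps, starting from 0 (hypothetical helper name).
-- len is the register width in bits; taps are 1-based bit positions.
runLfsr :: Int -> [Int] -> Int
runLfsr len taps = last (take (2 ^ len) (iterate advance' 0))
  where
    -- One mask bit per tap position.
    tp :: Int
    tp = sum (map ((2 ^) . subtract 1) taps)

    -- Keeps the register at len bits after doubling.
    mask :: Int
    mask = 2 ^ len - 1

    -- Feedback bit: 1 when an even number of tapped bits are set.
    pc :: Int -> Int
    pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))

    advance' :: Int -> Int
    advance' lfsr = ((2 * lfsr) .&. mask) .|. pc lfsr

main :: IO ()
main = do
  print (runLfsr 4 [4, 3])
  print (runLfsr 5 [5, 3])
```

Both lines print 0: with these taps the register walks through every state except all-ones and returns to its starting value after 2^len - 1 steps. The timings above still depend on compiling with -O2 -fllvm -optlc-mcpu=native as described.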
