Optimizing Haskell Inner Loops

悲&欢浪女 2021-02-20 17:09

I'm still working on my SHA1 implementation in Haskell. I now have a working implementation, and this is the inner loop:

iterateBlock' :: Int -> [Word32] ->


        
1 answer
  •  粉色の甜心
    2021-02-20 17:55

    Looking at the Core produced by ghc-7.2.2, the inlining works out well. What doesn't work so well is that in each iteration a couple of Word32 values are first unboxed to perform the work and then reboxed for the next iteration. Unboxing and reboxing can cost a surprisingly large amount of time (and allocation). You can probably avoid that by using Word instead of Word32. You couldn't use rotate from Data.Bits then, because on 64-bit systems it would rotate through all 64 bits; you would have to implement a 32-bit rotation yourself (not hard). For a' you would also have to mask out the high bits manually.
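A minimal sketch of the hand-written rotation mentioned above, assuming the 32-bit value is kept in the low bits of a Word. The names mask32 and rotl32 are made up for illustration and do not come from the question's code.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word)

-- Keep only the low 32 bits of a machine word (hypothetical helper).
mask32 :: Word -> Word
mask32 x = x .&. 0xFFFFFFFF

-- Rotate the low 32 bits of x left by n places, for 0 < n < 32.
-- Assumes the bits above bit 31 are zero on entry; the final mask
-- discards whatever the left shift pushed past bit 31.
rotl32 :: Word -> Int -> Word
rotl32 x n = mask32 ((x `shiftL` n) .|. (x `shiftR` (32 - n)))
```

On a 32-bit system the mask is a no-op, so the same code should work on both word sizes.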

    Another point that looks suboptimal is that in each iteration t is compared to 19, 39 and 59 (if it's large enough), so the loop body contains four branches. It will probably be faster if you split iterateBlock' into four loops (0-19, 20-39, 40-59, 60-79) and use constants k1, ..., k4 and four functions f1, ..., f4 (without the t parameter). That avoids the branches and gives each loop a smaller code size.
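The split into four loops could look roughly like the sketch below. The round functions and constants are the standard SHA-1 ones. The St type, rounds and eightyRounds are hypothetical names, since the question's own iterateBlock' is not shown in full. The sketch still takes the schedule as a list and leaves out the final addition of the result to the previous hash value.

```haskell
import Data.Bits (complement, rotateL, xor, (.&.), (.|.))
import Data.Word (Word32)

-- Working state (a, b, c, d, e); strict fields let GHC unbox them.
data St = St !Word32 !Word32 !Word32 !Word32 !Word32
  deriving (Eq, Show)

-- SHA-1 round functions, one per group of 20 rounds (no t parameter).
f1, f2, f3, f4 :: Word32 -> Word32 -> Word32 -> Word32
f1 b c d = (b .&. c) .|. (complement b .&. d)
f2 b c d = b `xor` c `xor` d
f3 b c d = (b .&. c) .|. (b .&. d) .|. (c .&. d)
f4 = f2

-- SHA-1 round constants, one per group.
k1, k2, k3, k4 :: Word32
k1 = 0x5A827999
k2 = 0x6ED9EBA1
k3 = 0x8F1BBCDC
k4 = 0xCA62C1D6

-- One loop over the schedule words of a single group. f and k are fixed
-- for the whole loop, so the body never compares t against 19, 39 or 59.
rounds :: (Word32 -> Word32 -> Word32 -> Word32) -> Word32 -> [Word32] -> St -> St
rounds f k = go
  where
    go [] s = s
    go (w : ws) (St a b c d e) =
      go ws (St (rotateL a 5 + f b c d + e + k + w) a (rotateL b 30) c d)
{-# INLINE rounds #-}

-- Chain the four loops over an 80-word schedule (hypothetical driver).
eightyRounds :: [Word32] -> St -> St
eightyRounds ws =
  rounds f4 k4 w4 . rounds f3 k3 w3 . rounds f2 k2 w2 . rounds f1 k1 w1
  where
    (w1, r1) = splitAt 20 ws
    (w2, r2) = splitAt 20 r1
    (w3, w4) = splitAt 20 r2
```

The INLINE pragma is there in the hope that GHC specialises each of the four calls to its own f and k; whether it does is worth checking in the Core.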

    And, as Thomas said, using a list for the block data isn't optimal; an unboxed Word array/vector would probably help too.
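As one way to get the block data out of a list, here is a sketch that expands a 16-word block into the 80-word SHA-1 message schedule inside an unboxed array from the array package. The function name schedule is an assumption, and the question's code may build its schedule differently.

```haskell
import Control.Monad (forM_)
import Data.Array.ST (newArray, readArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, bounds, (!))
import Data.Bits (rotateL, xor)
import Data.Word (Word32)

-- Expand a 16-word block into the 80-word message schedule.
-- The result is unboxed, so indexing yields raw Word32 values
-- instead of pointers to list cells.
schedule :: [Word32] -> UArray Int Word32
schedule block = runSTUArray $ do
  w <- newArray (0, 79) 0
  -- Copy the 16 input words.
  forM_ (zip [0 .. 15] block) $ \(i, x) -> writeArray w i x
  -- W[t] = rotl1 (W[t-3] xor W[t-8] xor W[t-14] xor W[t-16])
  forM_ [16 .. 79] $ \t -> do
    a <- readArray w (t - 3)
    b <- readArray w (t - 8)
    c <- readArray w (t - 14)
    d <- readArray w (t - 16)
    writeArray w t ((a `xor` b `xor` c `xor` d) `rotateL` 1)
  return w
```

The round loops would then read schedule block ! t instead of walking a list. An unboxed vector from the vector package would be an equally reasonable choice.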

    With the bang patterns, the core looks much better. Two or three less-than-ideal points remain.

                          (GHC.Prim.narrow32Word#
                             (GHC.Prim.plusWord#
                                (GHC.Prim.narrow32Word#
                                   (GHC.Prim.plusWord#
                                      (GHC.Prim.narrow32Word#
                                         (GHC.Prim.plusWord#
                                            (GHC.Prim.narrow32Word#
                                               (GHC.Prim.plusWord#
                                                  (GHC.Prim.narrow32Word#
                                                     (GHC.Prim.or#
                                                        (GHC.Prim.uncheckedShiftL# sc2_sEn 5)
                                                        (GHC.Prim.uncheckedShiftRL# sc2_sEn 27)))
                                                  y#_aBw))
                                            sc6_sEr))
                                      y#1_XCZ))
                                y#2_XD6))
    

    See all these narrow32Word# calls? They're cheap, but not free. Only the outermost one is needed, so there may be a bit to gain by hand-coding the steps and using Word.
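To illustrate why only the outermost narrowing matters, here is a sketch of the a' computation done in Word with a single mask at the end. The name stepA and its argument order are hypothetical; fbcd stands for the already computed f b c d value.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word)

-- New value of a for one round, computed in Word.
-- Assumes every argument has its bits above bit 31 cleared.
-- Carries in an addition only travel upward, so garbage above
-- bit 31 never reaches the low 32 bits and one final mask suffices.
stepA :: Word -> Word -> Word -> Word -> Word -> Word
stepA a fbcd e k w = (rot5 + fbcd + e + k + w) .&. 0xFFFFFFFF
  where
    -- Left unmasked on purpose: the excess high bits are dropped by the
    -- mask on the sum.
    rot5 = (a `shiftL` 5) .|. (a `shiftR` 27)
```

For example, stepA 1 2 3 4 5 gives 46, the same as the Word32 version would.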

    Then there are the comparisons of t with 19, 39 and 59. They appear twice, once to determine the k constant and once for the f transform. The comparisons alone are cheap, but they cause branches, and without them further inlining may be possible. I expect a bit could be gained here too.

    And there is still the list. It means w can't be unboxed; the Core could be simpler if w were unboxable.
