Poor performance with transpose and cumulative sum in Repa

春和景丽 2021-02-07 04:46

I have developed a cumulative sum function as defined below in the Haskell library Repa. However, I have run into an issue when combining this function with the transpose opera

1 answer
  •  攒了一身酷
    2021-02-07 05:20

    From a library implementor's perspective, the way to debug this is to create a wrapper for the suspect operation, then look at the core code to see if fusion has worked.

    -- Main.hs ---------------------------------------------------
    import Solver
    import Data.Array.Repa.IO.BMP
    
    main 
     = do   Right img       <- readImageFromBMP "whatever.bmp"
            print $ cumsumBMP img
    
    -- Solver.hs --------------------------------------------------
    {-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
    module Solver (cumsumBMP) where
    import Data.Array.Repa  as Repa
    import Data.Word
    
    {- all your defs -}
    
    {-# NOINLINE cumsumBMP #-}
    cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
    cumsumBMP img = cumsum $ transpose img
    

    I've put the "solver" code in a separate module, so we only have to wade through the core code for the definitions we care about.

    Compile like:

    touch Solver.hs ; ghc -O2 --make Main.hs \
     -ddump-simpl -dsuppress-module-prefixes -dsuppress-coercions  > dump
    

    Go to the definition of cumsumBMP and search for the letrec keyword. Searching for letrec is a quick way to find the inner loops.

    Not too far down I see this: (slightly reformatted)

    case gen_a1tr
    of _ {
      GenManifest vec_a1tv ->
        case sh2_a1tc  `cast` ... of _ { :. sh3_a1iu  sh4_a1iv ->
        case ix'_a1t9  `cast` ... of _ { :. sh1'_a1iz sh2'_a1iA ->
        case sh3_a1iu  `cast` ... of _ { :. sh5_X1n0  sh6_X1n2 ->
        case sh1'_a1iz `cast` ... of _ { :. sh1'1_X1n9 sh2'1_X1nb ->
        case sh5_X1n0             of _ { :. sh7_X1n8   sh8_X1na ->
        ...
        case sh2'1_X1nb           of _ { I# y3_X1nO ->
        case sh4_a1iv             of _ { I# y4_X1nP ->
        case sh2'_a1iA            of _ { I# y5_X1nX ->
        ...
        let { x3_a1x6 :: Int# [LclId]
          x3_a1x6 =
            +#
              (*#
                 (+#
                    (*#
                       y1_a1iM
                       y2_X1nG)
                    y3_X1nO)
                 y4_X1nP)
              y5_X1nX } in
        case >=#
               x3_a1x6
               0
        of ...
    

    Disaster! The x3_a1x6 binding is clearly doing some useful work (multiplications, additions and suchlike) but it's wrapped in a long series of unboxing operations that are also executed for every loop iteration. What's worse is that it's unboxing the length and width (shape) of the array at every iteration, and this information will always be the same. GHC should really float these case expressions out of the loop, but it doesn't yet. This is an instance of Issue #4081 on the GHC trac, which hopefully will be fixed sometime soon.
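
    To make the problem concrete, here is a minimal sketch in plain Haskell (no Repa) of the same pattern: a loop that consults a boxed shape on every iteration, next to a version where the shape is taken apart once before the loop starts. `Shape2`, `toIndex`, `widthOf`, `sumRowNaive` and `sumRowHoisted` are hypothetical names invented for this illustration and are not part of Repa. On a toy this small GHC may well do the hoisting by itself; the point is only to show the shape of the transformation that is wanted for the Repa loop.

    ```haskell
    {-# LANGUAGE BangPatterns #-}

    -- Hypothetical stand-in for a boxed 2-D shape (height, width).
    -- This is not Repa's real shape type.
    data Shape2 = Shape2 Int Int

    -- Row-major linear index. The pattern match takes the shape apart
    -- every time the function is called.
    toIndex :: Shape2 -> Int -> Int -> Int
    toIndex (Shape2 _ width) row col = row * width + col

    widthOf :: Shape2 -> Int
    widthOf (Shape2 _ width) = width

    -- As written, the loop body inspects the boxed shape on each iteration,
    -- even though the shape never changes while the loop runs.
    sumRowNaive :: Shape2 -> Int -> Int
    sumRowNaive sh row = go 0 0
      where
        go !acc !col
          | col >= widthOf sh = acc
          | otherwise         = go (acc + toIndex sh row col) (col + 1)

    -- Hoisted by hand: the shape is matched once in the argument pattern,
    -- and the loop only touches the already extracted width and row offset.
    sumRowHoisted :: Shape2 -> Int -> Int
    sumRowHoisted (Shape2 _ width) row = go 0 0
      where
        base = row * width
        go !acc !col
          | col >= width = acc
          | otherwise    = go (acc + base + col) (col + 1)
    ```

    Both functions return the same result, for example 38 for row 2 of a `Shape2 3 4` (linear indices 8, 9, 10 and 11); they differ only in where the shape is unpacked.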

    The workaround is to apply deepSeqArray to the incoming array. This places a demand on its value at the top level (outside the loop), which lets GHC know it's OK to move the case matches further up. For a function like cumsumBMP, we also expect the incoming array to already be manifest, so we can add an explicit case match for this:

    {-# NOINLINE cumsumBMP #-}
    cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
    cumsumBMP img@(Array _ [Region RangeAll (GenManifest _)])
      = img `deepSeqArray` cumsum $ transpose img
    

    Compiling again, the inner loop now looks much better:

    letrec {
    $s$wfoldlM'_loop_s2mW [...]
      :: Int# -> Word# -> Word# [...]
    $s$wfoldlM'_loop_s2mW =
      \ (sc_s2mA :: Int#) (sc1_s2mB :: Word#) ->
        case <=# sc_s2mA a_s2ji of _ {
          False -> sc1_s2mB;
          True ->
            $s$wfoldlM'_loop_s2mW
              (+# sc_s2mA 1)
              (narrow8Word#
                 (plusWord#
                    sc1_s2mB
                    (indexWord8Array#
                       rb3_a2gZ
                       (+#
                          rb1_a2gX
                          (+#
                             (*#
                                (+#
                                   (*#
                                      wild19_X1zO
                                      ipv1_X1m5)
                                   sc_s2mA)
                                ipv2_X1m0)
                             wild20_X1Ct)))))
        }; } in
    

    That's a tight, tail-recursive loop that only uses primitive operations. Provided you compile with -fllvm -optlo-O3, there's no reason it shouldn't run as fast as an equivalent C program.
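
    One detail worth reading off the core above: the accumulator is a Word8, and narrow8Word# truncates each partial sum to 8 bits, so the sums wrap modulo 256. The sketch below shows the same arithmetic on plain lists, without Repa. The helper names `sumWord8`, `cumsumWord8` and `sumWide` are made up for this illustration, and the widening variant is only a suggestion in case wrap-around is not what you want.

    ```haskell
    import Data.List (foldl')
    import Data.Word (Word8)

    -- Strict left fold with a Word8 accumulator: every (+) wraps modulo 256,
    -- which corresponds to the narrow8Word# seen in the core.
    sumWord8 :: [Word8] -> Word8
    sumWord8 = foldl' (+) 0

    -- Running sums with the same wrap-around behaviour.
    cumsumWord8 :: [Word8] -> [Word8]
    cumsumWord8 = scanl1 (+)

    -- Widening each element to Int before accumulating avoids the wrap-around.
    sumWide :: [Word8] -> Int
    sumWide = foldl' (\acc x -> acc + fromIntegral x) 0
    ```

    For example, `sumWord8 [200, 100]` is 44 (300 modulo 256), `cumsumWord8 [200, 100, 10]` is `[200, 44, 54]`, and `sumWide [200, 100]` is 300.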

    There's a slight hiccup when running it though:

    desire:tmp benl$ ./Main 
    Main: Solver.hs:(50,1)-(51,45): Non-exhaustive patterns in function cumsumBMP
    

    This just reminds us that we need to force the array before calling cumsumBMP.

    -- Main.hs ---------------------------------------------------
    ...
    import Data.Array.Repa as Repa
    main 
     = do   Right img       <- readImageFromBMP "whatever.bmp"
            print $ cumsumBMP $ Repa.force img
    

    In summary:

    1. You need to add some deepSeqArray and pattern matching goop to your top level functions to work around a current infelicity in GHC. This is demonstrated by the final version of the cumsumBMP function above. If you want GHC HQ to fix this soon then add yourself as a cc to Issue #4081 on the GHC trac. Repa programs will be much prettier when this is fixed.
    2. You don't need to add the goop to every function. In this example I didn't need to touch indexSlice and friends. The general rule is to add the goop to functions that use force, fold or sumAll. These functions instantiate the actual loops that operate over the array data, that is, they convert a delayed array to a manifest value.
    3. The performance of a piece of Repa code is determined as much by the context in which it's used as the actual code. If you pass your top level functions delayed arrays then they will run very slowly. There is more discussion of this in The Repa Tutorial.
    4. BMP files read with the repa-io library aren't pre-forced, so you need to force them before use. This is probably the wrong default, so I'll change it in the next version.
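
    To illustrate points 2 and 3, here is a toy model of the difference between a delayed and a manifest array. It is a deliberately simplified sketch and not Repa's actual representation: `Arr`, `indexArr`, `sizeArr`, `mapArr` and `forceArr` are hypothetical names, and the manifest case uses a boxed `Data.Array` where Repa uses unboxed storage.

    ```haskell
    import Data.Array (Array, listArray, (!))

    -- Toy model only: a delayed array is a size plus an element function,
    -- a manifest array is a size plus materialised elements.
    data Arr
      = Delayed  Int (Int -> Int)
      | Manifest Int (Array Int Int)

    sizeArr :: Arr -> Int
    sizeArr (Delayed  n _) = n
    sizeArr (Manifest n _) = n

    -- Indexing a delayed array re-runs its element function on every access;
    -- indexing a manifest array is a plain lookup.
    indexArr :: Arr -> Int -> Int
    indexArr (Delayed  _ f)  i = f i
    indexArr (Manifest _ xs) i = xs ! i

    -- Mapping never loops over the data: it just builds a new delayed array.
    mapArr :: (Int -> Int) -> Arr -> Arr
    mapArr g arr = Delayed (sizeArr arr) (g . indexArr arr)

    -- Forcing is where the actual loop happens: each element is computed
    -- once and stored.
    forceArr :: Arr -> Arr
    forceArr arr = Manifest n (listArray (0, n - 1) [indexArr arr i | i <- [0 .. n - 1]])
      where
        n = sizeArr arr
    ```

    In this model, a chain such as `mapArr (+ 1) (mapArr (* 2) (Delayed 4 id))` does no work until something indexes it, and every index re-runs the whole chain. Passing `forceArr` of that value to a consumer instead means each element is computed exactly once, which is the situation the top-level functions above expect.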
