I have developed a cumulative sum function as defined below in the Haskell library Repa. However, I have run into an issue when combining this function with the transpose opera
From a library implementor's perspective, the way to debug this is to create a wrapper for the suspect operation, then look at the core code to see if fusion has worked.
-- Main.hs ---------------------------------------------------
import Solver
import Data.Array.Repa.IO.BMP
main
= do Right img <- readImageFromBMP "whatever.bmp"
print $ cumsumBMP img
-- Solver.hs --------------------------------------------------
{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
module Solver (cumsumBMP) where
import Data.Array.Repa as Repa
import Data.Word
{- all your defs -}
{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img = cumsum $ transpose img
I've put the "solver" code in a separate module, so we only have to wade through the core code for the definitions we care about.
Compile like:
touch Solver.hs ; ghc -O2 --make Main.hs \
-ddump-simpl -dsuppress-module-prefixes -dsuppress-coercions > dump
Go to the definition of cumsumBMP
and search for the letrec
keyword. Searching for letrec
is a quick way to find the inner loops.
Not too far down I see this: (slightly reformatted)
case gen_a1tr
of _ {
GenManifest vec_a1tv ->
case sh2_a1tc `cast` ... of _ { :. sh3_a1iu sh4_a1iv ->
case ix'_a1t9 `cast` ... of _ { :. sh1'_a1iz sh2'_a1iA ->
case sh3_a1iu `cast` ... of _ { :. sh5_X1n0 sh6_X1n2 ->
case sh1'_a1iz `cast` ... of _ { :. sh1'1_X1n9 sh2'1_X1nb ->
case sh5_X1n0 of _ { :. sh7_X1n8 sh8_X1na ->
...
case sh2'1_X1nb of _ { I# y3_X1nO ->
case sh4_a1iv of _ { I# y4_X1nP ->
case sh2'_a1iA of _ { I# y5_X1nX ->
...
let { x3_a1x6 :: Int# [LclId]
x3_a1x6 =
+#
(*#
(+#
(*#
y1_a1iM
y2_X1nG)
y3_X1nO)
y4_X1nP)
y5_X1nX } in
case >=#
x3_a1x6
0
of ...
Disaster! The x3_a1x6
binding is clearly doing some useful work (multiplications, additions and suchlike) but it's wrapped in a long series of unboxing operations that are also executed for every loop iteration. What's worse is that it's unboxing the length and width (shape) of the array at every iteration, and this information will always be the same. GHC should really float these case expressions out of the loop, but it doesn't yet. This is an instance of Issue #4081 on the GHC trac, which hopefully will be fixed sometime soon.
The work around is to apply deepSeqArray
to the incoming array. This places a demand on its value at the top level (outside the loop) which lets GHC know it's ok to move the case matches further up. For a function like cumsumBMP
, we also expect the incoming array to already be manifest, so we can add an explicit case match for this:
{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img@(Array _ [Region RangeAll (GenManifest _)])
= img `deepSeqArray` cumsum $ transpose img
Compiling again, the inner loop now looks much better:
letrec {
$s$wfoldlM'_loop_s2mW [...]
:: Int# -> Word# -> Word# [...]
$s$wfoldlM'_loop_s2mW =
\ (sc_s2mA :: Int#) (sc1_s2mB :: Word#) ->
case <=# sc_s2mA a_s2ji of _ {
False -> sc1_s2mB;
True ->
$s$wfoldlM'_loop_s2mW
(+# sc_s2mA 1)
(narrow8Word#
(plusWord#
sc1_s2mB
(indexWord8Array#
rb3_a2gZ
(+#
rb1_a2gX
(+#
(*#
(+#
(*#
wild19_X1zO
ipv1_X1m5)
sc_s2mA)
ipv2_X1m0)
wild20_X1Ct)))))
}; } in
That's a tight, tail recursive loop that only uses primitive operations. Provided you compile with -fllvm -optlo-O3
, there's no reason that won't run as fast as an equivalent C program.
There's a slight hiccup when running it though:
desire:tmp benl$ ./Main
Main: Solver.hs:(50,1)-(51,45): Non-exhaustive patterns in function cumsumBMP
This just reminds us that we need to force the array before calling cumsumBMP
.
-- Main.hs ---------------------------------------------------
...
import Data.Array.Repa as Repa
main
= do Right img <- readImageFromBMP "whatever.bmp"
print $ cumsumBMP $ Repa.force img
In summary:
deepSeqArray
and pattern matching goop to your top level
functions to work around a current infelicity in GHC. This is demonstrated by
the final version of the cumsumBMP
function above. If you want GHC HQ to fix
this soon then add yourself as a cc to Issue #4081 on the GHC trac. Repa programs will be much prettier when this is fixed.indexSlice
and friends. The general rule is to add the goop to functions that use force
, fold
or sumAll
. These functions instantiate the actual loops that operate over the array data, that is, they convert a delayed array to a manifest value.