Haskell math performance on multiply-add operation

前端未结

关注

 2  1741

I\'m writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one parti

相关标签:

2条回答

情书的邮戳

2021-01-30 18:18

Roman Leschinkskiy responds:

Actually, the core looks mostly ok to me. Using unsafeIndex instead of (!) makes the program more than twice as fast (see my answer above). The program below is much faster, though (and cleaner, IMO). I suspect the remaining difference between this and the C program is due to GHC's general suckiness when it comes to floating point. The HEAD produces the best results with the NCG and -msse2

First, define a new Vec4 data type:

{-# LANGUAGE BangPatterns #-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types

-- Define a 4 element vector type
data Vec4 = Vec4 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat

Ensure we can store it in an array

instance Storable Vec4 where
  sizeOf _ = sizeOf (undefined :: CFloat) * 4
  alignment _ = alignment (undefined :: CFloat)

  {-# INLINE peek #-}
  peek p = do
             a <- peekElemOff q 0
             b <- peekElemOff q 1
             c <- peekElemOff q 2
             d <- peekElemOff q 3
             return (Vec4 a b c d)
    where
      q = castPtr p
  {-# INLINE poke #-}
  poke p (Vec4 a b c d) = do
             pokeElemOff q 0 a
             pokeElemOff q 1 b
             pokeElemOff q 2 c
             pokeElemOff q 3 d
    where
      q = castPtr p

Values and methods on this type:

a = Vec4 0.2 0.1 0.6 1.0
m = Vec4 0.99 0.7 0.8 0.6

add :: Vec4 -> Vec4 -> Vec4
{-# INLINE add #-}
add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')

mult :: Vec4 -> Vec4 -> Vec4
{-# INLINE mult #-}
mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')

vsum :: Vec4 -> CFloat
{-# INLINE vsum #-}
vsum (Vec4 a b c d) = a+b+c+d

multList :: Int -> Vector Vec4 -> Vector Vec4
multList !count !src
    | count <= 0    = src
    | otherwise     = multList (count-1) $ V.map (\v -> add (mult v m) a) src

main = do
    print $ Data.Vector.Storable.sum
          $ Data.Vector.Storable.map vsum
          $ multList repCount
          $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)

repCount, arraySize :: Int
repCount = 10000
arraySize = 20000

With ghc 6.12.1, -O2 -fasm:

1.752

With ghc HEAD (june 26), -O2 -fasm -msse2

1.708

This looks like the most idiomatic way to write a Vec4 array, and gets the best performance (11x faster than your original). (And this might become a benchmark for GHC's LLVM backend)

0 讨论(0)

轮回少年

2021-01-30 18:23

Well, this is better. 3.5s instead of 14s.

{-# LANGUAGE BangPatterns #-}
{-

-- multiply-add of four floats,
Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
    vecList[i] = vecList[i] * multiplier + addend;

-}

import qualified Data.Vector.Storable as V
import Data.Vector.Storable (Vector)
import Data.Bits

repCount, arraySize :: Int
repCount = 10000
arraySize = 20000

a, m :: Vector Float
a = V.fromList [0.2,  0.1, 0.6, 1.0]
m = V.fromList [0.99, 0.7, 0.8, 0.6]

multAdd :: Int -> Float -> Float
multAdd i v = v * (m `V.unsafeIndex` (i .&. 3)) + (a `V.unsafeIndex` (i .&. 3))

go :: Int -> Vector Float -> Vector Float
go n s
    | n <= 0    = s
    | otherwise = go (n-1) (f s)
  where
    f = V.imap multAdd

main = print . V.sum $ go repCount v
  where
    v :: Vector Float
    v = V.replicate (arraySize * 4) 0
            -- ^ a flattened Vec4f []

Which is better than it was:

$ ghc -O2 --make A.hs
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

$ time ./A
516748.13
./A  3.58s user 0.01s system 99% cpu 3.593 total

multAdd compiles just fine:

        case readFloatOffAddr#
               rb_aVn
               (word2Int#
                  (and# (int2Word# sc1_s1Yx) __word 3))
               realWorld#
        of _ { (# s25_X1Tb, x4_X1Te #) ->
        case readFloatOffAddr#
               rb11_X118
               (word2Int#
                  (and# (int2Word# sc1_s1Yx) __word 3))
               realWorld#
        of _ { (# s26_X1WO, x5_X20B #) ->
        case writeFloatOffAddr#
               @ RealWorld
               a17_s1Oe
               sc3_s1Yz
               (plusFloat#
                  (timesFloat# x3_X1Qz x4_X1Te) x5_X20B)

However, you're doing 4-element at a time multiplies in the C code, so we'll need to do that directly, rather than faking it by looping and masking. GCC is probably unrolling the loop, too.

So to get identical performance, we'd need the vector multiply (a bit hard, possibly via the LLVM backend) and unroll the loop (possibly fusing it). I'll defer to Roman here to see if there's other obvious things.

One idea might be to actually use a Vector Vec4, rather than flattening it.

0 讨论(0)