How to make my Haskell program faster? Comparison with C

后端未结

关注

 4  1291

I\'m working on an implementation of one of the SHA3 candidates, JH. I\'m at the point where the algorithm pass all KATs (Known Answer Tests) provided by NIST, and have also mad

相关标签:

4条回答

醉话见心

2021-01-31 17:26

The lower graph shows that a lot of memory is occupied by lists. Unless there are more lurking in other modules, they can only come from e8. Maybe you'll have to bite the bullet and make that a loop instead of a fold, but for starters, since Block1024 is a pair, the foldl' doesn't do much evaluation on the fly (unless the strictness analyser has become significantly better). Try making that stricter, data Block1024 = B1024 !Block512 !Block512, perhaps it also needs {-# UNPACK #-} pragmas. In roundFunction, use rem instead of mod (this will only have minor impact, but it's a bit faster) and make the let bindings strict. In the swapN functions, you might get better performance giving the constants in the form W x y rather than as 128-bit hex numbers. I can't guarantee those changes will help, but that's what looks most promising after a short glance.

0 讨论(0)
发布评论:

提交评论
- 加载中...
伪装坚强ぢ

2021-01-31 17:26
It looks like you did a fair amount of tweaking already; I'm curious what the performance is like without explicit strictness annotations (BangPatterns) and the various compiler pragmas (UNPACK, INLINE)... Also, a dumb question: what optimization flags are you using?

Anyway, two suggestions which may be completely awful:
1. Use unboxed primitive types where you can (e.g. replace Data.Word.Word64 with GHC.Word.Word64#, make sure word128Shift is using Int#, etc.) to avoid heap allocation. This is, of course, non-portable.
2. Try Data.Sequence instead of []
At any rate, rather than looking at the Core output, try looking at the intermediate C files (*.hc) instead. It can be hard to wade through, but sometimes makes it obvious where the compiler wasn't quite as sharp as you'd hoped.
0 讨论(0)
发布评论:

提交评论
- 加载中...
囚心锁ツ

2021-01-31 17:29
- Switch to unboxed Vectors (from Array, used for constants)
- Use unsafeIndex instead of incurring the bounds check and data dependency from safe indexing (i.e. !)
- Unpack Block1024 as you did with Block512 (or at least use UnboxedTuples)
- Use unsafeShift{R,L} so you don't incur the check on the shift value (coming in GHC 7.4)
- Unfold the roundFunction so you have one rather ugly and verbose e8 function. This was significat in pureMD5 (the rolled version was prettier but massively slower than the unrolled version). You might be able to use TH to do this and keep the code smallish. If you do this then you'll have no need for constants as these values will be explicit in the code and result in a more cache friendly binary.
- Unpack your Word128 values.
- Define your own addition for Word128, don't lift Integer. See LargeWord for an example of how this can be done.
- rem not mod
- Compile with optimization (-O2) and try llvm (-fllvm)
EDIT: And cabalize your git repo along with a benchmark so we can help you easier ;-). Good work on including a crypto-api instance.
0 讨论(0)
发布评论:

提交评论
- 加载中...
有刺的猬

2021-01-31 17:31
Ok, so I thought I would chime in with an update of what I have done and the results obtained thus far. Changes made:
- Switched from Array to UnboxedArray (made Word128 an instance type)
- Used UnboxedArray + fold in e8 instead of lists and (prelude) fold
- Used unsafeIndex instead of !
- Changed type of Block1024 to a real datatype (similiar to Block512), and unpacked its arguments
- Updated GHC to version 7.2.1 on Arch Linux, thus fixing the problem with compiling via C or LLVM
- Switched mod to rem in some places, but NOT in roundFunction. When I do it there, the compile time suddenly takes an awful lot of time, and the run time becomes 10 times slower! Does anyone know why that may be? It is only happening with GHC-7.2.1, not GHC-7.0.3
I compile with the following options:

ghc-7.2.1 --make -O2 -funbox-strict-fields main.hs ./Tests/testframe.hs -fvia-C -optc-O2

And the results? Roughly 50 % reduction in time. On an input of ~107 MB, the code now use 3 minutes as compared to the previous 6-7 minutes. The C version uses 42 seconds.

Things I tried, but which didn't result in better performance:
- Unrolled the e8 function like this:
  e8 !h = go h 0
  
  where go !x !n
```
      | n == 42   = x
      | otherwise = go h' (n + 1)
      where !h' = roundFunction x n
```
- Tried breaking up the swapN functions to use the underlying Word64' directly:
  swap1 (W xh hl) =
```
     shiftL (W (xh .&. 0x5555555555555555) (xl .&. 0x5555555555555555)) 1 
     .|. 
     shiftR (W (xh .&. 0xaaaaaaaaaaaaaaaa) (xl .&. 0xaaaaaaaaaaaaaaaa)) 1 
```
- Tried using the LLVM backend
All of these attempts gave worse performance than what I have currently. I don't know if thats because I'm doing it wrong (especially the unrolling of e8), or because they just are worse options.

Still I have some new questions with these new tweaks.
1. Suddenly I have gotten this peculiar bump in memory usage. Take a look at following heap profiles:
  
  Why has this happened? Is it because of the UnboxedArray? And what does SYSTEM mean?
2. When I compile via C I get the following warning:
  
  Warning: The -fvia-C flag does nothing; it will be removed in a future GHC release
  
  Is this true? Why then, do I see better performance using it, rather than not?
0 讨论(0)
发布评论:

提交评论
- 加载中...