I was reading an article about how slow Haskell is at playing with the Collatz conjecture, which basically states that if you keep multiplying an odd number by three and adding one, or dividing an even number by two, you will eventually reach one.
The performance of this program depends on several factors. If we get all of them right, the performance becomes the same as that of the C program. Going through these factors:
1. Using and comparing the right word sizes
The posted C code snippet is not exactly right; it uses 32-bit integers on all architectures, while Haskell `Int`s are 64-bit on a 64-bit machine. Before anything else, we should make sure to use the same word size in both programs.
Also, we should always use native-sized integral types in our Haskell code. So if we're on a 64-bit system, we should use 64-bit numbers and steer clear of `Int32`s and `Word32`s unless there is a specific need for them. This is because operations on non-native integers are mostly implemented as foreign calls rather than primops, so they're significantly slower.
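A quick way to check which sizes you are actually getting is `finiteBitSize` from `Data.Bits`. This is only a sanity-check sketch: it prints the width of the native `Int`/`Word` next to a few fixed-width types. On the C side, the matching change would be to use a 64-bit type such as `int64_t` from `<stdint.h>` instead of `int`.

```haskell
import Data.Bits (finiteBitSize)
import Data.Int (Int32, Int64)
import Data.Word (Word32)

main :: IO ()
main = do
  -- Native-sized types: 64 bits on a 64-bit GHC, 32 bits on a 32-bit one
  print (finiteBitSize (0 :: Int), finiteBitSize (0 :: Word))
  -- Fixed-width types have the same size on every platform: prints (32,32,64)
  print (finiteBitSize (0 :: Int32), finiteBitSize (0 :: Word32), finiteBitSize (0 :: Int64))
```

If the first line prints `(64,64)`, the Haskell `Int` should be compared against a 64-bit C integer.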
2. Division in collatzNext
`div` is slower than `quot` for `Int`, because `div` rounds toward negative infinity while `quot` truncates toward zero, so `div` needs extra work to handle negative operands. If we use `div` and switch to `Word`, the program gets faster, because `div` is the same as `quot` for `Word`. `quot` with `Int` works just as well. However, this is still not as fast as C. We can divide by two by shifting bits to the right. For some reason not even LLVM does this particular strength reduction in this example, so we're best off doing it by hand, replacing `quot n 2` with `shiftR n 1`. (The two only agree for non-negative `n`, which is the case for the Collatz values here.)
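The differences between `div`, `quot` and `shiftR` are easy to see on a negative number. This small sketch prints them side by side; the value `-7` is just an arbitrary example.

```haskell
import Data.Bits (shiftR)

main :: IO ()
main = do
  let m = -7 :: Int
  -- div rounds toward negative infinity, quot truncates toward zero: (-4,-3)
  print (m `div` 2, m `quot` 2)
  -- For Word (never negative) the two coincide: (3,3)
  print ((7 :: Word) `div` 2, (7 :: Word) `quot` 2)
  -- For non-negative Int, shifting right by one is the same as quot n 2: True
  print (all (\n -> shiftR n 1 == n `quot` 2) [0 .. 100000 :: Int])
  -- For negative Int, shiftR is an arithmetic shift, so it matches div: (-4,-4)
  print (shiftR m 1, m `div` 2)
```

This is why the shift is a safe replacement for `quot n 2` in `collatzNext`, where `n` is always positive, but not a general drop-in replacement for signed division.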
3. Checking evenness
The fastest way to check evenness is by testing the least significant bit. LLVM can optimize `even` to this, while the native code generator cannot. So, if we're on the native codegen, `even n` could be replaced with `n .&. 1 == 0`, and this gives a nice performance boost.
However, I found something of a performance bug with GHC 7.10. Here we don't get an inlined `even` for `Word`, which wrecks performance (calling a function with a heap-allocated `Word` box in the hottest part of the code does this). So here we should use `rem n 2 == 0` or `n .&. 1 == 0` instead of `even`. The `even` for `Int` gets inlined fine, though.
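Here is a minimal sketch of the low-bit test as a reusable helper. The name `evenBit` is hypothetical, and the `INLINE` pragma is my own precaution in light of the inlining issue above, not something the measurements in this answer rely on. The `main` just checks that it agrees with `even`.

```haskell
import Data.Bits (Bits, (.&.))

-- Even iff the least significant bit is clear. Because Int is two's
-- complement, this is also correct for negative Int values.
-- (Bits has Eq as a superclass, so == is available here.)
evenBit :: (Bits a, Num a) => a -> Bool
evenBit n = n .&. 1 == 0
{-# INLINE evenBit #-}

main :: IO ()
main = do
  -- Agrees with the Prelude's even on Int, negatives included: True
  print (all (\n -> evenBit n == even n) [-1000 .. 1000 :: Int])
  -- Works on Word as well: [True,False,True,False]
  print (map evenBit [0, 1, 2, maxBound :: Word])
```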
4. Fusing away lists in collatzLen
This is a crucial factor. The linked blog post is a bit out of date in this regard: GHC 7.8 can't do fusion here, but 7.10 can. This means that with GHC 7.10 and LLVM we can conveniently get C-like performance without significantly modifying the original code.
```haskell
import System.Environment (getArgs)

collatzNext a = (if even a then a else 3*a+1) `quot` 2
collatzLen a0 = length $ takeWhile (/= 1) $ iterate collatzNext a0
maxColLen n = maximum $ map collatzLen [1..n]

main = do
  [n] <- getArgs
  print $ maxColLen (read n :: Int)
```
With `ghc-7.10.1 -O2 -fllvm` and `n = 10000000`, the above program runs in 2.8 seconds, while the equivalent C program runs in 2.4 seconds. If I compile the same code without LLVM, I instead get a 12.4-second runtime. This slowdown is entirely due to the lack of optimization on `even`. If we use `a .&. 1 == 0`, the slowdown disappears.
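For the native codegen, the fix is just the one-line change to `collatzNext`. Below is a sketch of the program above with `even a` replaced by `a .&. 1 == 0`; the type signatures pinning everything to `Int` are my addition, and I haven't re-timed this exact file, so the speedup is the one described in the paragraph above rather than a new measurement.

```haskell
import Data.Bits ((.&.))
import System.Environment (getArgs)

-- Same list-based pipeline as above; only the evenness test changes.
collatzNext :: Int -> Int
collatzNext a = (if a .&. 1 == 0 then a else 3 * a + 1) `quot` 2

-- Counts the elements of the orbit before it reaches 1
collatzLen :: Int -> Int
collatzLen a0 = length $ takeWhile (/= 1) $ iterate collatzNext a0

maxColLen :: Int -> Int
maxColLen n = maximum $ map collatzLen [1 .. n]

main :: IO ()
main = do
  [n] <- getArgs
  print $ maxColLen (read n :: Int)
```

Compile with `ghc -O2` (no `-fllvm`) and pass `n` on the command line, e.g. `./Collatz 10000000`, where `Collatz` is just a hypothetical file and executable name.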
5. Fusing away lists when computing the maximum length
Not even GHC 7.10 can do this, so we have to resort to manual loop-writing.
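```haskell
import Data.Bits ((.&.), shiftR)
import System.Environment (getArgs)

collatzNext a = (if a .&. 1 == 0 then a else 3*a+1) `shiftR` 1
collatzLen = length . takeWhile (/= 1) . iterate collatzNext

-- Hand-written loop in place of maximum $ map collatzLen [1..n]
maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)
```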
Now, with `ghc-7.10.1 -O2 -fllvm` and `n = 10000000`, the above code runs in 2.1 seconds, while the C program runs in 2.4 seconds. If we want to achieve similar performance without LLVM and GHC 7.10, we just have to manually apply the important missing optimizations:
```haskell
import Data.Bits ((.&.), shiftR)
import System.Environment (getArgs)

-- Accumulating loop: no intermediate list is built, so no fusion is needed
collatzLen :: Int -> Int
collatzLen = go 0 where
  go l 1 = l
  go l n | n .&. 1 == 0 = go (l + 1) (shiftR n 1)
         | otherwise    = go (l + 1) (shiftR (3 * n + 1) 1)

maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)
```
Now, with `ghc-7.8.4 -O2` and `n = 10000000`, our code runs in 2.6 seconds.
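When hand-unrolling like this, it's cheap to check that the manual loop still computes the same lengths as the list-based definition. A sketch of such a check, with the two versions renamed `collatzLenList` and `collatzLenLoop` (hypothetical names) so they can live in one module; the upper bound of 100000 is arbitrary.

```haskell
import Data.Bits ((.&.), shiftR)

-- List-based version (section 5, with the bit tricks applied)
collatzLenList :: Int -> Int
collatzLenList = length . takeWhile (/= 1) . iterate step
  where step a = (if a .&. 1 == 0 then a else 3 * a + 1) `shiftR` 1

-- Hand-written loop from the final program
collatzLenLoop :: Int -> Int
collatzLenLoop = go 0
  where
    go l 1 = l
    go l n
      | n .&. 1 == 0 = go (l + 1) (shiftR n 1)
      | otherwise    = go (l + 1) (shiftR (3 * n + 1) 1)

main :: IO ()
main = do
  -- Both definitions agree on every start value in the range: True
  print (all (\n -> collatzLenList n == collatzLenLoop n) [1 .. 100000])
  -- Lengths for 1..8: [0,1,5,2,4,6,11,3]
  print (map collatzLenLoop [1 .. 8])
```

Note that both versions loop forever on `0`, which is why the ranges start at `1`.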