After a few hours of debugging, I realized that a very simple toy example was not efficient due to a missing `!` in the expression `return $ 1 + x` (thanks du…).
There is a specific class of space leaks that can be detected because they use excessive amounts of stack when they unwind the excessive heap usage. The following website lists the specific approaches, along with lots of case studies, but roughly:

- Run with `+RTS -K10K` to limit the stack to 10Kb.
- Run with `+RTS -xc` to get stack traces.

It's not a perfect approach, since sometimes you have memory leaks without excessive stack usage, and sometimes you have excessive stack usage without memory leaks, but the correspondence is pretty good and the tooling can be deployed on CI to stop introducing new leaks.
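As a hypothetical illustration (the module name, function name, and list size below are my own, not from the original post), a classic lazy-accumulator leak like this one unwinds a long thunk chain on the stack, so it should trip the 10Kb limit and let `-xc` point at the guilty cost centre:

```haskell
-- Leak.hs
-- Compile with:  ghc -prof -fprof-auto -rtsopts Leak.hs
-- Run with:      ./Leak +RTS -K10K -xc
-- The lazy accumulator builds a huge chain of (+) thunks; forcing the
-- result unwinds that chain on the stack, overflowing the 10Kb limit,
-- and -xc prints the cost-centre stack where the overflow happened.
leakySum :: Int -> Int
leakySum n = foldl (+) 0 [1 .. n]

main :: IO ()
main = print (leakySum 1000000)
```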
You ask:

> `return $ 1 + x` [...] but how come ghc does not optimize that??
The answer is that strict evaluation and lazy evaluation have subtly different semantics, so having GHC optimise it might break your program.
The difference lies in the treatment of undefined values. Any attempt to evaluate an `undefined` throws an exception. In GHCi:
```
Prelude> undefined
*** Exception: Prelude.undefined
CallStack (from HasCallStack):
  error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
  undefined, called at <interactive>:1:1 in interactive:Ghci1
```
If I have an expression that contains an `undefined` then the same thing happens:

```
Prelude> 2 + undefined
*** Exception: Prelude.undefined [...]
```
However, if the evaluation never gets to the `undefined` then everything is fine:

```
Prelude> True || undefined
True
```
Haskell uses "non-strict semantics" and "lazy evaluation". Technically the non-strict semantics are part of the definition of Haskell and lazy evaluation is the implementation mechanism in GHC, but you can think of them as synonyms. When you define a variable the value is not computed immediately, so if you never use the variable then you have no problem:
```
Prelude> let b = undefined
Prelude> b
*** Exception: Prelude.undefined
```
The `let` works fine, but evaluating the variable it defines throws an exception.
Now consider your towering stack of unevaluated `1+` calls. GHC has no way of knowing in advance whether you are ever going to use the result (see below), and it also has no way of knowing whether or not there is an exception lurking in there somewhere. As a programmer you might know that there is an exception and carefully not look at the result, relying on the non-strict semantics of Haskell. If GHC prematurely evaluates and gets an exception, your program will fail when it should not have.
Actually, the GHC compiler includes an optimisation pass called the Demand Analyser (it used to be called the Strictness Analyser) which looks for opportunities to optimise in exactly the way you want. However, it has limits: it can only optimise computations when it can prove that the result is going to be evaluated.
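For instance (a minimal sketch of my own, not from the original answer), the analyser can prove strictness in the first function below but not the second, where forcing the argument eagerly could raise an exception that the lazy program would never see:

```haskell
-- GHC's demand analysis can see that x is always needed here, so it is
-- safe to evaluate it eagerly instead of building a thunk.
alwaysUses :: Int -> Int
alwaysUses x = x + 1

-- Here x is demanded only when the flag is True, so evaluating it up
-- front could turn a working program into one that crashes on undefined.
sometimesUses :: Bool -> Int -> Int
sometimesUses flag x = if flag then x + 1 else 0
```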
Another wrinkle here is that you have used the State monad. This actually comes in two variants: Lazy and Strict. The Strict variant forces the state when it gets written, but the Lazy variant (the default) doesn't.
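As a small sketch (my own example, assuming the mtl package is available), the variant is chosen purely by which module you import, and `modify'` forces each new state value as it is written:

```haskell
import Control.Monad.State.Strict (State, execState, modify')
-- The default, lazy variant lives in Control.Monad.State.Lazy
-- (re-exported by Control.Monad.State).

-- modify' forces each increment as it is written, so the state never
-- accumulates a chain of (+1) thunks.
countUp :: Int -> State Int ()
countUp 0 = return ()
countUp n = modify' (+ 1) >> countUp (n - 1)

main :: IO ()
main = print (execState (countUp 1000000) 0)
```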
The main method for detecting memory leaks is heap profiling. Specifically, you're looking for unexpected growth in the amount of resident (mostly heap) memory: either the maximum residency in the `+RTS -s` statistics output, or -- more reliably -- a characteristic "pyramid" shape over time in heap profile output generated with the `+RTS -h<x>` flags and the `hp2ps` tool.
If I run your toy program with `+RTS -s`, I see:

```
   3,281,896,520 bytes allocated in the heap
   3,383,195,568 bytes copied during GC
     599,346,304 bytes maximum residency (17 sample(s))
       5,706,584 bytes maximum slop
             571 MB total memory in use (0 MB lost due to fragmentation)
```
The first line can generally be ignored. Haskell programs typically allocate a roughly constant amount of memory per second of runtime, and this allocation rate is either nearly zero (for certain unusual programs) or 0.5-2.0 gigabytes per second. This program ran for 4 seconds and allocated about 3.3 gigabytes, and that's not unusual.
The bytes copied during GC and maximum residency are concerning, though. Assuming you have a program that you expect to run in constant space (i.e., there's no ever-growing data structure whose entire contents are needed), a correctly functioning Haskell program will generally not need to copy much data during garbage collection and will tend to have a maximum residency that's a small fraction of the total bytes allocated (e.g., 100 kilobytes rather than half a gigabyte), and this won't grow substantially with the number of "iterations" of whatever it is you're testing.
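For comparison (a minimal sketch of my own, not from the original answer), a strictly accumulating loop like the one below is what "constant space" looks like: maximum residency stays tiny no matter how many iterations you run.

```haskell
{-# LANGUAGE BangPatterns #-}

-- A strict accumulator: each partial sum is forced on every step, so no
-- thunk chain builds up and residency stays roughly constant.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 10000000)
```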
You can generate a quick heap profile over time without turning on formal profiling. If you compile with the GHC flag `-rtsopts`, you can use:

```
./Toy +RTS -hT
```

and then display the result graphically using the `hp2ps` tool:

```
hp2ps -c -e8in Toy.hp
evince Toy.ps &
```
The resulting graph shows a pyramid pattern, which is a red flag: a rapid linear increase in heap to the tune of hundreds of megabytes per second, followed by a rapid linear collapse. This is the pattern you see when a huge lazy data structure is being needlessly built up before the entire computation is forced all at once. You see two pyramids because both your second and third tests are exhibiting memory leaks.
As an aside, the x-axis is in "MUT seconds" (seconds the "mutator" is running, which excludes garbage collection), so that's why this is less than the actual 4 second runtime. That's actually another red flag. A Haskell program that's spending half its time garbage collecting probably isn't running correctly.
To get more detail on what's causing this heap pyramid, you'll need to compile with profiling enabled. Profiling may cause a program to run somewhat slower, but doesn't normally change which optimizations are in place. However, the flag `-fprof-auto` (and related flags), which automatically inserts cost centers, has the potential to cause big performance changes (by interfering with inlining, etc.). Unfortunately, the cabal `--enable-profiling` flag turns on profiling (compiler flag `-prof`) and also the flag `-fprof-auto-top`, which automatically generates cost centers for top-level functions, so for your toy example, that substantially changes the behavior of your first test case (increasing the runtime from 0.4 seconds to 5 seconds, even with no `+RTS` flags). That may be the problem you're seeing with profiling affecting your results. You don't need any cost centers for several additional kinds of heap profiles, so you can add the cabal flag `--profiling-detail=none` to shut that off, and then your profiled program should run with timing a little slower but generally similar performance to the unprofiled version.
I don't use Cabal, but compiling with the following (which should be the equivalent of `--enable-profiling --profiling-detail=none`):

```
ghc -O2 -rtsopts -prof Toy.hs    # no -fprof-auto...
```
I can run your program with profiling by data type:

```
./Toy +RTS -hy
```
If I look at the heap profile graph, it attributes most of the heap to the `Int` type -- this narrows my problem down to a bunch of unevaluated lazy `Int` calculations, which might point me in the right direction.
If I'm really having trouble narrowing things down and am feeling like a technical deep-dive, I can also run a heap profile by closure (flag `-hd`). This tells me that the culprits are `Main.sat_s7mQ` and `Main.sat_s7kP` for the two pyramids respectively. This looks very mysterious, but they're the names of functions in the "STG", a low-level intermediate representation of my program generated by the compiler.
If I recompile with the same flags but add `-fforce-recomp -ddump-stg -dsuppress-all`:

```
ghc -O2 -rtsopts -prof -fforce-recomp -ddump-stg -dsuppress-all Toy.hs
```

this will dump the STG that contains the definitions of these two functions. (The generated identifiers can differ with small changes to code and/or compiler flags, so it's best to recompile with the STG dumped and then re-profile that executable, to make sure the identifiers match.)
If I search the STG for the first culprit, I find the definition:

```
sat_s7mQ =
    CCCS \u []
        case ww2_s7mL of {
          I# y_s7mO ->
              case +# [1# y_s7mO] of sat_s7mP {
                __DEFAULT -> I# [sat_s7mP];
              };
        };
```

Yes, this is all very technical, but this is STG-speak for the expression `1 + y`, which would help me zero in on the culprit.
If you don't speak STG, you can try introducing some cost centers. For example, I tried profiling only your second test case with `-fprof-auto` (Cabal flag `--profiling-detail=all-functions`). The profile output in `Toy.prof` isn't that useful for memory leaks, because it deals with total allocation instead of active (i.e., resident and not garbage collected) allocations over time, but you can create a heap profile by cost center by running:

```
./Toy +RTS -hc
```
In this case, it attributes everything to a single cost center, namely `(315)countNumberCalls`. The "315" is the cost center number, which you can look up in the `Toy.prof` output to find the exact source code lines, if it's not clear from the name. Anyway, this at least helps narrow down the problem to `countNumberCalls`.
For more complicated functions, you can sometimes narrow down the problem further by manually specifying cost centers, like so:
```haskell
countNumberCalls :: (LogFunctionCalls m) => Int -> m Int
countNumberCalls 0 = return 0
countNumberCalls n = do
  {-# SCC "mytell_call" #-} myTell "countNumberCalls" 1
  x <- {-# SCC "recursive_call" #-} countNumberCalls $! n - 1
  {-# SCC "return_statment" #-} return $ {-# SCC "one_plus_x" #-} 1 + x
```
This actually attributes everything to "recursive_call", so it's not that helpful.
It's not wrong, though. You actually have two memory leaks here -- the `x <- countNumberCalls $! n - 1` leaks heap because `x` isn't forced, and the `1 + x` leaks stack. You could enable the `BangPatterns` extension and write:

```haskell
!x <- countNumberCalls $! n - 1
```
and that would actually remove one of the memory leaks, speeding up the second case from 2.5 seconds to 1.0 seconds and dropping the maximum residency from 460 megabytes to 95 megabytes (and the bytes copied during GC from 1.5 gigabytes to 73 kilobytes!). However, a heap profile would show linearly growing stack accounting for pretty much all of that resident memory. Because stack isn't as well-tracked as heap, that would be more difficult to track down.
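Putting the two forcing changes together -- the `!x` pattern above and the missing `!` in the return that the question ultimately identified -- a sketch of the forced version might look like this (it still assumes the question's hypothetical `LogFunctionCalls` class and `myTell` method):

```haskell
{-# LANGUAGE BangPatterns #-}

countNumberCalls :: (LogFunctionCalls m) => Int -> m Int
countNumberCalls 0 = return 0
countNumberCalls n = do
  myTell "countNumberCalls" 1
  !x <- countNumberCalls $! n - 1   -- force the result of the recursive call
  return $! 1 + x                   -- force the addition instead of returning a thunk
```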
Some additional notes:
Even though the `+RTS -h<x>` flags are primarily for heap profiling (and are discussed as "heap profiling" options in the GHC documentation), they can technically report on other uses of resident memory besides heap, including per-thread state, which includes thread state objects and stack. By default, when running a profiled binary (compiled with `-prof`), the `+RTS -h<x>` flags do not report on per-thread state including stack, but you can add the `-xt` flag to add it, as in `+RTS -hc -xt`. Due to a probable unintentional oversight, on a non-profiled binary, the `+RTS -hT` flag (the only `-h<x>` flag available) includes stack even without the `-xt` flag. Due to a compiler bug, the `-hT` flag doesn't work on profiled binaries for GHC 8.6.x and earlier, but it does work on GHC 8.8.x, and for that version, `+RTS -hT` includes stack on non-profiled binaries but excludes it on profiled binaries unless you also specify `-xt`. That's why in the examples above, "Stack" only shows up when running a heap profile on a non-profiled binary. You can add the `-xt` flag to see it for all the other heap profiles. Note that this "STACK" is actual stack use, rather than objects on the heap that are somehow affiliated with the stack.
Black holes are primarily a mechanism for supporting concurrency. When a thread starts evaluating a thunk, it "blackholes" it (i.e., marks it as a black hole), so that if another thread comes along and wants to evaluate the same thunk, it waits on the evaluation instead of trying to re-evaluate it in parallel (which would duplicate the effort of the running thread). It's also used in the non-threaded runtime, partly because it can detect infinite loops (if a thread encounters its own black hole), but also for some more important reasons that I can't remember. For `-hT`, `-hd`, and `-hy` heap profiling, heap objects that have been blackholed like this will be marked as "BLACKHOLE". The limited sampling rate in the profiles above can make it a little unclear, but what's happening in your program is that a long series of `Int` thunks is being built up in a chain, and when the value is finally forced, they are turned into a long chain of `BLACKHOLE`s, each of which represents a computation that's been initiated and is waiting on the next computation in the chain.