The Problem
Hello! I'm writing a logging library and I would love to create a logger that would run in a separate thread, while all application threads …
If I had to guess why pipes-concurrency performs more poorly, it's because every read and write is wrapped in an STM transaction, whereas the other libraries use more efficient low-level concurrency primitives.
So I can give you a little overview of some of the analysis of Chan and TQueue (which pipes-concurrency is using internally here) that motivated some design decisions that went into unagi-chan. I'm not sure if it will answer your question. I recommend forking different queues and playing with variations while benchmarking to get a really good sense of what is going on.
Chan looks like:
data Chan a
  = Chan (MVar (Stream a)) -- pointer to "head", where we read from
         (MVar (Stream a)) -- pointer to "tail", where values are written to

type Stream a = MVar (ChItem a)

data ChItem a = ChItem a (Stream a)
It's a linked list of MVars. The two MVars in the Chan type act as pointers to the current head and tail of the list, respectively. This is what a write looks like:
writeChan :: Chan a -> a -> IO ()
writeChan (Chan _ writeVar) val = do
  new_hole <- newEmptyMVar
  mask_ $ do
    old_hole <- takeMVar writeVar            -- [1]
    putMVar old_hole (ChItem val new_hole)   -- [2]
    putMVar writeVar new_hole                -- [3]
At [1] the writer takes a lock on the write end, at [2] our item a is made available to the reader, and at [3] the write end is unlocked for other writers.
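For contrast, the read side only ever locks the read end; this is roughly what readChan in base does, lightly simplified (the real version uses modifyMVarMasked for exception safety):

readChan :: Chan a -> IO a
readChan (Chan readVar _) = modifyMVar readVar $ \read_end -> do
  ChItem val new_read_end <- readMVar read_end  -- blocks until a writer fills this hole at [2]
  return (new_read_end, val)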
This actually performs pretty well in a single-consumer/single-producer scenario (see the graph here) because reads and writes don't contend. But once you have multiple concurrent writers you can start having trouble:
a writer that hits [1] while another writer is at [2] will block and be descheduled (the fastest I've been able to measure a context switch is ~150 ns, which is pretty darn fast, but there are probably situations where it is much slower). So when you get many writers contending, you're basically making a big round trip through the scheduler, onto a wait queue for the MVar, and only then can the write complete.
When a writer gets descheduled (e.g. because its time slice ran out) while at [2], it holds onto a lock and no writes can complete until it is rescheduled again; this becomes more of an issue when we're over-subscribed, i.e. when our thread-to-core ratio is high.
Finally, using an MVar per item requires some overhead in terms of allocation, and more importantly, when we accumulate many mutable objects we can cause a lot of GC pressure.
TQueue is great because STM makes it super simple to reason about its correctness. It's a functional dequeue-style queue, and a write consists of simply reading the writer stack, consing our element, and writing it back:
data TQueue a = TQueue (TVar [a])
                       (TVar [a])

writeTQueue :: TQueue a -> a -> STM ()
writeTQueue (TQueue _ write) a = do
  listend <- readTVar write     -- a transaction with a consistent
  writeTVar write (a:listend)   -- view of memory
If one writeTQueue reads the stack and, before it can write its new stack back, another interleaved writeTQueue commits, the first write will be retried. As more writeTQueues are interleaved the effect of contention worsens. However, performance degrades much more slowly than in Chan because there is only a single writeTVar operation that can invalidate competing writeTQueues, and the transaction is very small (just a read and a (:)).
A read pops from its own stack; when that stack is empty, it "dequeues" the whole stack from the write side, reverses it, and stores the reversed stack in its own variable for easy "popping" (altogether this gives us amortized O(1) push and pop):
readTQueue :: TQueue a -> STM a
readTQueue (TQueue read write) = do
  xs <- readTVar read
  case xs of
    (x:xs') -> do writeTVar read xs'
                  return x
    [] -> do ys <- readTVar write
             case ys of
               [] -> retry
               _  -> case reverse ys of
                       []     -> error "readTQueue"
                       (z:zs) -> do writeTVar write []
                                    writeTVar read zs
                                    return z
Readers have a moderate contention issue symmetrical to the writers'. In the general case readers and writers don't contend; however, when the reader stack is depleted, readers are contending with other readers and writers. I suspect if you pre-loaded a TQueue with enough values and then launched 4 readers and 4 writers you might be able to induce livelock as the reverse struggles to complete before the next write. It's also interesting to note that, unlike with MVar, a write to a TVar on which many readers are waiting wakes them all simultaneously (this might be more or less efficient, depending on the scenario).
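If you want to poke at that hypothesis, a rough sketch of the experiment might look like the following; the element count, thread counts, and run time are arbitrary choices of mine, not anything from the benchmarks above. Compile with -threaded and run with +RTS -N so the readers and writers actually run in parallel.

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TQueue (TQueue, newTQueueIO, readTQueue, writeTQueue)
import Control.Monad (forM_, forever, replicateM_)

main :: IO ()
main = do
  q <- newTQueueIO :: IO (TQueue Int)
  -- pre-load the queue so the first reads have a long list to reverse
  atomically $ forM_ [1 .. 100000] (writeTQueue q)
  -- 4 writers keep consing onto the write stack...
  replicateM_ 4 $ forkIO $ forever $ atomically (writeTQueue q 0)
  -- ...while 4 readers race to commit the expensive reverse
  replicateM_ 4 $ forkIO $ forever $ atomically (readTQueue q)
  threadDelay (5 * 1000 * 1000)  -- let it churn for ~5 seconds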
I suspect you don't see much of the weaknesses of TQueue in your test; primarily you're seeing the moderate effects of write contention and the overhead of allocating and GC'ing a lot of mutable objects.
unagi-chan was designed first and foremost to handle contention well. It's conceptually very simple, but the implementation has some complexities:
data ChanEnd a = ChanEnd AtomicCounter (IORef (Int, Stream a))
data Stream a  = Stream (Array (Cell a)) (IORef (Maybe (Stream a)))
data Cell a    = Empty | Written a | Blocking (MVar a)
The read and write sides of the queue share the Stream on which they coordinate passing values (from writer to reader) and indications of blocking (from reader to writer), and the read and write sides each have an independent atomic counter. A write works like this:
1. a writer calls the atomic incrCounter on the write counter to receive its unique index, on which it will coordinate with its (single) reader
2. the writer finds its cell in the array and performs a CAS of Written a
3. if successful it exits; otherwise it sees that a reader has beaten it there and is blocking (or proceeding to block), so it does a (\(Blocking v) -> putMVar v a) and exits.
A read works in a similar and obvious way.
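To make steps 2 and 3 concrete, here is a hedged, runnable sketch of the writer's CAS-or-wake logic for a single cell. This is not the actual unagi-chan code: it models the cell as an IORef instead of a slot in the array above, and it ignores stream extension, the index arithmetic, and async exceptions.

import Control.Concurrent.MVar (MVar, putMVar)
import Data.Atomics (casIORef, peekTicket, readForCAS)
import Data.IORef (IORef)

data Cell a = Empty | Written a | Blocking (MVar a)  -- as defined above

-- One writer and one reader are assigned to this cell by their counters;
-- the writer either hands its value over directly or wakes the reader
-- that got there first.
writeCell :: IORef (Cell a) -> a -> IO ()
writeCell cellRef a = do
  tkt <- readForCAS cellRef
  case peekTicket tkt of
    Blocking v -> putMVar v a            -- a reader is already parked here
    Written _  -> error "impossible: only one writer owns this index"
    Empty      -> do
      (won, tkt') <- casIORef cellRef tkt (Written a)
      if won
        then return ()                   -- success: the reader will find 'Written a'
        else case peekTicket tkt' of     -- a reader snuck in between our read and CAS
               Blocking v -> putMVar v a
               _          -> error "impossible: only one writer owns this index"

A reader is roughly the mirror image: a simple non-atomic read of the cell on the fast path, otherwise it installs a Blocking MVar and waits on it.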
The first innovation is to make the point of contention an atomic operation which doesn't degrade under contention (as a CAS/retry loop or a Chan-like lock would). Based on simple benchmarking and experimenting, the fetch-and-add primop, exposed by the atomic-primops library, works best.
Then in step 2 both reader and writer need to perform only one compare-and-swap (the fast path for the reader is a simple non-atomic read) to finish coordination.
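As a small, self-contained illustration of why fetch-and-add is a nice point of contention, here is a sketch using the counter from atomic-primops (Data.Atomics.Counter): many threads can hammer on the counter at once, and every incrCounter call completes in one step, with no lock to queue on and no transaction to retry.

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM, replicateM_)
import Data.Atomics.Counter (incrCounter, newCounter, readCounter)

main :: IO ()
main = do
  ctr   <- newCounter 0
  dones <- replicateM 8 newEmptyMVar
  forM_ dones $ \done -> forkIO $ do
    -- each call is a single fetch-and-add; the value it returns is what a
    -- writer would use as its unique index into the stream
    replicateM_ 10000 (incrCounter 1 ctr)
    putMVar done ()
  mapM_ takeMVar dones
  print =<< readCounter ctr  -- 80000: no increments lost, no blocking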
So to try to make unagi-chan good, we:
- use fetch-and-add to handle the point of contention
- use lock-free techniques such that, when we're oversubscribed, a thread being descheduled at inopportune times doesn't block progress for other threads (a blocked writer may block at most the reader "assigned" to it by the counter; read the caveats re. async exceptions in the unagi-chan docs, and note that Chan has nicer semantics here)
- use an array to store our elements, which has better locality (but see below), lower overhead per element, and puts less pressure on the GC
A final note re. using an array: concurrent writes to an array are generally a bad idea for scaling, because you cause a lot of cache-coherence traffic as cache lines are invalidated back and forth across your writer threads; the general term is "false sharing". But there are also cache-related upsides, and the alternative designs I can think of (e.g. striping writes) have their own downsides; I've been experimenting with this a bit but don't have anything conclusive at the moment.
One place where we are legitimately concerned with false sharing is in our counter, which we align and pad out to 64 bytes; this did indeed show up in benchmarks, and the only downside is the increased memory usage.