The Problem
Hello! I'm writing a logging library and I would love to create a logger that would run in a separate thread, while all application threads …
If I had to guess why pipes-concurrency performs more poorly, it's because every read and write is wrapped in an STM transaction, whereas the other libraries use more efficient low-level concurrency primitives.
So I can give you a little overview of some of the analysis of Chan and TQueue (which pipes-concurrency is using internally here) that motivated some design decisions that went into unagi-chan. I'm not sure if it will answer your question. I recommend forking different queues and playing with variations while benchmarking to get a really good sense of what is going on.
Chan looks like:
data Chan a
  = Chan (MVar (Stream a)) -- pointer to "head", where we read from
         (MVar (Stream a)) -- pointer to "tail", where values are written to

type Stream a = MVar (ChItem a)

data ChItem a = ChItem a (Stream a)
It's a linked list of MVars. The two MVars in the Chan type act as pointers to the current head and tail of the list, respectively. This is what a write looks like:
writeChan :: Chan a -> a -> IO ()
writeChan (Chan _ writeVar) val = do
  new_hole <- newEmptyMVar
  mask_ $ do
    old_hole <- takeMVar writeVar            -- [1]
    putMVar old_hole (ChItem val new_hole)   -- [2]
    putMVar writeVar new_hole                -- [3]
At [1] the writer takes a lock on the write end, at [2] our item a is made available to the reader, and at [3] the write end is unlocked for other writers.
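For contrast, the read side only ever locks the read end; this is roughly what readChan in base does, lightly simplified (the real version uses modifyMVarMasked for exception safety):

readChan :: Chan a -> IO a
readChan (Chan readVar _) = modifyMVar readVar $ \read_end -> do
  ChItem val new_read_end <- readMVar read_end  -- blocks until a writer fills this hole at [2]
  return (new_read_end, val)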
This actually performs pretty well in a single-consumer/single-producer scenario (see the graph here) because reads and writes don't contend. But once you have multiple concurrent writers you can start having trouble:
a writer that hits [1] while another writer is at [2] will block and be descheduled (the fastest I've been able to measure a context switch is ~150 ns, which is pretty darn fast, but there are probably situations where it is much slower). So when you get many writers contending, you're basically making a big round trip through the scheduler, onto a wait queue for the MVar, and only then can the write complete.
When a writer gets descheduled (e.g. because its time slice ran out) while at [2], it holds onto a lock and no writes can complete until it is rescheduled again; this becomes more of an issue when we're over-subscribed, i.e. when our thread-to-core ratio is high.
Finally, using an MVar per item requires some overhead in terms of allocation, and more importantly, when we accumulate many mutable objects we can cause a lot of GC pressure.
TQueue is great because STM makes it super simple to reason about its correctness. It's a functional dequeue-style queue, and a write consists of simply reading the writer stack, consing our element, and writing it back:
data TQueue a = TQueue (TVar [a])
                       (TVar [a])

writeTQueue :: TQueue a -> a -> STM ()
writeTQueue (TQueue _ write) a = do
  listend <- readTVar write     -- a transaction with a consistent
  writeTVar write (a:listend)   -- view of memory
If one writeTQueue reads the stack and, before it can write its new stack back, another interleaved writeTQueue commits, the first write will be retried. As more writeTQueues are interleaved the effect of contention worsens. However, performance degrades much more slowly than in Chan because there is only a single writeTVar operation that can invalidate competing writeTQueues, and the transaction is very small (just a read and a (:)).
A read pops from its own stack; when that stack is empty, it "dequeues" the whole stack from the write side, reverses it, and stores the reversed stack in its own variable for easy "popping" (altogether this gives us amortized O(1) push and pop):
readTQueue :: TQueue a -> STM a
readTQueue (TQueue read write) = do
  xs <- readTVar read
  case xs of
    (x:xs') -> do writeTVar read xs'
                  return x
    [] -> do ys <- readTVar write
             case ys of
               [] -> retry
               _  -> case reverse ys of
                       []     -> error "readTQueue"
                       (z:zs) -> do writeTVar write []
                                    writeTVar read zs
                                    return z
Readers have a moderate contention issue symmetrical to the writers'. In the general case readers and writers don't contend; however, when the reader stack is depleted, readers are contending with other readers and writers. I suspect if you pre-loaded a TQueue with enough values and then launched 4 readers and 4 writers you might be able to induce livelock as the reverse struggles to complete before the next write. It's also interesting to note that, unlike with MVar, a write to a TVar on which many readers are waiting wakes them all simultaneously (this might be more or less efficient, depending on the scenario).
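If you want to poke at that hypothesis, a rough sketch of the experiment might look like the following; the element count, thread counts, and run time are arbitrary choices of mine, not anything from the benchmarks above. Compile with -threaded and run with +RTS -N so the readers and writers actually run in parallel.

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TQueue (TQueue, newTQueueIO, readTQueue, writeTQueue)
import Control.Monad (forM_, forever, replicateM_)

main :: IO ()
main = do
  q <- newTQueueIO :: IO (TQueue Int)
  -- pre-load the queue so the first reads have a long list to reverse
  atomically $ forM_ [1 .. 100000] (writeTQueue q)
  -- 4 writers keep consing onto the write stack...
  replicateM_ 4 $ forkIO $ forever $ atomically (writeTQueue q 0)
  -- ...while 4 readers race to commit the expensive reverse
  replicateM_ 4 $ forkIO $ forever $ atomically (readTQueue q)
  threadDelay (5 * 1000 * 1000)  -- let it churn for ~5 seconds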
I suspect you don't see much of the weaknesses of TQueue in your test; primarily you're seeing the moderate effects of write contention and the overhead of allocating and GC'ing a lot of mutable objects.
unagi-chan was designed first and foremost to handle contention well. It's conceptually very simple, but the implementation has some complexities:
data ChanEnd a = ChanEnd AtomicCounter (IORef (Int, Stream a))
data Stream a  = Stream (Array (Cell a)) (IORef (Maybe (Stream a)))
data Cell a    = Empty | Written a | Blocking (MVar a)
The read and write sides of the queue share the Stream on which they coordinate passing values (from writer to reader) and indications of blocking (from reader to writer), and the read and write sides each have an independent atomic counter. A write works like this:
1. a writer calls the atomic incrCounter on the write counter to receive its unique index, on which it will coordinate with its (single) reader
2. the writer finds its cell in the array and performs a CAS of Written a
3. if successful it exits; otherwise it sees that a reader has beaten it there and is blocking (or proceeding to block), so it does a (\(Blocking v) -> putMVar v a) and exits.
A read works in a similar and obvious way.
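To make steps 2 and 3 concrete, here is a hedged, runnable sketch of the writer's CAS-or-wake logic for a single cell. This is not the actual unagi-chan code: it models the cell as an IORef instead of a slot in the array above, and it ignores stream extension, the index arithmetic, and async exceptions.

import Control.Concurrent.MVar (MVar, putMVar)
import Data.Atomics (casIORef, peekTicket, readForCAS)
import Data.IORef (IORef)

data Cell a = Empty | Written a | Blocking (MVar a)  -- as defined above

-- One writer and one reader are assigned to this cell by their counters;
-- the writer either hands its value over directly or wakes the reader
-- that got there first.
writeCell :: IORef (Cell a) -> a -> IO ()
writeCell cellRef a = do
  tkt <- readForCAS cellRef
  case peekTicket tkt of
    Blocking v -> putMVar v a            -- a reader is already parked here
    Written _  -> error "impossible: only one writer owns this index"
    Empty      -> do
      (won, tkt') <- casIORef cellRef tkt (Written a)
      if won
        then return ()                   -- success: the reader will find 'Written a'
        else case peekTicket tkt' of     -- a reader snuck in between our read and CAS
               Blocking v -> putMVar v a
               _          -> error "impossible: only one writer owns this index"

A reader is roughly the mirror image: a simple non-atomic read of the cell on the fast path, otherwise it installs a Blocking MVar and waits on it.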
The first innovation is to make the point of contention an atomic operation which doesn't degrade under contention (as a CAS/retry loop or a Chan-like lock would). Based on simple benchmarking and experimenting, the fetch-and-add primop, exposed by the atomic-primops library, works best.
Then in step 2 both reader and writer need to perform only one compare-and-swap (the fast path for the reader is a simple non-atomic read) to finish coordination.
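As a small, self-contained illustration of why fetch-and-add is a nice point of contention, here is a sketch using the counter from atomic-primops (Data.Atomics.Counter): many threads can hammer on the counter at once, and every incrCounter call completes in one step, with no lock to queue on and no transaction to retry.

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM, replicateM_)
import Data.Atomics.Counter (incrCounter, newCounter, readCounter)

main :: IO ()
main = do
  ctr   <- newCounter 0
  dones <- replicateM 8 newEmptyMVar
  forM_ dones $ \done -> forkIO $ do
    -- each call is a single fetch-and-add; the value it returns is what a
    -- writer would use as its unique index into the stream
    replicateM_ 10000 (incrCounter 1 ctr)
    putMVar done ()
  mapM_ takeMVar dones
  print =<< readCounter ctr  -- 80000: no increments lost, no blocking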
So to try to make unagi-chan good, we:
- use fetch-and-add to handle the point of contention
- use lock-free techniques such that, when we're oversubscribed, a thread being descheduled at inopportune times doesn't block progress for other threads (a blocked writer may block at most the reader "assigned" to it by the counter; read the caveats re. async exceptions in the unagi-chan docs, and note that Chan has nicer semantics here)
- use an array to store our elements, which has better locality (but see below), lower overhead per element, and puts less pressure on the GC
A final note re. using an array: concurrent writes to an array are generally a bad idea for scaling, because you cause a lot of cache-coherence traffic as cache lines are invalidated back and forth across your writer threads; the general term is "false sharing". But there are also cache-related upsides, and the alternative designs I can think of (e.g. striping writes) have their own downsides; I've been experimenting with this a bit but don't have anything conclusive at the moment.
One place where we are legitimately concerned with false sharing is in our counter, which we align and pad out to 64 bytes; this did indeed show up in benchmarks, and the only downside is the increased memory usage.