I'm writing a program where a large number of agents listen for events and react to them. Since Control.Concurrent.Chan.dupChan is deprecated, I decided to use TChan instead.
This is a great test case! I think you've actually created a rare instance of genuine livelock/starvation. We can test this by compiling with -eventlog and running with -vst, or by compiling with -debug and running with -Ds. We see that even as the program "hangs", the runtime is still working like crazy, jumping between blocked threads.
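For reference, the invocations look roughly like this (assuming the module is Main.hs; -threaded and -rtsopts are added here so the RTS options can be passed on the command line):

    ghc -O2 -threaded -rtsopts -eventlog Main.hs
    ./Main +RTS -N -vst      # trace scheduler events to stderr

    ghc -O2 -threaded -rtsopts -debug Main.hs
    ./Main +RTS -N -Ds       # scheduler debug output (debug RTS)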
The high-level reason is that you have one (fast) writer and many (fast) readers. The readers and the writer both need to access the same TVar representing the end of the queue. Let's say that, nondeterministically, one thread succeeds and all others fail when this happens. Now, as we increase the number of threads in contention to 100*100, the probability of any given reader making progress rapidly goes towards zero. In the meantime, the writer's access to that TVar in fact takes longer than the readers', which makes things even worse for it.
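For concreteness, here is a minimal sketch of the shape being described: one fast writer and 10,000 readers all blocked on duplicates of the same TChan. The counts and loop bodies are assumptions for illustration, not the original code.

    import Control.Concurrent (forkIO)
    import Control.Concurrent.STM
    import Control.Monad (forM_, forever, replicateM_, void)

    main :: IO ()
    main = do
      chan <- newTChanIO
      -- 10,000 reader threads, each blocking on its own duplicate of the
      -- channel, so every write wakes all of them.
      replicateM_ 10000 $ do
        myChan <- atomically (dupTChan chan)
        void $ forkIO $ forever $
          void $ atomically $ readTChan myChan
      -- One writer stuffing events into the channel as fast as it can.
      forM_ [1 .. 100000 :: Int] $ \i ->
        atomically (writeTChan chan i)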
In this instance, putting a tiny throttle between each invocation of go for the writer (say, threadDelay 100) is enough to fix the problem. It gives the readers enough time to all block between successive writes, and so eliminates the livelock. However, I do think it would be an interesting problem to improve the behavior of the runtime scheduler to deal with situations like this.
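A sketch of what that throttled writer loop could look like (go here is just a stand-in for whatever the original writer loop sends):

    import Control.Concurrent (threadDelay)
    import Control.Concurrent.STM

    -- Writer loop with a small delay between successive writes, giving all
    -- the readers time to block on the channel again before the next event.
    writer :: TChan Int -> [Int] -> IO ()
    writer chan = go
      where
        go []       = return ()
        go (e : es) = do
          atomically (writeTChan chan e)
          threadDelay 100     -- ~100 microseconds
          go es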
Adding to what Neil said, your code also has a space leak, noticeable with smaller n. After fixing the obvious tuple build-up by making the tuples strict, I was left with a profile that, I think, is explained by the main thread writing data to the shared TChan faster than the worker threads can read it (TChan, like Chan, is unbounded). So the worker threads spend most of their time re-executing their respective STM transactions, while the main thread is busy stuffing even more data into the channel; this explains why your program hangs.
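As an illustration of the strictness fix, one way to force the values before they ever reach the channel (the Event type and its fields here are assumptions, not the poster's actual types):

    {-# LANGUAGE BangPatterns #-}
    import Control.Concurrent.STM

    -- A strict pair: the strict fields force both components as soon as the
    -- constructor itself is forced, so unevaluated thunks don't accumulate
    -- inside the channel.
    data Event = Event !Int !Int

    writeEvent :: TChan Event -> Int -> Int -> IO ()
    writeEvent chan x y =
      let !e = Event x y          -- force the event before it is enqueued
      in atomically (writeTChan chan e)

A bounded channel such as TBQueue from the stm package would also keep the writer from running arbitrarily far ahead of the readers, since writeTBQueue retries when the queue is full.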
The program is going to perform quite badly. You're spawning off 10,000 threads, all of which will queue up waiting for a single TVar to be written to, so each item will cause O(10,000) processing. If you see 100 events per second, that is roughly 10 ms per event, which spread over 10,000 wake-ups means each thread requires about 1 microsecond to wake up, read a couple of TVars, write to one and queue up again. That doesn't seem so unreasonable. I don't understand why the program would grind to a complete halt, though.
In general, I would scrap this design and replace it as follows:
Have a single thread reading the event channel, which maintains a map from coordinate to interested-receiver-channel. The single thread can then pick out the receiver(s) from the map in O(log N) time (much better than O(N), and with a much smaller constant factor involved), and send the event to just the interested receiver. So you perform just one or two communications to the interested party, rather than 10,000 communications to everyone. A list-based form of the idea is written in CHP in section 5.4 of this paper: http://chplib.files.wordpress.com/2011/05/chp.pdf
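A rough sketch of that single-dispatcher design using Data.Map (the Coord type, the Event shape and how receivers get registered are assumptions for illustration):

    import Control.Concurrent.STM
    import Control.Monad (forever)
    import qualified Data.Map.Strict as Map

    -- Assumed types for illustration.
    type Coord = (Int, Int)
    data Event = Event { evCoord :: Coord, evPayload :: String }

    -- The single dispatcher thread owns the map from coordinate to the
    -- channel of the agent interested in that coordinate.
    dispatcher :: TChan Event -> Map.Map Coord (TChan Event) -> IO ()
    dispatcher events receivers = forever $ do
      ev <- atomically (readTChan events)
      -- O(log N) lookup instead of waking every agent on every event.
      case Map.lookup (evCoord ev) receivers of
        Just dest -> atomically (writeTChan dest ev)
        Nothing   -> return ()    -- no one registered for this coordinate

Each agent then blocks on its own small channel, so an event wakes only the one or two threads that actually care about it.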