Some time ago, I found a nice solution to this problem. I believe it is the smallest one found so far.
The repository has an example of how to use it to create N threads (readers and writers) and make them share a single seat.
I ran some benchmarks on the test example and got the following results (in million ops/sec):
By buffer size
By number of threads
Notice how the number of threads does not change the throughput.
I think this is the ultimate solution to this problem. It works, and it is incredibly fast and simple, even with hundreds of threads and a queue with a single position. It can be used as a pipeline between threads, allocating space inside the queue.
The repository has some early versions written in C# and Pascal. I'm working on something more complete and polished to show its real power.
I hope some of you can validate the work or help with some ideas. Or, at the very least, can you break it?
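
For anyone who wants to try breaking it, here is a minimal sketch of the kind of stress test I have in mind. It is not the code from the repository: it is written in Python, and it uses the standard library's `queue.Queue(maxsize=1)` purely as a stand-in for the single-seat queue, so the real queue would have to be swapped in to test it. The function name `stress` and its parameters are made up for illustration. The check is that every item written by any writer is read exactly once by some reader.

```python
import queue
import threading
from collections import Counter


def stress(n_writers=4, n_readers=4, items_per_writer=1000):
    """Run writers and readers over a single-slot queue; return what was read."""
    # Stand-in for the queue under test: a lock-based bounded queue with one seat.
    seat = queue.Queue(maxsize=1)
    received = Counter()
    received_lock = threading.Lock()
    stop = object()  # sentinel telling a reader to exit

    def writer(writer_id):
        for i in range(items_per_writer):
            seat.put((writer_id, i))  # blocks while the seat is occupied

    def reader():
        local = Counter()
        while True:
            item = seat.get()  # blocks while the seat is empty
            if item is stop:
                break
            local[item] += 1
        with received_lock:
            received.update(local)  # merge this reader's counts into the total

    writers = [threading.Thread(target=writer, args=(w,)) for w in range(n_writers)]
    readers = [threading.Thread(target=reader) for _ in range(n_readers)]
    for t in writers + readers:
        t.start()
    for t in writers:
        t.join()
    for _ in readers:
        seat.put(stop)  # one sentinel per reader, sent after all real items
    for t in readers:
        t.join()
    return received


if __name__ == "__main__":
    got = stress()
    expected = {(w, i) for w in range(4) for i in range(1000)}
    # Every item must arrive exactly once: nothing lost, nothing duplicated.
    assert set(got) == expected
    assert all(count == 1 for count in got.values())
    print("ok:", len(got), "items delivered exactly once")
```

Raising `n_writers` and `n_readers` into the hundreds while keeping the single seat is the interesting case. A lost or duplicated item shows up as a failed assertion.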