Circular lock-free buffer

悲哀的现实 2020-11-29 15:00

I'm in the process of designing a system which connects to one or more streams of data feeds, does some analysis on the data, and then triggers events based on the results. In a t…

18 Answers
  • 2020-11-29 15:50

    Sutter's queue is sub-optimal, and he knows it. The Art of Multiprocessor Programming is a great reference, but don't trust the Java guys on memory models, period. Ross's links will get you no definitive answer either, because the libraries they point to ran into exactly these problems, and so on.

    Doing lock-free programming is asking for trouble, unless you want to spend a lot of time on something you are clearly over-engineering before solving the actual problem (judging by the description, this is the common madness of 'looking for perfection' in cache coherency). It takes years, and it leads to not solving the problem first and optimising later, a common disease.

  • 2020-11-29 15:52

    The term of art for what you want is a lock-free queue. There's an excellent set of notes with links to code and papers by Ross Bencina. The guy whose work I trust the most is Maurice Herlihy (for Americans, he pronounces his first name like "Morris").

  • 2020-11-29 15:53

    Here is how I would do it:

    • map the queue into an array
    • keep state with next-read and next-write indexes
    • keep an empty/full bit vector around

    Insertion consists of using a CAS to increment and roll over the next-write index. Once you have a slot, store your value and then set the empty/full bit that matches it.

    Removals require checking the bit first to test for underflow, but other than that they mirror the write: use the read index and clear the empty/full bit. (A sketch along these lines follows the caveats below.)

    Be warned,

    1. I'm no expert in these things.
    2. Atomic ASM ops seem to be very slow when I've used them, so if you end up with more than a few of them, you might be faster using locks embedded inside the insert/remove functions. The theory is that a single atomic op to grab the lock, followed by (very) few non-atomic ASM ops, might be faster than the same work done by several atomic ops. Making this work would require manual or automatic inlining, so it's all one short block of ASM.
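
    One caveat with a plain empty/full bit under multiple producers and consumers: a bit left set from the previous lap around the ring is indistinguishable from the current lap's, so a reader can pick up a stale value. The usual fix is to widen the bit into a per-slot sequence counter, as in Dmitry Vyukov's bounded MPMC queue design. Here is a minimal sketch along those lines (my own illustration of the scheme, not the answerer's code; names and sizes are arbitrary):

    // Sketch of the CAS-claim scheme above, with the empty/full bit
    // widened into a per-slot sequence counter (Vyukov-style bounded
    // MPMC queue) so laps around the ring cannot be confused.
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    template <typename T, size_t N>          // N must be a power of two
    class cas_ring
    {
        static_assert((N & (N - 1)) == 0, "N must be a power of two");

        struct slot
        {
            std::atomic<size_t> seq;         // generalised empty/full bit
            T value;
        };

    public:
        cas_ring()
        {
            for (size_t i = 0; i < N; ++i)
                slots[i].seq.store(i, std::memory_order_relaxed);
        }

        bool try_push(const T& v)
        {
            size_t pos = tail.load(std::memory_order_relaxed);
            for (;;)
            {
                slot& s = slots[pos & (N - 1)];
                size_t seq = s.seq.load(std::memory_order_acquire);
                intptr_t diff = (intptr_t)seq - (intptr_t)pos;
                if (diff == 0)               // slot is empty for this lap
                {
                    // CAS with increment claims the slot; index wraps via masking
                    if (tail.compare_exchange_weak(pos, pos + 1,
                                                   std::memory_order_relaxed))
                        break;               // pos is now ours
                }
                else if (diff < 0)
                    return false;            // queue full
                else
                    pos = tail.load(std::memory_order_relaxed);
            }
            slot& s = slots[pos & (N - 1)];
            s.value = v;
            s.seq.store(pos + 1, std::memory_order_release);   // set "full"
            return true;
        }

        bool try_pop(T& out)
        {
            size_t pos = head.load(std::memory_order_relaxed);
            for (;;)
            {
                slot& s = slots[pos & (N - 1)];
                size_t seq = s.seq.load(std::memory_order_acquire);
                intptr_t diff = (intptr_t)seq - (intptr_t)(pos + 1);
                if (diff == 0)               // slot is filled for this lap
                {
                    if (head.compare_exchange_weak(pos, pos + 1,
                                                   std::memory_order_relaxed))
                        break;
                }
                else if (diff < 0)
                    return false;            // queue empty
                else
                    pos = head.load(std::memory_order_relaxed);
            }
            slot& s = slots[pos & (N - 1)];
            out = s.value;
            s.seq.store(pos + N, std::memory_order_release);   // clear for next lap
            return true;
        }

    private:
        slot slots[N];
        std::atomic<size_t> head{0};
        std::atomic<size_t> tail{0};
    };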
  • 2020-11-29 15:55

    This is an old thread, but since it hasn't been mentioned yet: there is a lock-free, circular, single-producer/single-consumer FIFO available in the JUCE C++ framework.

    https://www.juce.com/doc/classAbstractFifo#details
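
    For reference, here is a usage sketch (my illustration, based on the linked documentation; check it for the exact current API and header path). AbstractFifo only manages the read/write indices; you provide the storage, and an operation may come back split into two contiguous regions when it wraps around:

    #include <juce_core/juce_core.h>   // header path depends on your JUCE setup

    juce::AbstractFifo fifo { 1024 };  // index bookkeeping only, no storage
    float storage[1024];               // the actual ring storage

    // Producer side: a wrapped write comes back as two regions.
    void write_samples (const float* src, int num)
    {
        int start1, size1, start2, size2;
        fifo.prepareToWrite (num, start1, size1, start2, size2);
        for (int i = 0; i < size1; ++i) storage[start1 + i] = src[i];
        for (int i = 0; i < size2; ++i) storage[start2 + i] = src[size1 + i];
        fifo.finishedWrite (size1 + size2);
    }

    The consumer side mirrors this with prepareToRead/finishedRead.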

  • 2020-11-29 15:58

    I am no expert in hardware memory models or lock-free data structures, so I tend to avoid them in my projects and go with traditional locked data structures instead.

    However, I recently came across this video: Lockless SPSC queue based on ring buffer.

    It is based on an open-source, high-performance Java library called the LMAX Disruptor, used by a trading system: LMAX Disruptor.

    Based on the presentation above, you make the head and tail pointers atomic and atomically check for the condition where the head catches the tail from behind, or vice versa.

    Below is a very basic C++11 implementation of it:

    // USING SEQUENTIAL MEMORY: all loads and stores below use the default
    // std::memory_order_seq_cst; acquire/release ordering would suffice and
    // be cheaper, but sequential consistency keeps the example simple.
    #include <thread>
    #include <atomic>
    #include <cinttypes>
    using namespace std;
    
    #define RING_BUFFER_SIZE 1024  // power of 2 for an efficient %
    class lockless_ring_buffer_spsc
    {
        public :
    
            lockless_ring_buffer_spsc()
            {
                write.store(0);
                read.store(0);
            }
    
            bool try_push(int64_t val)
            {
                const auto current_tail = write.load();
                const auto next_tail = increment(current_tail);
                // one slot is kept unused so that full and empty are distinguishable
                if (next_tail != read.load())
                {
                    buffer[current_tail] = val;
                    write.store(next_tail);  // publish only after the value is in place
                    return true;
                }
    
                return false;  // buffer full
            }
    
            void push(int64_t val)
            {
                while( ! try_push(val) );
                // TODO: exponential backoff / sleep
            }
    
            bool try_pop(int64_t* pval)
            {
                auto currentHead = read.load();
    
                if (currentHead == write.load())
                {
                    return false;  // buffer empty
                }
    
                *pval = buffer[currentHead];
                read.store(increment(currentHead));
    
                return true;
            }
    
            int64_t pop()
            {
                int64_t ret;
                while( ! try_pop(&ret) );
                // TODO: exponential backoff / sleep
                return ret;
            }
    
        private :
            std::atomic<int64_t> write;
            std::atomic<int64_t> read;
            static const int64_t size = RING_BUFFER_SIZE;
            int64_t buffer[RING_BUFFER_SIZE];
    
            int64_t increment(int64_t n)  // was int: avoid narrowing the 64-bit index
            {
                return (n + 1) % size;
            }
    };
    
    int main(int argc, char** argv)
    {
        lockless_ring_buffer_spsc queue;
    
        std::thread write_thread( [&] () {
            for(int i = 0; i < 1000000; i++)
            {
                queue.push(i);
            }
        } );
    
        std::thread read_thread( [&] () {
            for(int i = 0; i < 1000000; i++)
            {
                queue.pop();
            }
        } );
    
        write_thread.join();
        read_thread.join();
    
        return 0;
    }
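
    One refinement worth mentioning (my addition, not from the video): the Disruptor goes to some length to avoid false sharing, and the same concern applies here. Since read and write are declared next to each other, they likely share a cache line, so every update by one thread invalidates the other thread's cached copy. Padding them onto separate cache lines helps:

    // Assumes 64-byte cache lines; the alignas value is illustrative.
    alignas(64) std::atomic<int64_t> write;
    alignas(64) std::atomic<int64_t> read;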
    
  • 2020-11-29 15:58

    Some time ago, I found a nice solution to this problem. I believe it is the smallest one found so far.

    The repository has an example of how to use it to create N threads (readers and writers) and have them share a single seat.

    I ran some benchmarks on the test example and got the following results (in millions of ops/sec):

    [chart: throughput by buffer size]

    [chart: throughput by number of threads]

    Notice how the number of threads does not change the throughput.

    I think this is the ultimate solution to this problem. It works and is incredibly fast and simple, even with hundreds of threads and a queue of a single position. It can be used as a pipeline between threads, allocating space inside the queue.

    The repository has some early versions written in C# and Pascal. I'm working on making it more complete and polished to show its real power.

    I hope some of you can validate the work or help with some ideas. Or at least, can you break it?
