C++ Low-Latency Threaded Asynchronous Buffered Stream (intended for logging) – Boost

Posted by 泄露秘密 on 2019-12-04 07:27:13

General solution.

I think you should look at Nagle's algorithm. For one producer and one consumer it would work like this:

  • At the beginning the buffer is empty, and the worker thread is idle and waiting for events.
  • The producer writes data to the buffer and notifies the worker thread.
  • The worker thread wakes up and starts the write operation.
  • The producer tries to write another message, but the buffer is in use by the worker, so the producer allocates another buffer and writes the message to it.
  • The producer tries to write another message; I/O is still in progress, so the producer appends the message to the previously allocated buffer.
  • The worker thread finishes writing the first buffer to the file, sees that there is another buffer with data, grabs it and starts writing it.
  • The very first buffer is now used by the producer for all subsequent messages while the second write operation is in progress.

This scheme helps meet the low-latency requirement: a single message is written to disk immediately, while a large number of events is written in large batches for greater throughput.
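The buffer-swapping scheme above can be sketched as follows. This is a minimal illustration, not the author's implementation. The class name BatchingWriter is hypothetical, both buffers are created up front rather than allocated on demand, and a mutex protects the short append (an assumption made to keep the sketch small).

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical sketch: producers append to front_, the worker writes back_.
class BatchingWriter {
    std::ostream& sink_;
    std::string front_;            // filled by producers
    std::string back_;             // being written by the worker
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::thread worker_;           // declared last so it starts after the rest

    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return stop_ || !front_.empty(); });
            if (front_.empty()) break;   // stop requested and nothing left
            front_.swap(back_);          // take everything queued so far
            lock.unlock();               // producers keep appending during I/O
            sink_.write(back_.data(), static_cast<std::streamsize>(back_.size()));
            sink_.flush();
            back_.clear();               // capacity is kept for reuse
            lock.lock();
        }
    }

public:
    explicit BatchingWriter(std::ostream& sink)
        : sink_(sink), worker_([this] { run(); }) {}

    ~BatchingWriter() {
        { std::lock_guard<std::mutex> g(m_); stop_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    void write(const std::string& msg) {
        { std::lock_guard<std::mutex> g(m_); front_ += msg; }
        cv_.notify_one();
    }
};
```

A lone message is picked up as soon as the worker wakes. Messages that arrive while a write is in progress accumulate in front_ and go out as one batch.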

If your log messages have levels, you can improve this scheme a little. Error messages have high priority and must be saved to disk immediately, because they are rare but very valuable. Debug and trace messages have low priority and can be buffered to save bandwidth, because they are very frequent but less valuable than error and info messages. So when you write an error message, you must wait until the worker thread has finished writing it (along with all other messages in the same buffer) before continuing, whereas debug and trace messages can simply be written to the buffer.
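One way to sketch the level-based rule is to count messages and let an error write block until the worker reports that its message has been flushed. The names LeveledWriter and Level are hypothetical, and the mutex-based buffering is an assumption chosen for brevity.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>

enum class Level { Trace, Debug, Info, Error };   // hypothetical levels

class LeveledWriter {
    std::ostream& sink_;
    std::string front_, back_;
    std::uint64_t queued_ = 0;     // messages appended so far
    std::uint64_t flushed_ = 0;    // messages already written to the sink
    std::mutex m_;
    std::condition_variable work_cv_, done_cv_;
    bool stop_ = false;
    std::thread worker_;

    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            work_cv_.wait(lock, [this] { return stop_ || !front_.empty(); });
            if (front_.empty()) break;
            front_.swap(back_);
            const std::uint64_t batch_end = queued_;  // last message in batch
            lock.unlock();
            sink_ << back_ << std::flush;
            back_.clear();
            lock.lock();
            flushed_ = batch_end;
            done_cv_.notify_all();                    // release waiting writers
        }
    }

public:
    explicit LeveledWriter(std::ostream& sink)
        : sink_(sink), worker_([this] { run(); }) {}

    ~LeveledWriter() {
        { std::lock_guard<std::mutex> g(m_); stop_ = true; }
        work_cv_.notify_one();
        worker_.join();
    }

    void write(Level level, const std::string& msg) {
        std::unique_lock<std::mutex> lock(m_);
        front_ += msg;
        const std::uint64_t my_seq = ++queued_;
        work_cv_.notify_one();
        if (level == Level::Error)   // only errors pay the flush latency
            done_cv_.wait(lock, [&] { return flushed_ >= my_seq; });
    }
};
```

With this sketch, a debug message returns right after the append, while an error message returns only after its batch has reached the sink.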

Threading.

Spawning a worker thread for each application thread is too costly. Use a single writer thread per log file. Write buffers must be shared between threads. Each buffer needs two pointers, commit_pointer and prepare_pointer. The space between the beginning of the buffer and commit_pointer is available to the worker thread. The space between commit_pointer and prepare_pointer is currently being updated by application threads. Invariant: commit_pointer <= prepare_pointer.

Write operations can be performed in two steps.

  1. Prepare write. This operation reserves space in a buffer.
    • Producer calculates len(message) and atomically advances prepare_pointer by len;
    • The old prepare_pointer value and len are saved by the producer in local variables;
  2. Commit write.
    • Producer writes the message at the beginning of the reserved buffer space (the old prepare_pointer value).
    • Producer busy-waits until commit_pointer equals the old prepare_pointer value saved in its local variable.
    • Producer commits the write operation by atomically doing commit_pointer = commit_pointer + len.

To prevent false sharing, len(message) can be rounded up to a multiple of the cache line size, and the extra space can be filled with spaces.
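The rounding and padding can be sketched like this. The 64-byte line size is an assumption (common on x86-64, not universal), and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kCacheLine = 64;   // assumption: 64-byte cache lines

// Rounds len up to the next multiple of the cache line size.
constexpr std::size_t round_up_to_cache_line(std::size_t len) {
    return (len + kCacheLine - 1) / kCacheLine * kCacheLine;
}

// Copies msg into slot and fills the rest of the reserved space with spaces.
// Returns the number of bytes the message occupies in the buffer.
inline std::size_t write_padded(char* slot, const char* msg, std::size_t len) {
    const std::size_t padded = round_up_to_cache_line(len);
    std::memcpy(slot, msg, len);
    std::memset(slot + len, ' ', padded - len);
    return padded;
}
```

Padding only helps if the buffer itself starts on a cache line boundary, so that two reservations never share a line.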

// pseudocode; CAS(addr, expected, desired) returns the value previously stored at addr
void write(const char* message) {
    int len = strlen(message);  // TODO: round to cache line size
    char* old_prepare_ptr;
    // Prepare step: reserve len bytes starting at old_prepare_ptr
    while(1) 
    {
        old_prepare_ptr = prepare_ptr;
        if (
            CAS(&prepare_ptr, 
                 old_prepare_ptr, 
                 old_prepare_ptr + len) == old_prepare_ptr
            )
            break;
        // retry if another thread performed a prepare op.
    }
    // Write message into the reserved space
    memcpy(old_prepare_ptr, message, len);
    // Commit step: succeeds only when commit_ptr has reached our reservation,
    // i.e. all earlier reservations have been committed
    while(1)
    {
        if (
             CAS(&commit_ptr, 
                  old_prepare_ptr, 
                  old_prepare_ptr + len) == old_prepare_ptr
            )
            break;
        // busy-wait while earlier writers are still committing
    }
    notify_worker_thread();
}
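The two-step protocol described above can also be written with C++11 atomics. The following is a sketch under stated assumptions: the class name ReserveCommitBuffer is hypothetical, offsets replace raw pointers, the buffer has a fixed capacity and is never recycled, and the default sequentially consistent ordering is used for simplicity.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <thread>
#include <vector>

template <std::size_t Capacity>
class ReserveCommitBuffer {
    char data_[Capacity];
    std::atomic<std::size_t> prepare_{0};  // end of reserved space
    std::atomic<std::size_t> commit_{0};   // end of space visible to the worker

public:
    // Returns false when the message does not fit into the remaining space.
    bool write(const char* msg, std::size_t len) {
        // Prepare: reserve [start, start + len).
        std::size_t start = prepare_.load();
        do {
            if (len > Capacity - start) return false;
        } while (!prepare_.compare_exchange_weak(start, start + len));

        std::memcpy(data_ + start, msg, len);

        // Commit: wait until all earlier reservations are committed,
        // then publish this one.
        std::size_t expected = start;
        while (!commit_.compare_exchange_weak(expected, start + len)) {
            expected = start;            // the failed CAS overwrote it
            std::this_thread::yield();
        }
        return true;
    }

    // Bytes [0, committed()) are complete and safe for the worker to read.
    std::size_t committed() const { return commit_.load(); }
    const char* data() const { return data_; }
};
```

Because a commit only succeeds when commit_ equals the start of the writer's own reservation, the worker never sees a region that is still being filled.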
Evgeny Panasyuk

concurrent_queue<T, Size>

The question that I have is how to make the worker thread go to work as soon as there is work to do and sleep when there is no work.

There is boost::lockfree::spsc_queue - wait-free single-producer single-consumer queue. It can be configured to have compile-time capacity (the size of the internal ringbuffer).

From what I understand, you want something similar to the following configuration:

template<typename T, size_t N>
class concurrent_queue
{
    // T can be wrapped into struct with padding in order to avoid false sharing
    mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
    mutable mutex m;
    mutable condition_variable c;

    void wait() const
    {
        unique_lock<mutex> u(m);
        c.wait_for(u, chrono::microseconds(1)); // Or whatever period you need.
        // Timeout is required, because modification happens not under mutex
        //     and notification can be lost.
        // Another option is just to use sleep/yield, without notifications.
    }
    void notify() const
    {
        c.notify_one();
    }
public:
    void push(const T &t)
    {
        while(!q.push(t))
            wait();
        notify();
    }
    void pop(T &result)
    {
        while(!q.pop(result))
            wait();
        notify();
    }
};

When there are elements in the queue, pop does not block. And when there is enough space in the internal buffer, push does not block.
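To illustrate why neither operation needs a lock, here is a minimal single-producer single-consumer ring buffer using only the standard library. It is a general sketch of the technique, not Boost's implementation. The name SpscRing is hypothetical, and it assumes exactly one pushing thread, one popping thread, and a default-constructible, copyable T.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename T, std::size_t N>
class SpscRing {
    T slots_[N + 1];                     // one slot stays empty: full != empty
    std::atomic<std::size_t> head_{0};   // next slot to pop, written by consumer
    std::atomic<std::size_t> tail_{0};   // next slot to push, written by producer

    static std::size_t next(std::size_t i) { return (i + 1) % (N + 1); }

public:
    // Producer side: returns false instead of blocking when the ring is full.
    bool push(const T& value) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (next(t) == head_.load(std::memory_order_acquire)) return false;
        slots_[t] = value;
        tail_.store(next(t), std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side: returns false instead of blocking when the ring is empty.
    bool pop(T& result) {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        result = slots_[h];
        head_.store(next(h), std::memory_order_release);  // free the slot
        return true;
    }
};
```

Each index is written by only one thread, so a bounded number of steps is enough for either call. The blocking in concurrent_queue comes only from the wait() wrapper around it.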


concurrent<T>

I want to reduce both formatting and write times as much as possible so I plan to reduce both.

Check out Herb Sutter's talk at C++ and Beyond 2012: C++ Concurrency. On page 14 of the slides he shows an example of concurrent<T>. Basically, it is a wrapper around an object of type T that starts a separate thread to perform all operations on that object. Usage is:

concurrent<ostream*> x(&cout); // starts thread internally
// ...
// x acts as a function object.
// Its function call operator accepts an action
//   which is performed on the wrapped object in a separate thread.
int i = 42;
x([i](ostream *out){ *out << "i=" << i; }); // passing lambda as action

You can use a similar pattern to offload all formatting work to the consumer thread.
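A simplified wrapper in the spirit of that pattern might look like the sketch below. It is not a transcription of the slides. The name concurrent_sketch is hypothetical, and a plain mutex-protected std::deque of std::function stands in for whatever queue the talk uses.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <sstream>
#include <thread>

template <typename T>
class concurrent_sketch {
    T obj_;                                   // the wrapped object
    std::deque<std::function<void()>> jobs_;  // pending actions
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;                      // declared last, started last

public:
    explicit concurrent_sketch(T obj)
        : obj_(obj), worker_([this] {
              for (;;) {
                  std::function<void()> job;
                  {
                      std::unique_lock<std::mutex> lock(m_);
                      cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                      if (jobs_.empty()) return;   // done and drained
                      job = std::move(jobs_.front());
                      jobs_.pop_front();
                  }
                  job();   // runs on the worker thread, outside the lock
              }
          }) {}

    ~concurrent_sketch() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    // Queues an action; it is later called with the wrapped object.
    template <typename F>
    void operator()(F f) {
        {
            std::lock_guard<std::mutex> g(m_);
            jobs_.emplace_back([this, f]() mutable { f(obj_); });
        }
        cv_.notify_one();
    }
};
```

The caller only pays for copying the captured values into the queue. The stream insertion, and therefore the formatting, happens on the worker thread.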


Small Object Optimization

Otherwise, new buffers are allocated and I want to avoid memory allocation after the buffer stream is constructed.

The concurrent_queue<T, Size> example above uses a fixed-size buffer that is fully contained within the queue, so it does not require additional allocations.

However, Herb's concurrent<T> example uses std::function to pass the action to the worker thread. That may incur a costly allocation.

std::function implementations may use Small Object Optimization (and most do): small function objects are copy-constructed in place in an internal buffer. There is no guarantee, though, and for function objects larger than the threshold a heap allocation happens.

There are several options to avoid this allocation:

  1. Implement a std::function analog with an internal buffer large enough to hold the target function objects (for example, you can try to modify boost::function or this version).

  2. Use your own function object that represents all types of log messages. Basically, it would contain just the values required to format the message. As there are potentially different types of messages, consider using boost::variant (which is literally a union coupled with a type tag) to represent them.
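For the first option, a std::function analog with fixed in-place storage could be sketched as follows. The name inplace_function and its interface are hypothetical (it is neither a standard nor a Boost type), and the sketch assumes copyable callables and ignores exception safety in assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template <typename Signature, std::size_t Capacity>
class inplace_function;   // primary template intentionally left undefined

template <typename R, typename... Args, std::size_t Capacity>
class inplace_function<R(Args...), Capacity> {
    alignas(std::max_align_t) unsigned char buf_[Capacity];  // holds the callable
    R (*invoke_)(void*, Args...) = nullptr;
    void (*destroy_)(void*) = nullptr;
    void (*copy_)(void*, const void*) = nullptr;

    template <typename F> static R call_impl(void* p, Args... args) {
        return (*static_cast<F*>(p))(std::forward<Args>(args)...);
    }
    template <typename F> static void destroy_impl(void* p) {
        static_cast<F*>(p)->~F();
    }
    template <typename F> static void copy_impl(void* dst, const void* src) {
        new (dst) F(*static_cast<const F*>(src));
    }

public:
    inplace_function() = default;

    // Stores f inside buf_; callables that do not fit fail to compile.
    template <typename F, typename = typename std::enable_if<
        !std::is_same<typename std::decay<F>::type, inplace_function>::value>::type>
    inplace_function(F f)
        : invoke_(&call_impl<F>), destroy_(&destroy_impl<F>), copy_(&copy_impl<F>) {
        static_assert(sizeof(F) <= Capacity, "callable too large for the buffer");
        static_assert(alignof(F) <= alignof(std::max_align_t), "over-aligned callable");
        new (buf_) F(std::move(f));
    }

    inplace_function(const inplace_function& o)
        : invoke_(o.invoke_), destroy_(o.destroy_), copy_(o.copy_) {
        if (copy_) copy_(buf_, o.buf_);
    }

    inplace_function& operator=(const inplace_function& o) {
        if (this != &o) {
            if (destroy_) destroy_(buf_);
            invoke_ = o.invoke_; destroy_ = o.destroy_; copy_ = o.copy_;
            if (copy_) copy_(buf_, o.buf_);
        }
        return *this;
    }

    ~inplace_function() { if (destroy_) destroy_(buf_); }

    explicit operator bool() const { return invoke_ != nullptr; }

    R operator()(Args... args) {
        return invoke_(buf_, std::forward<Args>(args)...);
    }
};
```

Since the size check is a static_assert, an oversized lambda is rejected at compile time instead of silently falling back to the heap.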

Putting it all together, here is a proof of concept (using the second option):

LIVE DEMO

#include <boost/lockfree/spsc_queue.hpp>
#include <boost/optional.hpp>
#include <boost/variant.hpp>

#include <condition_variable>
#include <iostream>
#include <cstddef>
#include <thread>
#include <chrono>
#include <mutex>

using namespace std;

/*********************************************/
template<typename T, size_t N>
class concurrent_queue
{
    mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
    mutable mutex m;
    mutable condition_variable c;

    void wait() const
    {
        unique_lock<mutex> u(m);
        c.wait_for(u, chrono::microseconds(1));
    }
    void notify() const
    {
        c.notify_one();
    }
public:
    void push(const T &t)
    {
        while(!q.push(t))
            wait();
        notify();
    }
    void pop(T &result)
    {
        while(!q.pop(result))
            wait();
        notify();
    }
};

/*********************************************/
template<typename T, typename F>
class concurrent
{
    typedef boost::optional<F> Job;

    mutable concurrent_queue<Job, 16> q; // use custom size
    mutable T x;
    thread worker;

public:
    concurrent(T x)
        : x{x}, worker{[this]
        {
            Job j;
            while(true)
            {
                q.pop(j);
                if(!j) break;
                (*j)(this->x); // you may need to handle exceptions in some way
            }
        }}
    {}
    void operator()(const F &f)
    {
        q.push(Job{f});
    }
    ~concurrent()
    {
        q.push(Job{});
        worker.join();
    }
};

/*********************************************/
struct LogEntry
{
    struct Formatter
    {
        typedef void result_type;
        ostream *out;

        void operator()(double x) const
        {
            *out << "floating point: " << x << endl;
        }
        void operator()(int x) const
        {
            *out << "integer: " << x << endl;
        }
    };
    boost::variant<int, double> data;

    void operator()(ostream *out)
    {
        boost::apply_visitor(Formatter{out}, data);
    }
};

/*********************************************/
int main()
{
    concurrent<ostream*, LogEntry> log{&cout};

    for(int i=0; i!=1024; ++i)
    {
        log({i});
        log({i/10.});
    }
}