I'm in the process of designing a system which connects to one or more streams of data feeds, does some analysis on the data, and then triggers events based on the result.
I've made a particular study of lock-free data structures in the last couple of years. I've read most of the papers in the field (there are only about forty or so - although only about ten or fifteen are of any real use :-)
AFAIK, a lock-free circular buffer has not been invented. The problem is dealing with the complex condition where a reader overtakes a writer, or vice versa.
If you have not spent at least six months studying lock-free data structures, do not attempt to write one yourself. You will get it wrong, and it may not be obvious to you that errors exist until your code fails, after deployment, on new platforms.
I believe however there is a solution to your requirement.
You should pair a lock-free queue with a lock-free free-list.
The free-list gives you pre-allocation and so obviates the (expensive) requirement for a lock-free allocator; when the free-list is empty, you replicate the behaviour of a circular buffer by immediately dequeuing an element from the queue and using that instead.
(Of course, in a lock-based circular buffer, obtaining an element once the lock is held is very quick - basically just a pointer dereference - but you won't get that in any lock-free algorithm; they often have to go well out of their way to do things. The overhead of a failed free-list pop followed by a dequeue is on a par with the amount of work any lock-free algorithm has to do anyway.)
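To make the pairing concrete, here is a minimal sketch of the producer-side allocation path. The names (Node, free_list_pop, queue_dequeue) are illustrative, not from any particular library; a sketch of the free-list itself follows a little further down.

struct Node;               // whatever element type the queue carries
Node *free_list_pop();     // assumed lock-free free-list pop (returns nullptr when empty)
Node *queue_dequeue();     // assumed lock-free queue dequeue

Node *acquire_node() {
    Node *n = free_list_pop();    // try the pre-allocated free-list first
    if (n == nullptr)
        n = queue_dequeue();      // free-list empty: recycle the oldest queued element
    return n;                     // caller fills in new data and enqueues it
}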
Michael and Scott developed a really good lock-free queue back in 1996: "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms" (Michael and Scott, PODC 1996). Searching for that title will turn up the PDF of their paper.
A lock-free free-list is the simplest lock-free algorithm and in fact I don't think I've seen an actual paper for it.
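For illustration, here is a minimal free-list sketch (a Treiber stack: push and pop list nodes with compare-and-swap). Note that the naive pop below is ABA-prone; production versions use a counted/tagged pointer or hazard pointers.

#include <atomic>

struct Node { Node *next; /* payload lives here */ };

std::atomic<Node *> free_list_head{nullptr};

void free_list_push(Node *n) {
    n->next = free_list_head.load(std::memory_order_relaxed);
    // On CAS failure, n->next is reloaded with the current head; just retry.
    while (!free_list_head.compare_exchange_weak(
               n->next, n, std::memory_order_release, std::memory_order_relaxed))
        ;
}

Node *free_list_pop() {
    Node *n = free_list_head.load(std::memory_order_acquire);
    // On CAS failure, n is reloaded with the current head; retry until empty or won.
    while (n != nullptr &&
           !free_list_head.compare_exchange_weak(
               n, n->next, std::memory_order_acquire, std::memory_order_acquire))
        ;
    return n;  // nullptr means the free-list is empty (ABA caveat: see above)
}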
If you take as a prerequisite that the buffer will never become full, consider using this lock-free algorithm (shown here as a C++ sketch):

#include <atomic>
#include <cstddef>

template <typename T>
struct LockfreeRing {
    // capacity must be a power of 2, and the buffer must never become full
    std::atomic<T *> *buffer;                  // allocate on its own cache line
    std::size_t mask;                          // capacity - 1
    std::atomic<std::size_t> write_index{0};   // keep on its own cache line
    std::atomic<std::size_t> read_index{0};    // keep on its own cache line

    explicit LockfreeRing(std::size_t capacity)
        : buffer(new std::atomic<T *>[capacity]()),  // slots start out null
          mask(capacity - 1) {}

    void enqueue(T *element) {
        std::size_t write_i = write_index.fetch_add(1, std::memory_order_relaxed) & mask;
        buffer[write_i].store(element, std::memory_order_release);  // release store
    }

    T *dequeue() {
        std::size_t read_i = read_index.fetch_add(1, std::memory_order_relaxed) & mask;
        T *element;
        while ((element = buffer[read_i].load(std::memory_order_acquire)) == nullptr)
            ;  // spin until the matching producer has published this slot
        buffer[read_i].store(nullptr, std::memory_order_relaxed);   // relaxed store
        return element;
    }
};
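The power-of-two capacity is what makes the masking trick safe: both indices just count up forever and wrap modulo 2^64, and since the capacity divides 2^64, the & mask step keeps mapping each counter value to the right slot across wrap-around. Note also how the null check in dequeue handles a reader overtaking a writer: the consumer claims a slot first, then spins until the matching producer has actually published it.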
The implementation in the boost library is worth considering. It's easy to use and fairly high-performance. I wrote a test and ran it on a quad-core i7 laptop (8 threads) and got ~4M enqueue/dequeue operations a second. Another implementation not mentioned so far is the MPMC queue at http://moodycamel.com/blog/2014/detailed-design-of-a-lock-free-queue. I have done some simple testing with this implementation on the same laptop with 32 producers and 32 consumers. It is, as advertised, faster than the boost lockless queue.
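For reference, basic use of the boost queue looks roughly like this (the capacity, element type, and process() are arbitrary placeholders):

#include <boost/lockfree/queue.hpp>

void process(int value);              // placeholder consumer work

boost::lockfree::queue<int> q(1024);  // bounded MPMC queue, capacity 1024

void producer() {
    for (int i = 0; i < 1000; ++i)
        while (!q.push(i))            // push returns false when the queue is full
            ;                         // spin and retry
}

void consumer() {
    int value;
    while (q.pop(value))              // pop returns false when the queue is empty
        process(value);
}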
As most of the other answers state, lockless programming is hard. Most implementations will have hard-to-detect corner cases that take a lot of testing and debugging to fix. These are typically fixed with careful placement of memory barriers in the code. You will also find proofs of correctness published in many of the academic articles. I prefer testing these implementations with a brute-force tool. Any lockless algorithm you plan on using in production should be checked for correctness using a tool like TLA+ (http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html).
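As a rough illustration of brute-force testing (not the exhaustive model checking TLA+ does), a harness along these lines hammers the queue from many threads and asserts that every pushed item is popped exactly once; all parameters are arbitrary:

#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#include <boost/lockfree/queue.hpp>

int main() {
    boost::lockfree::queue<int> q(1024);
    const int kPairs = 8, kItems = 100000;
    std::atomic<long long> popped_sum{0};

    std::vector<std::thread> threads;
    for (int t = 0; t < kPairs; ++t) {
        threads.emplace_back([&] {                 // producer: push 1..kItems
            for (int i = 1; i <= kItems; ++i)
                while (!q.push(i)) {}
        });
        threads.emplace_back([&] {                 // consumer: pop exactly kItems
            int v;
            for (int n = 0; n < kItems; )
                if (q.pop(v)) { popped_sum += v; ++n; }
        });
    }
    for (auto &th : threads) th.join();

    // Conservation check: total popped equals total pushed.
    assert(popped_sum == (long long)kPairs * kItems * (kItems + 1) / 2);
    return 0;
}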
There are situations where you don't need locking to prevent race conditions, especially when you have only one producer and one consumer.
Consider this paragraph from LDD3:
When carefully implemented, a circular buffer requires no locking in the absence of multiple producers or consumers. The producer is the only thread that is allowed to modify the write index and the array location it points to. As long as the writer stores a new value into the buffer before updating the write index, the reader will always see a consistent view. The reader, in turn, is the only thread that can access the read index and the value it points to. With a bit of care to ensure that the two pointers do not overrun each other, the producer and the consumer can access the buffer concurrently with no race conditions.
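A minimal single-producer/single-consumer ring in that style might look like this sketch (power-of-two capacity assumed; the names are mine, not from LDD3):

#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>          // N must be a power of two
class SpscRing {
    T buf[N];
    std::atomic<std::size_t> write_index{0};  // modified only by the producer
    std::atomic<std::size_t> read_index{0};   // modified only by the consumer
public:
    bool push(const T &v) {                   // call from the producer thread only
        std::size_t w = write_index.load(std::memory_order_relaxed);
        if (w - read_index.load(std::memory_order_acquire) == N)
            return false;                     // full: the indices must not overrun
        buf[w & (N - 1)] = v;                 // store the value first...
        write_index.store(w + 1, std::memory_order_release);  // ...then publish it
        return true;
    }
    bool pop(T &v) {                          // call from the consumer thread only
        std::size_t r = read_index.load(std::memory_order_relaxed);
        if (write_index.load(std::memory_order_acquire) == r)
            return false;                     // empty
        v = buf[r & (N - 1)];
        read_index.store(r + 1, std::memory_order_release);
        return true;
    }
};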
One useful technique to reduce contention is to hash the items into multiple queues and have each consumer dedicated to a "topic".
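In sketch form (the queue type, capacity, hash, and handle() are all arbitrary choices here):

#include <cstddef>
#include <functional>
#include <boost/lockfree/queue.hpp>
#include <boost/lockfree/policies.hpp>

void handle(int item);  // placeholder consumer work

const std::size_t kTopics = 16;
using TopicQueue = boost::lockfree::queue<int, boost::lockfree::capacity<1024>>;
TopicQueue topic_queues[kTopics];               // one queue per "topic"

// Producer: route each item to the queue its key hashes to.
void publish(std::size_t key, int item) {
    std::size_t topic = std::hash<std::size_t>{}(key) % kTopics;
    while (!topic_queues[topic].push(item)) {}  // spin if that topic is full
}

// Each consumer owns exactly one topic, so consumers never contend with each other.
void consume(std::size_t topic) {
    int item;
    for (;;)
        if (topic_queues[topic].pop(item))
            handle(item);
}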
If your consumers only care about the N most recent items, you don't want to lock the whole queue and iterate over it to find an item to override - just publish items in N-tuples, i.e. all N recent items at once. Bonus points for an implementation where the producer blocks on the full queue (when the consumers can't keep up) with a timeout, updating its local tuple cache - that way you don't put back-pressure on the data source.
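One way to publish "all N recent items" as a unit is to swap in an immutable snapshot atomically. A minimal sketch, assuming a single producer and deliberately ignoring reclamation of old snapshots (real code needs hazard pointers, RCU, or similar):

#include <atomic>
#include <cstddef>

const std::size_t N = 8;                  // tuple size, chosen arbitrarily
struct Snapshot { double items[N]; };     // the N most recent items

std::atomic<Snapshot *> latest{nullptr};

// Producer: build the fresh tuple off to the side, then publish in one atomic swap.
void publish(const Snapshot &s) {
    Snapshot *fresh = new Snapshot(s);
    Snapshot *old = latest.exchange(fresh, std::memory_order_acq_rel);
    (void)old;  // reclamation omitted on purpose; freeing here would race with readers
}

// Consumer: a single atomic load yields a consistent N-tuple, with no locking.
const Snapshot *peek() {
    return latest.load(std::memory_order_acquire);
}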
There is a pretty good series of articles about this on DDJ. As a sign of how difficult this stuff can be, it's a correction to an earlier article that got it wrong. Make sure you understand the mistakes before you roll your own )-;