Is a memory barrier or an atomic operation required in a busy-wait loop?

渐次进展 2021-02-12 10:52

Consider the following spin_lock() implementation, originally from this answer:

void spin_lock(volatile bool* lock)  {  
    for (;;) {
        // i         


        
3 Answers
  •  温柔的废话
    2021-02-12 11:40

    From the Wikipedia page on memory barriers:

    ... Other architectures, such as the Itanium, provide separate "acquire" and "release" memory barriers which address the visibility of read-after-write operations from the point of view of a reader (sink) or writer (source) respectively.

    To me this implies that Itanium requires a suitable fence to make reads/writes visible to other processors, but this may in fact just be for purposes of ordering. The question, I think, really boils down to:

    Does there exist an architecture where a processor might never update its local cache unless instructed to do so?

    I don't know the answer, but if you posed the question in this form then someone else might. On such an architecture your code could spin forever, with the read of *lock always seeing the same value.

    In terms of general C++ legality, the single atomic test-and-set in your example isn't enough. It provides only one fence, which lets you see the initial state of *lock on entering the while loop but not when it later changes. Reading a variable that another thread modifies without synchronisation is a data race and therefore undefined behavior, so the answer to your question (1.1/3) is no.

    On the other hand, in practice the answer to (1.2/2) is yes (given GCC's volatile semantics), as long as the architecture guarantees cache coherence without explicit memory fences. That is true of x86 and probably of many other architectures, but I can't say definitively whether it holds for every architecture that GCC supports. It is, however, generally unwise to knowingly rely on behavior that is technically undefined according to the language spec, especially when the same result can be had without doing so.

    Incidentally, given that memory_order_relaxed exists, there seems little reason not to use it here rather than trying to hand-optimise with non-atomic reads, i.e. changing the while loop in your example to:

        while (atomic_load_explicit(lock, memory_order_relaxed)) {
            cpu_relax();
        }
    

    On x86_64 for instance the atomic load becomes a regular mov instruction and the optimised assembly output is essentially the same as it would be for your original example.
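
    For completeness, here is a minimal sketch of how the whole lock might look with C11 atomics throughout. The question's original code is cut off above, so the acquire step (atomic_exchange_explicit with acquire ordering), the spin_unlock() helper and the body of cpu_relax() are my assumptions, not a reconstruction of the original:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for cpu_relax(): a PAUSE hint on x86 when built
   with GCC or Clang, otherwise a no-op. */
static inline void cpu_relax(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_ia32_pause();
#endif
}

/* Test-and-test-and-set: one atomic exchange to try to take the lock,
   then relaxed atomic loads while waiting for it to look free. */
static void spin_lock(atomic_bool *lock) {
    for (;;) {
        /* The exchange returns the previous value; false means the lock
           was free and is now ours. Acquire ordering is an assumption. */
        if (!atomic_exchange_explicit(lock, true, memory_order_acquire))
            return;
        /* Relaxed loads are still atomic, so this read is not a data
           race even though another thread writes *lock. */
        while (atomic_load_explicit(lock, memory_order_relaxed))
            cpu_relax();
    }
}

/* Assumed counterpart: a release store publishes the critical section. */
static void spin_unlock(atomic_bool *lock) {
    /* Debug check: unlocking a lock that is not held is a caller bug. */
    assert(atomic_load_explicit(lock, memory_order_relaxed));
    atomic_store_explicit(lock, false, memory_order_release);
}
```

    A caller would declare an atomic_bool, set it to false with atomic_init(), and wrap the critical section in spin_lock()/spin_unlock(). The point of the sketch is only that the waiting loop never needs a non-atomic read.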
