Difference between atomic exchange (without return value) and store? It's about Peterson's algorithm with the C++ atomic library

Submitted by 女生的网名这么多〃 on 2020-06-27 15:33:26

Question


std::atomic<int> flag0(0),flag1(0),turn(0);

void lock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(1, std::memory_order_relaxed);
        turn.exchange(1, std::memory_order_acq_rel);
        //turn.store(1)

        while (flag1.load(std::memory_order_acquire)
            && 1 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
    else
    {
        flag1.store(1, std::memory_order_relaxed);
        turn.exchange(0, std::memory_order_acq_rel);
        //turn.store(0)

        while (flag0.load(std::memory_order_acquire)
            && 0 == turn.load(std::memory_order_relaxed))
            std::this_thread::yield();
    }
}

void unlock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(0, std::memory_order_release);
    }
    else
    {
        flag1.store(0, std::memory_order_release);
    }
}

turn.exchange(0) with its return value discarded (i.e., used like a void-returning function) appears to work the same as turn.store(0).

Is there any reason for using 'exchange' method?

In this algorithm, the code doesn't need to save the previous value.


Answer 1:


The main difference is that on x86, exchange translates to a lock xchg instruction, which is sequentially consistent, even though you specified it as std::memory_order_acq_rel! If you were to use a store with std::memory_order_release, the internal store buffer would spoil your mutual exclusion guarantee (i.e., your lock would be broken). However, if you use a store with std::memory_order_seq_cst, many compilers will simply translate it to lock xchg as well, so you end up with the same machine code.

That said, you should NOT rely on the fact that exchange is implicitly sequentially consistent. Instead you should specify the C++ memory orders as required, to ensure your code behaves correctly with respect to the C++ standard.

UPDATE
There exist various definitions of sequential consistency that try to explain the same idea in different terms. Leslie Lamport described it as follows:

... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

The C++ standard provides the following definition:

There shall be a single total order S on all memory_order_seq_cst operations, consistent with the "happens before" order and modification orders for all affected locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values:

  • (3.1) the result of the last modification A of M that precedes B in S, if it exists, or
  • (3.2) if A exists, the result of some modification of M that is not memory_order_seq_cst and that does not happen before A, or
  • (3.3) if A does not exist, the result of some modification of M that is not memory_order_seq_cst.

Essentially what this means is that if the exchange and the load operations are both sequentially consistent, then they are strictly ordered in the total order S -- either the exchange is ordered before the load or vice versa. If the exchange is ordered before the load, then the load is guaranteed to see the value stored by the exchange (or some later value, if one exists). If you have a store that is not sequentially consistent, you do not have such a guarantee, i.e., in this case it could happen that both threads succeed in acquiring the lock, simply because they did not "see" the value stored by the other thread.

The x86 memory model is very strong, and every lock-prefixed instruction is sequentially consistent. That's why in many cases you don't even notice that your code does not enforce the necessary happens-before relations if you are running on an x86 CPU. But things wouldn't run as smoothly if you were to run it on ARM or Power.




Answer 2:


[Replacing a previous answer.]

The original question was "Is there any reason for using 'exchange' method?". And the short answer is: no, there is no good reason for it, in fact turn.store(1) is more correct.

But even with turn.store(1) I think what you have is almost entirely not valid C++.

So here is a longer answer...


In Search of a Correct Implementation of Peterson's Algorithm

Peterson's Algorithm will work if all loads/stores of flag0, flag1 and turn are memory_order_seq_cst, thus:

std::atomic<int> flag0(0),flag1(0),turn(0);

void lock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(1, std::memory_order_seq_cst) ;
        turn.store(1, std::memory_order_seq_cst) ;
        while (flag1.load(std::memory_order_seq_cst)
                          && (1 == turn.load(std::memory_order_seq_cst)))
            std::this_thread::yield();
    }
    else
    {
        flag1.store(1, std::memory_order_seq_cst) ;
        turn.store(0, std::memory_order_seq_cst) ;
        while (flag0.load(std::memory_order_seq_cst)
                          && (0 == turn.load(std::memory_order_seq_cst)))
            std::this_thread::yield();
    }
}

void unlock(unsigned index)
{
    if (0 == index)
      flag0.store(0, std::memory_order_seq_cst) ;
    else
      flag1.store(0, std::memory_order_seq_cst) ;
}

[Of course, std::memory_order_seq_cst is the default -- but it does no harm to be explicit... leaving aside the clutter.]

Peterson's Algorithm works provided that from the perspective of thread 1:

  1. flag0 = true happens-before turn = 1 (in lock()) in thread 0

  2. thread 1 reads the most recent value of turn written by it or thread 0

  3. turn = 1 happens-before flag0 = false (in unlock()) in thread 0

  4. flag0 = false (in unlock()) happens-before flag0 = true (in lock()) in thread 0

And vice-versa for thread 0. In short, (i) all the stores must inter-thread-happen-before each other, and (ii) the loads must read the most recent values written to shared memory.

These conditions are met if all these operations are _seq_cst.

Of course, _seq_cst is (generally) expensive. So the question is, can any of these operations be weakened ?

The Standard (as I understand it):

  1. gets very twitchy about mixing _seq_cst operations on a variable with any other memory-order operations on that variable -- as in "do that and you are on your own, sunshine".

    So, if one of the operations on any of flag0, flag1 or turn is _seq_cst, then all of the operations on that variable should be _seq_cst -- to stay within the Standard.

  2. says that all _seq_cst atomic operations on all variables appear to all threads to happen in the same order -- hence their use above.

    BUT says nothing about not-_seq_cst atomic operations (far less not-atomic ones) appearing in any particular order with respect to the _seq_cst operations.

    So, if turn, say, is loaded/stored _seq_cst but flag0 and flag1 are not, then the Standard does not specify the relative order of the stores of turn and flag0 as seen by thread 1, or of turn and flag1 as seen by thread 0.

[If I have not properly understood the Standard, someone please correct me !]

As far as I can tell, this means that all the operations on turn, flag0 and flag1 are required, by the Standard, to be _seq_cst...

...unless we use a _seq_cst fence instead.


A job for memory_order_seq_cst fence ?

Suppose we recast to use fences, thus:

void lock(unsigned index)
{
    if (0 == index)
    {
        flag0.store(1, std::memory_order_relaxed) ;
        std::atomic_thread_fence(std::memory_order_release) ;  // <<<<<<<<<<<< (A)
        turn.store(1, std::memory_order_relaxed) ;
        std::atomic_thread_fence(std::memory_order_seq_cst) ;  // <<<<<<<<<<<< (B)
        while (flag1.load(std::memory_order_relaxed)
                          && (1 == turn.load(std::memory_order_relaxed)))
            std::this_thread::yield() ;
    }
    else
    {
        flag1.store(1, std::memory_order_relaxed) ;
        std::atomic_thread_fence(std::memory_order_release) ;  // <<<<<<<<<<<< (A)
        turn.store(0, std::memory_order_relaxed) ;
        std::atomic_thread_fence(std::memory_order_seq_cst) ;  // <<<<<<<<<<<< (B)
        while (flag0.load(std::memory_order_relaxed)
                          && (0 == turn.load(std::memory_order_relaxed)))
            std::this_thread::yield() ;
    }
}

void unlock(unsigned index)
{
    if (0 == index)
      flag0.store(0, std::memory_order_relaxed) ;
    else
      flag1.store(0, std::memory_order_relaxed) ;
}

The _release fence (A) after the store of flagX means that it will be visible to the other thread before the store of turn. The _seq_cst fence (B) after the store of turn means (i) that it becomes visible to the other thread after flagX is set true and before flagX is set false, and (ii) that any load of turn which follows the fence in either thread will see the latest store of turn -- _seq_cst-wise.

The store of flagX in unlock() will happen-before the next store of flagX in lock() -- every atomic object has its own modification-order.

So, I believe this works, per the Standard, with the minimum of memory-order magic.

Is the _release fence (A) really required ? I believe the answer to that is yes -- that fence is required to ensure the inter-thread-happens-before ordering of the stores of flagX and turn.

Could the _seq_cst fence (B) also be _release ? I believe the answer to that is no -- that fence is required to ensure that the stores and loads of turn in both threads agree on the order in which turn is written (in shared-memory).


Notes on x86/x86_64

For the x86/x86_64, for BYTE, WORD, DWORD and QWORD atomics:

  1. _release and _relaxed stores are the same, and compile to simple writes.

  2. _acquire, _consume and _relaxed loads are the same, and compile to simple reads.

  3. except for _seq_cst all fences are the same, and compile to nothing at all.

  4. _seq_cst fences compile to MFENCE.

  5. all exchanges, including compare-exchanges, are _seq_cst, and compile to an instruction with a LOCK prefix (or an instruction with an implied LOCK prefix).

  6. for _seq_cst loads/stores, by convention: loads compile to simple reads and stores compile to either MOV+MFENCE or (LOCK) XCHG -- for more on the convention, see below.

...provided the value is correctly aligned or (since the P6!) does not cross a cache-line boundary. [Note that I use read/write to refer to the instructions which implement the load/store operations.]

So, the lock() with fences, for thread 0, will compile to (roughly):

      MOV  [flag0], $1        -- flag0.store(1, std::memory_order_relaxed)
      MOV  [turn],  $1        -- turn.store(1, std::memory_order_relaxed)
      MFENCE                  -- std::atomic_thread_fence(std::memory_order_seq_cst)
      JMP  check
    wait:                     -- while
      CALL ....               -- std::this_thread::yield()
    check:
      MOV  eax, [flag1]       -- flag1.load(std::memory_order_relaxed)
      TEST eax, eax
      JZ   gotit              -- have lock if !flag1
      MOV  eax, [turn]        -- 1 == turn.load(std::memory_order_relaxed)
      CMP  eax, $1
      JZ   wait               -- must wait if turn == 1
    gotit:

where all the memory operations are simple read/write, and there is the one MFENCE. MFENCE is not cheap, but is the minimum overhead required to make this thing work.

From my understanding of the x86/x86_64, I can say that the above will work.


Returning to the Original Question

The original code is not valid C++ and the result of compiling it is uncertain.

However, when compiled for x86/x86_64, it (in all probability) will in fact work. The reasons for that are interesting.

For those of a nervous disposition, let me be crystal clear: in what follows when I say that 'X' "works" I mean that when compiled for x86/x86_64, using the current common mechanisms to implement atomic operations on the x86/x86_64, the code generated will give the intended result. This does not make 'X' correct C++ and certainly does not mean that it will give the intended result on other machines.

So the original code may be expected to compile to one of:

    # MOV+MFENCE version      |   # (LOCK) XCHG version
      MOV  [flag0], $1        |     MOV  [flag0], $1
      MOV  [turn],  $1        |     MOV  eax,  $1  
      MFENCE                  |     XCHG [turn], eax   # LOCK is implicit
      .... as above           |     .... ditto

and both versions work.

In the original code the turn.exchange(1, std::memory_order_acq_rel) will compile to the (LOCK) XCHG version -- making it, in fact, _seq_cst (because all exchanges on the x86/x86_64 are _seq_cst).

NB: in general, turn.exchange(1, std::memory_order_acq_rel) is not equivalent to turn.store(1) -- you need turn.exchange(1, std::memory_order_seq_cst) for that. It's just that on the x86/x86_64 they compile to the same thing.

For turn.store(1) the compiler may choose either the MFENCE or the (LOCK) XCHG version -- they are functionally equivalent.

Now, what is required here is a store. It's possible that the compiler will prefer the (LOCK) XCHG version for that (though I doubt it). But I see no point in second guessing the compiler and forcing it to use (LOCK) XCHG. [It's possible that the compiler might spot that the return value of the turn.exchange() is being ignored, and therefore use the MFENCE... but there's still no reason to second guess the compiler.]

The original question was "Is there any reason for using 'exchange' method?". And the answer to that, finally (!), is no -- for the reasons given.


More on x86/x86_64 and Load/Store _seq_cst Convention(s)

On x86/x86_64, to store and load some variable _seq_cst requires either:

  • an MFENCE (somewhere) between the write and the read of the variable.

    Conventionally, the MFENCE is treated as part of _seq_cst stores (write+mfence), so that a load _seq_cst maps to a simple read.

    Alternatively, the MFENCE could be treated as part of the load (mfence+read), but on the basis that loads tend to outnumber stores, the (significant) overhead is assigned to the stores.

or:

  • a LOCK XCHG for the write or a LOCK XADD $0 for the read of the variable.

    Conventionally, a LOCK XCHG is used for the write (xchg-write), so that, again, a load _seq_cst maps to a simple read.

    Alternatively, a LOCK XADD $0 could be used for the load (xadd-read), so that a store would map to a simple write. But for the same reason as above, this is not done.

If there were no such convention, both _seq_cst load and store operations would have to carry the MFENCE or XCHG/XADD overhead. This would have the advantage that a _seq_cst load after a not-_seq_cst store would work -- but at significant cost. The Standard does not require such "mixed" memory-order combinations to work, so this extra cost can be avoided. [The limitation in the Standard is not arbitrary !]

For the avoidance of doubt: it is essential that the same convention is followed throughout -- across an application, the libraries it uses and the kernel. The write+mfence/xchg-write convention (overhead on the store) has the edge over the mfence+read/xadd-read convention (overhead on the load), and is definitely better than no convention at all. So, the write+mfence/xchg-write convention is the de facto standard.

[For a summary of the mapping of the simple atomic operations to instructions, for all memory-orders, for a number of common processors see https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. For x86/x86_64 nearly every load/store maps to a simple read/write, and all exchanges and cmp-exchanges map to a LOCKed instruction (so are all _seq_cst). This is not true of ARM, POWERPC and others, so the correct choice of memory-order is essential.]



Source: https://stackoverflow.com/questions/61399770/difference-btw-atomic-exchange-without-return-val-and-store-its-about-peters
