Simplified question:
Is there a difference in the timing of memory cache coherency (or "flushing") caused by Interlocked operations compared to memory barriers? Let's consider C# specifically: any Interlocked operation vs. Thread.MemoryBarrier(). I believe there is a difference.
Background:
I have read quite a lot of information about memory barriers - all of it about preventing specific kinds of reordering of memory-interaction instructions, but I couldn't find consistent information on whether they should cause immediate flushing of read/write queues.
I actually found a few sources mentioning that there is NO guarantee on the immediacy of the operations (only the prevention of specific reordering is guaranteed). E.g.
Wikipedia: "However, to be clear, it does not mean any operations WILL have completed by the time the barrier completes; only the ORDERING of the completion of operations (when they do complete) is guaranteed"
Freebsd.org (barriers are HW specific, so I guess a specific OS doesn't matter): "memory barriers simply determine relative order of memory operations; they do not make any guarantee about timing of memory operations"
On the other hand, Interlocked operations - by their definition - cause the memory subsystem to lock the entire cache line containing the value, to prevent access (including reads) from any other CPU/core until the operation is done.
Am I correct or am I mistaken?
Disclaimer:
This is an evolution of my original question here: Variable freshness guarantee in .NET (volatile vs. volatile read)
EDIT1: Fixed my statement about Interlocked operations - inlined in the text.
EDIT2: Completely removed the demonstration code and its discussion (as some complained about too much information)
To understand C# interlocked operations, you need to understand Win32 interlocked operations.
The "pure" interlocked operations themselves only affect the freshness of the data directly referenced by the operation.
But in Win32, interlocked operations used to imply a full memory barrier. I believe this is mostly to avoid breaking old programs on newer hardware. So InterlockedAdd does two things: an interlocked add (very cheap, does not affect caches) and a full memory barrier (a rather heavy operation).
Later, Microsoft realized this is expensive, and added versions of each operation that perform no, or only a partial, memory barrier.
So there are now (in the Win32 world) four versions of almost everything, e.g. InterlockedAdd (full fence), InterlockedAddAcquire (read fence), InterlockedAddRelease (write fence), and the pure InterlockedAddNoFence (no fence).
In the C# world, there is only one version, and it matches the "classic" InterlockedAdd - the one that also performs a full memory fence.
Short answer: CAS (Interlocked) operations have been (and most likely will remain) the quickest cache flushers.
Background:
- CAS operations are supported in hardware by a single uninterruptible instruction. Compare that to a thread calling a memory barrier, which can be swapped out right after placing the barrier but just before performing any reads/writes (so the consistency guaranteed by the barrier is still met).
- CAS operations are the foundation of the majority of (if not all) high-level synchronization constructs (mutexes, semaphores, locks - look at their implementations and you will find CAS operations). They would not likely be used if they didn't guarantee immediate cross-thread state consistency, or if there were other, faster mechanism(s).
At least on Intel devices, a number of machine-code operations can be prefixed with a LOCK prefix, which ensures that the following operation is treated as atomic, even if the underlying data type won't fit on the data bus in one go - for example, LOCK CMPXCHG8B atomically compares and exchanges a 64-bit value even on a 32-bit bus, and won't be interleaved with accesses from other cores. As far as I am aware, the Memory Barrier construct is basically a CAS-based spinlock that causes a thread to wait for some condition to be met, such as no other threads having any work to do. This is clearly a higher-level construct, but make no mistake: there's a condition check in there, it's likely to be atomic and CAS-protected, and you're still going to pay the cache-line price when you reach a memory barrier.
Source: https://stackoverflow.com/questions/24726904/memory-barrier-vs-interlocked-impact-on-memory-caches-coherency-timing