There is an illustration in the kernel source Documentation/memory-barriers.txt, like this:

    CPU 1                   CPU 2
    ======================  ======================
        { B = 7; X = 9; Y = 8; C = &Y }
    STORE A = 1
    STORE B = 2
    <write barrier>
    STORE C = &B            LOAD X
    STORE D = 4             LOAD C (gets &B)
                            LOAD *C (reads B)
From the section of the document titled "WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?":
There is no guarantee that any of the memory accesses specified before a memory barrier will be complete by the completion of a memory barrier instruction; the barrier can be considered to draw a line in that CPU's access queue that accesses of the appropriate type may not cross.
and
There is no guarantee that a CPU will see the correct order of effects from a second CPU's accesses, even if the second CPU uses a memory barrier, unless the first CPU also uses a matching memory barrier (see the subsection on "SMP Barrier Pairing").
What memory barriers do (in a very simplified way, of course) is make sure neither the compiler nor in-CPU hardware perform any clever attempts at reordering load (or store) operations across a barrier, and that the CPU correctly perceives changes to the memory made by other parts of the system. This is necessary when the loads (or stores) carry additional meaning, like locking a lock before accessing whatever it is we're locking. In this case, letting the compiler/CPU make the accesses more efficient by reordering them is hazardous to the correct operation of our program.
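As a very rough illustration of that last point, here is a minimal userspace sketch in C11 (not kernel code); the names my_lock and shared_counter are invented, and in real kernel code the equivalent ordering is provided by the barriers implied by spin_lock()/spin_unlock():

    /* Toy spinlock: acquire/release ordering keeps accesses to
     * shared_counter from leaking out of the critical section. */
    #include <stdatomic.h>

    atomic_flag my_lock = ATOMIC_FLAG_INIT;
    int shared_counter;                 /* protected by my_lock */

    void lock(void)
    {
        /* Acquire: later loads/stores may not be hoisted above the
         * point where we actually own the lock. */
        while (atomic_flag_test_and_set_explicit(&my_lock,
                                                 memory_order_acquire))
            ;                           /* spin */
    }

    void unlock(void)
    {
        /* Release: earlier loads/stores may not sink below the point
         * where we give the lock up. */
        atomic_flag_clear_explicit(&my_lock, memory_order_release);
    }

    void increment(void)
    {
        lock();
        shared_counter++;               /* safely inside the critical section */
        unlock();
    }

Without that ordering, the compiler or the CPU could move the increment before the lock is taken or after it is dropped, and the lock would protect nothing.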
When reading this document we need to keep two things in mind:

1. The compiler and the CPU are free to reorder loads and stores unless a barrier (or something that implies one) stops them.

2. Each CPU has its own cache, so a store performed by one CPU does not become visible to the other CPUs instantaneously.

Fact #2 is one of the reasons why one CPU can perceive the data differently from another. Cache systems are designed to provide good performance and coherence in the general case, but they might need some help in specific cases like the ones illustrated in the document.
In general, as the document suggests, barriers in systems involving more than one CPU should be paired to force the system to synchronize the perception of both (or all participating) CPUs. Picture a situation in which one CPU completes loads or stores and the main memory is updated, but the new data has yet to be transmitted to the second CPU's cache, resulting in a lack of coherence across the two CPUs.
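A minimal sketch of that pairing, again in userspace C11 rather than kernel code: the two atomic_thread_fence() calls stand in for the kernel's smp_wmb()/smp_rmb(), and the names payload and ready are invented for the example:

    #include <stdatomic.h>

    int payload;                        /* plain data published by the producer */
    atomic_int ready;                   /* 0 = not published, 1 = published */

    void producer(void)
    {
        payload = 42;                                   /* 1: write the data */
        atomic_thread_fence(memory_order_release);      /* "write barrier" */
        atomic_store_explicit(&ready, 1,
                              memory_order_relaxed);    /* 2: signal readiness */
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_relaxed))
            ;                                           /* wait for the signal */
        atomic_thread_fence(memory_order_acquire);      /* paired "read barrier" */
        return payload;                                 /* now guaranteed to see 42 */
    }

If the consumer omitted its barrier, the producer's write barrier alone would not stop the consumer's CPU from reading a stale payload after seeing ready == 1.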
I hope this helps. I'd suggest reading memory-barriers.txt again with this in mind, particularly the section titled "THE EFFECTS OF THE CPU CACHE".
The key missing point is the mistaken assumption that for the sequence:
    LOAD C (gets &B)
    LOAD *C (reads B)

the first load has to precede the second load. A weakly ordered architecture can act "as if" the following happened:
    LOAD B (reads B)
    LOAD C (reads &B)
    if (C != &B)
            LOAD *C
    else
            Congratulate self on having already loaded *C
The speculative "LOAD B" can happen, for example, because B was on the same cache line as some other variable of earlier interest, or because hardware prefetching grabbed it.
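To see what the matching fix looks like, here is a hypothetical userspace C11 version of the pointer-publication pattern behind LOAD C / LOAD *C. In kernel code the reader side would normally use READ_ONCE()/rcu_dereference(), which include whatever data-dependency barrier the architecture (notably Alpha) requires; the acquire load below is a conservative stand-in, and all names are invented:

    #include <stdatomic.h>

    int B = 7;                          /* initially 7, as in the illustration */
    int *_Atomic C;                     /* will be published as &B */

    void publisher(void)
    {
        B = 2;                                                 /* STORE B = 2 */
        /* Release: the new B must be visible before &B is. */
        atomic_store_explicit(&C, &B, memory_order_release);  /* STORE C = &B */
    }

    int reader(void)
    {
        /* Without at least consume/acquire ordering here, a weakly ordered
         * CPU may return the stale value 7 for *p even though it already
         * sees the new pointer -- exactly the speculative LOAD B above. */
        int *p = atomic_load_explicit(&C, memory_order_acquire);  /* LOAD C */
        return p ? *p : 0;                                         /* LOAD *C */
    }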