The Intel 64 and IA-32 Architectures Software Developer's Manual says the following about re-ordering of actions by a single processor (Section 8.2.2, "Memory Ordering in P6 and More Recent Processor Families"): "Reads may be reordered with older writes to different locations but not with older writes to the same location."
The naming is a bit awkward. The "forwarding" happens inside a core/logical processor, as follows. If you first do a STORE, it will go to the store buffer to be flushed to memory asynchronously. If you do a subsequent LOAD to the same location ON THE SAME PROCESSOR before the value is flushed to the cache/memory, the value from the store buffer will be "forwarded" and you will get the value that was just stored. The read is "passing" the write in that it happens before the actual write from store-buffer to memory (which has yet to happen).
Actually, the statement isn't saying much if you only care about the ordering rules: this forwarding is a detail of how the processor internally guarantees that (on one processor) reads are not reordered with older writes to the same location, which is part of the rule you quoted.
Despite what some of the other answers here state, there is (at least as far as ordering guarantees go) NO store-buffer forwarding/snooping between processors/cores, as the 8.2.3.5 "Intra-Processor Forwarding Is Allowed" example in the manual shows.
I'd guess that the hang-up is the notion of a "store-buffer". The starting point is the great disparity between the speed of a processor core and the speed of memory. A modern core can easily execute a dozen instructions in a nanosecond, but a RAM chip can require 150 nanoseconds to deliver a value stored in memory. That is an enormous mismatch; modern processors are filled to the brim with tricks to work around that problem.
Reads are the harder problem to solve: a processor stalls and cannot execute any code while it waits for the memory sub-system to deliver a value. An important sub-unit in a processor is the prefetcher, which tries to predict which memory locations the program will load so it can tell the memory sub-system to read them ahead of time. Physical reads therefore occur much sooner than the logical loads in your program.
Writes are easier: the processor has a buffer for them, which you can model as a queue in software. The execution engine can quickly dump a store instruction into the queue and doesn't get bogged down waiting for the physical write to occur. This is the store-buffer. Physical writes to memory therefore occur much later than the logical stores in your program.
The trouble starts when your program uses more than one thread and the threads access the same memory locations, because those threads will run on different cores. That raises many problems, and ordering becomes very important: the early reads performed by the prefetcher can return stale values, and the late writes performed by the store buffer make it worse yet. Solving this requires synchronization between the threads, which is very expensive; a processor is easily stalled for dozens of nanoseconds waiting for the memory sub-system to catch up. Instead of making your program faster, threads can actually make it slower.
The processor can help; store-buffer forwarding is one such trick. A logical read can pass a physical write when the store is still sitting in the buffer and has not been flushed to memory yet. What store-buffer forwarding does is look through the pending stores in the buffer and find the latest write that matches the read address. That "forwards" the store in time, making it look as though it executed earlier than it actually will. The read gets the real value, the one that eventually ends up in memory, instead of a stale one; the read no longer passes the write. Note that this works only within a single core: a core forwards its own pending stores to its own loads.
Actually writing a program that takes advantage of store-buffer forwarding is rather inadvisable. Aside from the very iffy timing, such a program will port very, very poorly. Intel processors have a strong memory model with the ordering guarantees it provides, but you can't ignore the kinds of processors that are popular on mobile devices these days, which consume a lot less power by not providing such guarantees.
And the feature can in fact be very detrimental: it hides synchronization bugs in your code, which are the worst possible bugs to diagnose. Micro-processors have been staggeringly successful over the past 30 years. They did not, however, get easier to program.
8.2.3.5 "Intra-Processor Forwarding Is Allowed" explains an example of store-buffer forwarding:
Initially x = y = 0
    Processor 0      Processor 1
    =============    =============
    mov [x], 1       mov [y], 1
    mov r1, [x]      mov r3, [y]
    mov r2, [y]      mov r4, [x]
The result r2 == 0 and r4 == 0 is allowed. ... the reordering in this example can arise as a result of store-buffer forwarding. While a store is temporarily held in a processor's store buffer, it can satisfy the processor's own loads but is not visible to (and cannot satisfy) loads by other processors.
The statement that says reads can't be reordered with writes to the same location ("Reads may be reordered with older writes to different locations but not with older writes to the same location") is in a section that applies to "a single-processor system for memory regions defined as write-back cacheable". The "store-buffer forwarding" example, by contrast, describes what other processors can observe in a multi-processor system; it does not violate the single-processor rule.