I thought that I understood how an L1D write miss is handled, but thinking carefully about it left me confused.
Here is an assembly language fragment:
Starting an RFO can be separate from placing the store data into an LFB; e.g. starting RFOs early for entries that aren't yet at the head of the store buffer can allow memory-level parallelism for stores. What you've proved is that for that to happen, store data can't always move into an LFB (Line Fill Buffer, also used for NT / WC stores).
If an RFO could only happen by moving store data from the store buffer (SB) into an LFB, then yes, you could only RFO for the head of the SB, not in parallel for any graduated entry. (A "graduated" store is one whose uops have retired from the ROB, i.e. become non-speculative.) But if you don't have that requirement, you could RFO even earlier, even speculatively, but you probably wouldn't want to (see Footnote 1).
(Given @BeeOnRope's findings about how multiple cache-miss stores to the same line can commit into an LFB, and then another LFB for another line, this might be the mechanism for having multiple RFOs in flight, not just the SB head. We'd have to check if an ABA store pattern limited memory-level parallelism. If that's the case, then maybe starting an RFO is the same as moving the data from the SB to an LFB, freeing that SB entry. But note that the new head of the SB still couldn't commit until those pending RFOs complete and commit the stores from the LFBs.)
On a store miss, the store buffer entry holds the store data until the RFO is complete, and commits straight into L1d (flipping the line from Exclusive to Modified state). Strong ordering is ensured by in-order commit from the head of the store buffer (see Footnote 2).
As @HadiBrais wrote in an answer to "Where is the Write-Combining Buffer located? x86":
My understanding is that for cacheable stores, only the RFO request is held in the LFB, but the data to be stored waits in the store buffer until the target line is fetched into the LFB entry allocated for it. This is supported by the following statement from Section 2.4.5.2 of the Intel optimization manual:
The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up to 36 store operations from allocation until the store value is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores.
This mental model is pretty much fine for thinking about performance tuning, but probably not for reasoning about MDS vulnerabilities, which can speculatively use stale data that faulting loads read from an LFB or whatever.
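As a toy illustration of that simple mental model (store data waits in the SB until the RFO completes, then commits in order from the head), here is a minimal Python sketch. It is not a description of real hardware: the function name `drain_store_buffer`, the `rfo_latency` value and the one-cycle commit cost are all made up for illustration, and it deliberately serializes RFOs, so it shows no memory-level parallelism at all.

```python
from collections import deque

def drain_store_buffer(stores, owned_lines, rfo_latency=100, line_size=64):
    """Toy model: commit stores strictly in program order from the SB head.

    stores      -- list of (address, value) pairs in program order
    owned_lines -- line numbers L1d already holds in Exclusive/Modified state
    Returns (l1d contents, total cycles, commit order).
    All latencies are hypothetical numbers, not measurements.
    """
    sb = deque(stores)
    owned = set(owned_lines)
    l1d = {}
    cycles = 0
    order = []
    while sb:
        addr, value = sb[0]            # only the head entry may commit
        line = addr // line_size
        if line not in owned:
            cycles += rfo_latency      # data stays in the SB while the RFO is pending
            owned.add(line)
        l1d[addr] = value              # commit straight into L1d
        order.append(addr)
        cycles += 1                    # assumed 1 cycle per commit
        sb.popleft()
    return l1d, cycles, order

# Store to an owned line, then a miss, then the owned line again.
l1d, cycles, order = drain_store_buffer([(0, 1), (64, 2), (8, 3)], owned_lines={0})
print(order, cycles)   # [0, 64, 8] 103
```

Note how the third store hits in cache but still waits behind the miss: that is the in-order commit that gives x86 its store ordering.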
Any store coalescing or other tricks must necessarily respect the memory model.
We know CPUs can't violate their memory model, and that speculation + roll back isn't an option for commit to globally-visible state like L1d, or for graduated stores in general because the uops are gone from the ROB. They've already happened as far as local OoO exec is concerned; it's just a matter of when they'll become visible to other cores. Also we know that LFBs themselves are not globally visible. (There's some indication that LFBs are snooped by loads from this core, like the store buffer, but as far as MESI states go, they're more like an extension of the store buffer.)
@BeeOnRope has done some more experiments, finding some evidence that a series of stores like AAABBCCCC can drain into three LFBs, for lines A, B, C. There is an RWT thread with an experiment that demonstrates a 4x perf difference predicted by this theory.
This implies that the CPU can track order between LFBs, although still not within a single LFB of course. A sequence like AAABBCCCCA (or ABA) would not be able to commit past the final A store because the "current head" LFB is for line C, and there's already an LFB waiting for line A to arrive. A 4th line (D) would be ok, opening a new LFB, but adding to an already-open LFB waiting for an RFO that isn't the head is not ok. See @Bee's summary in comments.
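The rule described above can be sketched as a small Python model. This is only a reading of the hypothesis in this answer, not confirmed hardware behaviour: the function name `stores_drained_before_stall` is hypothetical, every store is assumed to miss in L1d, and the number of available LFBs is assumed to be unlimited.

```python
def stores_drained_before_stall(lines):
    """Count how many stores can leave the SB into LFBs before one must wait.

    lines -- cache-line labels, one per store, in program order.
    Assumptions (hypothetical model): every store misses, LFBs never run out,
    and no RFO completes while we are draining.
    """
    open_lfbs = []                      # in allocation order; the last one is the "current head" LFB
    for i, line in enumerate(lines):
        if open_lfbs and open_lfbs[-1] == line:
            continue                    # same line as the current head LFB: merge into it
        if line in open_lfbs:
            return i                    # older open LFB for this line: not allowed, stall here
        open_lfbs.append(line)          # new line: open a new LFB
    return len(lines)

print(stores_drained_before_stall("AAABBCCCC"))    # 9  (three LFBs: A, B, C)
print(stores_drained_before_stall("AAABBCCCCA"))   # 9  (final A must wait)
print(stores_drained_before_stall("ABA"))          # 2
print(stores_drained_before_stall("AAABBCCCCD"))   # 10 (D opens a 4th LFB)
```

Within this model, order is only tracked between LFBs, which is why revisiting a line that already has an older open LFB is the one case that has to stall.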
All of this is only tested for Intel CPUs, AFAIK.
(This section has not been updated in light of @BeeOnRope's new discovery.)
There's also no solid evidence of any kind of store merging / coalescing in the store buffer on modern Intel or AMD CPUs, or of using a WC buffer (LFB on Intel) to hold store data while waiting for a cache line to arrive. See the discussion in comments under "Are two store buffer entries needed for split line/page stores on recent Intel?". We can't rule out some minor form of it near the commit end of the store buffer.
We know that some weakly-ordered RISC microarchitectures definitely do merge stores before they commit, especially to create a full 4-byte or 8-byte write of a cache ECC granule to avoid an RMW cycle. But Intel CPUs don't have any penalty for narrow or unaligned stores within a cache line.
For a while @BeeOnRope and I thought there was some evidence of store coalescing, but we've changed our minds. "Size of store buffers on Intel hardware? What exactly is a store buffer?" has some more details (and links to older discussions).
(Update: and now there is finally evidence of store coalescing, and an explanation of a mechanism that makes sense.)
Footnote 1: An RFO costs shared bandwidth and steals the line from other cores, slowing them down. And you might lose the line again before you get to actually commit into it if you RFO too early. LFBs are also needed for loads, which you don't want to starve (because execution stalls when waiting for load results). Loads are fundamentally different from stores, and generally prioritized.
So waiting at least for the store to graduate is a good plan, and maybe only initiating RFOs for the last few store-buffer entries before the head. (You need to check if L1d already owns the line before starting an RFO, and that takes a cache read port for at least the tags, although not data. I might guess that the store buffer checks 1 entry at a time and marks an entry as likely not needing an RFO.) Also note that 1 SB entry could be a misaligned cache-split store and touch 2 cache lines, requiring up to 2 RFOs...
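To make the cache-split case concrete, here is a minimal Python sketch that computes which cache lines a store touches, assuming 64-byte lines. The helper name `lines_touched` is made up for illustration.

```python
def lines_touched(addr, size, line_size=64):
    """Return the cache-line numbers covered by a store of `size` bytes at `addr`."""
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return list(range(first, last + 1))

print(lines_touched(56, 8))   # [0]     8-byte store ending exactly at the line boundary
print(lines_touched(60, 8))   # [0, 1]  cache-split store: up to 2 RFOs
```

Any store no wider than one line touches at most two lines, which is where the "up to 2 RFOs" per SB entry comes from.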
Footnote 2: Store buffer entries are allocated in program order (at the tail of the buffer), as instructions / uops are issued into the out-of-order back end and have back-end resources allocated for them (e.g. a physical register for uops that write a register, a branch-order-buffer entry for conditional branch uops that might mispredict). See also "Size of store buffers on Intel hardware? What exactly is a store buffer?". In-order alloc and commit guarantee program-order visibility of stores. The store buffer insulates globally-visible commit from out-of-order speculative execution of store-address and store-data uops (which write store-buffer entries), and decouples execution in general from waiting for cache-miss stores, until the store buffer fills up.
PS Intel calls the store buffer + load buffers collectively the memory order buffer (MOB), because they need to know about each other to track speculative early loads. This isn't relevant to your question, only for the case of speculative early loads and detecting memory-order mis-speculation and nuking the pipeline.
For retired store instructions (more specifically their "graduated" store buffer entries), it is just the store buffer that has to commit to L1d in program order.