From Herb Sutter's talk, see the figure on page 2 of the slides: https://skydrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&wdo=
The store buffer is not a cache, it's an ordering queue. It holds pending stores, while the cache can be thought of as a logical part of memory (i.e. everything in any of the caches is visible to all other agents and must answer snoops correctly).
Stores are not reordered; doing so would break memory ordering, because they would become immediately visible (unlike loads, which only affect internal state).
Fences do not work on caches, and have nothing to do with other cores. Caches are already fully visible and synchronized. Fences only apply to execution order (in case it's done out-of-order internally), and therefore apply only to the current context.
Is it correct to say that:
- SFENCE does a "push", i.e. it flushes Store Buffer->L1, and then sends the changes from the caches of Core0-L1/L2 to all other cores Core1/2/3...-L1/L2?
- LFENCE does a "pull", i.e. it receives changes from the caches of all other cores Core1/2/3...-L1/L2 (and Store Buffer?) into our core Core0-L1/L2?
sfence/mfence would flush the store buffer, as they won't allow pending speculative stores to remain (that's why they're fencing). However, as I said, once the changes are in L1 they're already observable by anyone; they don't have to be flushed anywhere further away.
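To make the store-buffer point concrete, here is a minimal sketch of the classic store-buffering litmus test in portable C++. It assumes that the compiler implements std::atomic_thread_fence(std::memory_order_seq_cst) on x86 with mfence or a locked instruction (a common choice, but not something the standard promises). The helper name count_both_zero is made up for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffering litmus test `iterations`
// times and returns how often BOTH threads read 0.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            // Full fence: the store to x may not stay pending in the
            // store buffer while the load of y below executes.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });

        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With both fences in place the function should always return 0. If the two fences are removed, r1 == 0 && r2 == 0 becomes a permitted outcome, because each store can still be sitting in its core's store buffer when the other core's load executes. Note that nothing here pushes data to another core's cache: the fence only orders the current thread's own store and load.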
In the same sense, lfence doesn't "pull" anything; it just stalls the execution of all younger loads until the older ones (and the fence itself) have finished and committed. This affects performance by serializing the loads, but does not otherwise protect you against any operation on other cores, unless you have another way to make sure that any store you require has been performed by then (in which case the load result is updated in time).
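The "another way to make sure the store has been performed" is typically a flag that the reader polls. A minimal sketch in portable C++, assuming a release store paired with an acquire load is acceptable as that mechanism (the function name receive_payload and the value 42 are made up for this illustration):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: passes one int from a producer thread to a
// consumer thread through a flag, and returns what the consumer saw.
int receive_payload() {
    int payload = 0;                  // plain, non-atomic data
    std::atomic<bool> ready{false};   // the flag the consumer polls
    int seen = 0;

    std::thread producer([&] {
        payload = 42;
        // Release: the write to payload is ordered before the flag store.
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Nothing is "pulled" here. The loop simply re-reads the flag
        // until the producer's store has become visible.
        while (!ready.load(std::memory_order_acquire)) {
        }
        // Acquire: this read is ordered after the flag load, so it
        // observes the payload written before the release store.
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

The spin loop is what guarantees that the producer's store has happened; the acquire ordering only keeps the later read of payload from being performed ahead of the flag read. A load fence on its own, without the loop, gives no information about what another core has done.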