I originally thought cache coherency protocols such as MESI can provide pseudo-atomicity but only across individual memory-load/store instructi
Reading the excerpt you give, I don't find it contradictory to using of LOCK-ed instruction. For example, consider INC
instruction. Without the LOCK
, it can read the original value having its cache line in SHARED
state which does not prevent other cores on the same cache from concurrent reading of the same value before storing the same incremented result = data race.
I interpret the quote as the data integrity is guaranteed per cache line granularity, the additional care may not be necessary when the data fits one cache line. But if the the data crosses the boundary of two cache lines, it is necessary to assert that modifications for both of them will be treated atomically.
There's a difference between locking as a concept, and the actual bus #lock signal - the latter is one of the means of implementing the first. Cache locking is another one that is much simpler and more efficient.
MESI protocol guarantees that if a line is held exclusively by a certain core (either modified or not), no one else has it. In this case you can perform multiple operations atomically by adding simple flag in the cache that blocks external snoops until the operations are done. This would have the same effect as the lock concept dictates since no one else may change or even observe the intermediate values.
On more complicated cases, the line is not held by a single cache (for e.g. it may be shared between several ones, or the access may be split between two cache lines and only one is in your cache - the list of scenarios is usually implementation specific and probably not disclosed by the CPU manufacturer) - in such cases you may have to resort to "heavier" cannons like the bus lock, which usually guarantees no one can do anything on the shared bus. Obviously this has a huge impact on performance so this is probably only used when you have no other choice. In most cases a simple cache-level lock should be enough. Note that new schemes like Intel TSX seem to work in a similar manner, offering optimizations when you're working from within the cache.
By the way - your assumption about pseudo-atomicity for individual instruction is also wrong - it would be correct if you referred to a single memory operation (load or store), since an instruction may include multiple ones (inc [addr]
for e.g. would not be atomic without a lock). Another restriction which also appears in your quote is that the access needs to be contained in a cache line - split lines don't guarantee atomicity even within a single load or store (since they're usually implemented as 2 memory operations that are later merged).