Consider the following code, which writes the same values to the same memory locations from multiple threads:

    void f(int* buf, int n, int* p) {
        for (int i = 0; i < n; ++i)
            buf[i] = 1;   // every thread stores the same value; p is unused here
    }
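For concreteness, a minimal driver could run f from two threads over one shared buffer. The buffer size below is an arbitrary illustrative choice, and note that the concurrent stores are formally a data race (undefined behavior in C++), which is exactly the situation under discussion:

    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> buf(1024, 0);   // zeroed region, as in the example below
        std::thread t1(f, buf.data(), (int)buf.size(), nullptr);
        std::thread t2(f, buf.data(), (int)buf.size(), nullptr);
        t1.join();
        t2.join();
        // In practice this prints 1 on cache-coherent hardware; formally the
        // unsynchronized stores above are still a data race.
        std::printf("%d\n", buf[buf.size() / 2]);
        return 0;
    }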
The key point here is, as @Maxim said, cache coherency: on a cache-coherent architecture this indeed cannot happen.
However, it can go wrong on a machine with no cache coherency. I don't know of a specific architecture, and although such machines are almost extinct due to natural selection, as far as I know some remain. (If you know an example, please comment.)
Here is a table that represents an execution of two threads filling a zeroed region of memory with ones. For brevity the example is scaled down by a factor of 32, i.e. each digit represents one of the 4-byte ints in question. The cache line size is 4 ints == 4 digits. The rows marked "(flush)" are points where the on-chip cache is flushed to main memory. In reality the timing is non-deterministic: a flush may happen at any moment, e.g. due to a preemptive task switch.
    Core 1 cache          Memory          Core 2 cache
    ----------------------------------------------------------
                          0000
    0000 (load cache)     0000
    1000 (set 1st bit)    0000
    1100 (set 2nd bit)    0000            0000 (load cache)
    **** (flush)          1100
                          1100            1000 (set 1st bit)
                          1000            **** (flush)
                          1000            1000 (load cache)
                          1000            1100 (set 2nd bit)
    1000 (load cache)     1000            1110 (set 3rd bit)
    1010 (set 3rd bit)    1000            1111 (set 4th bit)
    1011 (set 4th bit)    1111            **** (flush)
    **** (flush)          1011
So we end up with a wrong result: the second int is still zero, even though both threads wrote a one to it.
I emphasize again that this counter-example is valid only on cache-incoherent machines.
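To make the interleaving above concrete, here is a minimal sketch that replays the table in software: two "cores" hold private write-back copies of the same line and flush them back whole. The Cache struct and the fixed schedule are illustrative assumptions, not a model of any real machine:

    #include <array>
    #include <cstdio>

    using Line = std::array<int, 4>;   // one cache line: 4 ints, as in the table

    struct Cache {
        Line data{};
        void load(const Line& mem)  { data = mem; }   // fetch the whole line
        void flush(Line& mem) const { mem = data; }   // write the whole line back
    };

    int main() {
        Line mem{0, 0, 0, 0};
        Cache c1, c2;

        c1.load(mem);        // core 1 loads 0000
        c1.data[0] = 1;      // core 1: set 1st int
        c1.data[1] = 1;      // core 1: set 2nd int
        c2.load(mem);        // core 2 loads 0000 (core 1 hasn't flushed yet)
        c1.flush(mem);       // memory: 1100
        c2.data[0] = 1;      // core 2: set 1st int
        c2.flush(mem);       // memory: 1000 -- core 1's 2nd write is lost
        c2.load(mem);
        c2.data[1] = 1;      // core 2: set 2nd int
        c1.load(mem);        // core 1 reloads 1000
        c2.data[2] = 1;      // core 2: set 3rd int
        c1.data[2] = 1;      // core 1: set 3rd int
        c2.data[3] = 1;      // core 2: set 4th int
        c1.data[3] = 1;      // core 1: set 4th int
        c2.flush(mem);       // memory: 1111
        c1.flush(mem);       // memory: 1011 -- wrong final result

        for (int v : mem) std::printf("%d", v);   // prints 1011
        std::printf("\n");
        return 0;
    }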
Memory models, with regard to multi-threading, concern when the effects of writes made by one thread become observable by another thread. In the code you posted both threads write the same values into the same memory locations, so it doesn't matter which thread's write a subsequent read of buf[n/2] observes: either will do.
Modern processors employ cache-coherency protocols, such as MESI, so when the threads write to the buffer concurrently there is going to be a lot of traffic between the CPUs to synchronize the cache lines holding the buffer, making the code run much slower than in a non-concurrent scenario (the false sharing effect).
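A rough way to observe that coherency traffic is to time two threads storing into two ints that share a cache line versus two ints forced onto separate lines. The 64-byte line size and the iteration count below are assumptions typical of current x86 hardware:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <thread>

    struct Packed {                          // both counters very likely in one line
        std::atomic<int> a{0};
        std::atomic<int> b{0};
    };

    struct Padded {                          // each counter on its own 64-byte line
        alignas(64) std::atomic<int> a{0};
        alignas(64) std::atomic<int> b{0};
    };

    template <typename T>
    long run() {
        T s;
        auto work = [](std::atomic<int>& x) {
            for (int i = 0; i < 50000000; ++i)
                x.store(1, std::memory_order_relaxed);   // same value every time
        };
        auto start = std::chrono::steady_clock::now();
        std::thread t1(work, std::ref(s.a));
        std::thread t2(work, std::ref(s.b));
        t1.join();
        t2.join();
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    }

    int main() {
        std::printf("shared cache line: %ld ms\n", run<Packed>());
        std::printf("padded lines:      %ld ms\n", run<Padded>());
        return 0;
    }

On a typical machine the Packed version runs noticeably slower, even though the two threads never touch the same int.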
Here it doesn't matter if the writes are atomic or not, since both threads write the same values to the same memory locations. There is a race, but it doesn't matter which thread wins because the observed values are going to be the same even with partial writes.
There is a race, but in your example both threads will write the same values to the same addresses. Since you are not doing any read-modify-writes, but just writing predetermined numbers, this will be safe in most cases. Writing an int is an atomic instruction on most systems. The exception would be running this code on an 8-bit microprocessor that uses a sequence of instructions to store an int. Even there it may still work, but that depends on the implementation of the library code that performs the multi-byte store.
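To see why even a torn multi-byte store is harmless when both threads write identical values, one can spell an int store out as individual byte writes; store_bytewise below is a hypothetical stand-in for what an 8-bit target's code generator or library routine might emit:

    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    // Hypothetical stand-in for a compiler-generated multi-byte store on an
    // 8-bit target: the int is written one byte at a time, so another thread's
    // store to the same int can interleave with it at byte granularity.
    void store_bytewise(int* dst, int value) {
        unsigned char bytes[sizeof(int)];
        std::memcpy(bytes, &value, sizeof(int));
        for (std::size_t i = 0; i < sizeof(int); ++i)
            reinterpret_cast<unsigned char*>(dst)[i] = bytes[i];   // one byte per step
    }

    int main() {
        int x = 0;
        // If two threads both execute store_bytewise(&x, 1), any interleaving of
        // the byte writes leaves x == 1, because the byte written at each offset
        // is identical in both threads.
        store_bytewise(&x, 1);
        std::printf("%d\n", x);   // prints 1
        return 0;
    }

The caveat above still stands, though: if the library routine did something other than plain byte stores (e.g. a read-modify-write of a containing word), this reasoning would no longer hold.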