Do the Linux glibc pthread functions on x86_64 act as fences for weakly-ordered memory accesses? (pthread_mutex_lock/unlock are the exact functions I'm interested in.)
Non-temporal stores need an sfence instruction to be ordered properly with respect to later stores.
However, an efficient user-level implementation of a simple mutex assumes that the mutex is released by a plain store, which does not imply a write-buffer flush, in contrast to atomic read-modify-write operations such as lock cmpxchg, which imply a full memory fence.
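To make the concern concrete, here is a minimal sketch of the kind of lock I have in mind. It is not glibc's implementation; `spin_mutex`, `spin_lock`, and `spin_unlock` are hypothetical names used only for illustration. On x86_64, compilers typically emit `lock cmpxchg` for the compare-exchange and an ordinary `mov` for the release store.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical minimal lock, for illustration only (not glibc's mutex). */
typedef struct {
    atomic_int locked; /* 0 = free, 1 = held */
} spin_mutex;

void spin_lock(spin_mutex *m)
{
    int expected = 0;
    /* Acquire with an atomic read-modify-write. On x86_64 this is
     * typically a lock cmpxchg, which acts as a full fence. */
    while (!atomic_compare_exchange_weak_explicit(
               &m->locked, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0; /* a failed CAS overwrote it with the current value */
    }
}

void spin_unlock(spin_mutex *m)
{
    /* Debug-only sanity check; compiled out when NDEBUG is defined. */
    assert(atomic_load_explicit(&m->locked, memory_order_relaxed) == 1);
    /* Release with a plain store. On x86_64 this is typically an ordinary
     * mov with no fence instruction. */
    atomic_store_explicit(&m->locked, 0, memory_order_release);
}
```

The release store is sufficient for ordinary (write-back) stores under the x86 memory model; the question is what happens to weakly-ordered stores issued before it.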
So there may be a situation in which the unlock does not provide store-release semantics for non-temporal stores. These SSE stores could then be reordered after the unlock and become visible only after another thread has acquired the mutex.
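For reference, this is a sketch of the pattern I am asking about, with the conservative workaround of an explicit `_mm_sfence()` before the unlock. The names `publish`, `consume`, `shared_buf`, `ready`, and `buf_lock` are made up for this example, and whether the fence is actually required with glibc's mutex is exactly what I don't know.

```c
#include <assert.h>
#include <emmintrin.h> /* _mm_stream_si32 (SSE2) */
#include <pthread.h>
#include <stddef.h>
#include <xmmintrin.h> /* _mm_sfence */

#define BUF_LEN 16

static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_buf[BUF_LEN]; /* filled with non-temporal stores */
static int ready;               /* protected by buf_lock */

/* Hypothetical producer: fills the buffer with non-temporal stores. */
void publish(int value)
{
    pthread_mutex_lock(&buf_lock);
    for (int i = 0; i < BUF_LEN; i++)
        _mm_stream_si32(&shared_buf[i], value); /* movnti, weakly ordered */

    /* Conservative assumption: the unlock might be a plain store, so
     * order the non-temporal stores explicitly before releasing. */
    _mm_sfence();

    ready = 1;
    pthread_mutex_unlock(&buf_lock);
}

/* Hypothetical consumer: returns 1 and copies the first element out
 * if data has been published, 0 otherwise. */
int consume(int *out)
{
    assert(out != NULL);
    pthread_mutex_lock(&buf_lock);
    int ok = ready;
    if (ok)
        *out = shared_buf[0];
    pthread_mutex_unlock(&buf_lock);
    return ok;
}
```

Without the `_mm_sfence()` call, correctness of `consume` would depend on `pthread_mutex_unlock` itself ordering the earlier non-temporal stores.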