Question
I am on Windows 7 64-bit, VS2013 (x64 Release build) experimenting with memory orderings. I want to share access to a container using the fastest synchronization. I opted for atomic compare-and-swap.
My program spawns two threads. A writer pushes to a vector and the reader detects this.
Initially I didn't specify any memory ordering, so I assume it uses memory_order_seq_cst?
With memory_order_seq_cst the latency is 340-380 cycles per op. To try to improve performance I made stores use memory_order_release and loads use memory_order_acquire. However, the latency increased to approximately 1,940 cycles per op. Have I misunderstood something? Full code below.
Using default memory_order_seq_cst:
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> _lock{ false };
std::vector<uint64_t> _vec;
std::atomic<uint64_t> _total{ 0 };
std::atomic<uint64_t> _counter{ 0 };
static const uint64_t LIMIT = 1000000;

void writer()
{
    while (_counter < LIMIT)
    {
        bool expected{ false };
        bool val = true;
        if (_lock.compare_exchange_weak(expected, val))
        {
            _vec.push_back(__rdtsc());
            _lock = false;
        }
    }
}

void reader()
{
    while (_counter < LIMIT)
    {
        bool expected{ false };
        bool val = true;
        if (_lock.compare_exchange_weak(expected, val))
        {
            if (_vec.empty() == false)
            {
                const uint64_t latency = __rdtsc() - _vec[0];
                _total += latency;
                ++_counter;
                _vec.clear();
            }
            _lock = false;
        }
    }
}

int main()
{
    std::thread t1(writer);
    std::thread t2(reader);
    t2.detach();
    t1.join();
    std::cout << _total / _counter << " cycles per op" << std::endl;
}
Using memory_order_acquire and memory_order_release:
void writer()
{
    while (_counter < LIMIT)
    {
        bool expected{ false };
        bool val = true;
        if (_lock.compare_exchange_weak(expected, val, std::memory_order_acquire))
        {
            _vec.push_back(__rdtsc());
            _lock.store(false, std::memory_order_release);
        }
    }
}

void reader()
{
    while (_counter < LIMIT)
    {
        bool expected{ false };
        bool val = true;
        if (_lock.compare_exchange_weak(expected, val, std::memory_order_acquire))
        {
            if (_vec.empty() == false)
            {
                const uint64_t latency = __rdtsc() - _vec[0];
                _total += latency;
                ++_counter;
                _vec.clear();
            }
            _lock.store(false, std::memory_order_release);
        }
    }
}
Answer 1:
You don't have any protection against a thread taking the lock again right after releasing it, only to find _vec.empty() was not false, or to store another TSC value, overwriting one that was never seen by the reader. I suspect your change lets the reader waste more time blocking the writer (and vice versa), leading to less actual throughput.

TL:DR: The real problem was lack of fairness in your locking (too easy for a thread that just unlocked to be the one that wins the race to lock it again), and the way you're using that lock. (You have to take it before you can determine whether there's anything useful to do, forcing the other thread to retry, and causing extra transfers of the cache line between cores.)

Having a thread re-acquire the lock without the other thread getting a turn is always useless and wasted work, unlike many real cases where it takes more repeats to fill up or empty a queue. This is a bad producer-consumer algorithm (queue too small (size 1), and/or the reader discards all vector elements after reading _vec[0]), and the worst possible locking scheme for it.
_lock.store(false, seq_cst); compiles to xchg instead of a plain mov store. It has to wait for the store buffer to drain and is just plain slow¹ (on Skylake for example, microcoded as 8 uops, throughput of one per 23 cycles for many repeated back-to-back operations, in the no-contention case where it's already hot in L1d cache. You didn't specify anything about what hardware you have).

_lock.store(false, std::memory_order_release); does just compile to a plain mov store with no extra barrier instructions. So the reload of _counter can happen in parallel with it (although branch prediction + speculative execution makes that a non-issue). And more importantly, the next CAS attempt to take the lock can actually try sooner.
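To illustrate, here is a minimal pair of unlock functions (my own sketch; the asm in the comments is what this answer describes for clang/MSVC vs. gcc, and exact registers will vary):

#include <atomic>

std::atomic<bool> lock_flag{ false };

void unlock_seq_cst()
{
    lock_flag.store(false, std::memory_order_seq_cst);
    // clang / MSVC: xchg byte ptr [lock_flag], al   (atomic RMW, full barrier)
    // gcc:          mov  byte ptr [lock_flag], 0    followed by mfence
}

void unlock_release()
{
    lock_flag.store(false, std::memory_order_release);
    // everywhere:   mov  byte ptr [lock_flag], 0    (plain store; x86 stores already have release semantics)
}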
There is hardware arbitration for access to a cache line when multiple cores are hammering on it, presumably with some fairness heuristics, but I don't know if the details are known.
Footnote 1: xchg is not as slow as mov+mfence on some recent CPUs, especially Skylake-derived CPUs. It is the best way to implement a seq_cst pure store on x86. But it's slower than plain mov.
You can completely solve this by having your lock force alternating writer / reader

Writer waits for false, then stores true when it's done. Reader does the reverse. So the writer can never re-enter the critical section without the other thread having had a turn. (When you "wait for a value", do that read-only with a load, not a CAS. A CAS on x86 needs exclusive ownership of the cache line, preventing other threads from reading. With only one reader and one writer, you don't need any atomic RMWs for this to work.)
If you had multiple readers and multiple writers, you could have a 4-state sync variable where a writer tries to CAS it from 0 to 1, then stores 2 when it's done. Readers try to CAS from 2 to 3, then store 0 when done.
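A minimal sketch of that 4-state scheme (my own illustration with made-up names, not code from the Godbolt link below):

#include <atomic>
#include <stdint.h>

enum sync4 : int { EMPTY = 0, WRITING = 1, FULL = 2, READING = 3 };
std::atomic<sync4> state{ EMPTY };
uint64_t slot;  // the single-entry queue

bool try_write(uint64_t v)
{
    sync4 expected = EMPTY;
    if (!state.compare_exchange_strong(expected, WRITING, std::memory_order_acquire))
        return false;  // another writer claimed it, or the data isn't consumed yet
    slot = v;
    state.store(FULL, std::memory_order_release);  // hand off to the readers
    return true;
}

bool try_read(uint64_t &out)
{
    sync4 expected = FULL;
    if (!state.compare_exchange_strong(expected, READING, std::memory_order_acquire))
        return false;  // another reader claimed it, or nothing to read yet
    out = slot;
    state.store(EMPTY, std::memory_order_release);  // hand back to the writers
    return true;
}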
The SPSC (single producer single consumer) case is simple:
#include <atomic>
#include <stdint.h>
#include <x86intrin.h>  // __rdtsc, _mm_pause (use intrin.h on MSVC)

// Compile with e.g. -DSPIN='_mm_pause()' -DORDER=std::memory_order_release

enum lockstates { LK_WRITER = 0, LK_READER = 1, LK_EXIT = 2 };
std::atomic<lockstates> shared_lock;
uint64_t shared_queue;  // single entry
uint64_t global_total{ 0 }, global_counter{ 0 };
static const uint64_t LIMIT = 1000000;

void writer()
{
    while (1) {
        enum lockstates lk;
        // read-only wait for our turn; no CAS needed with one reader + one writer
        while ((lk = shared_lock.load(std::memory_order_acquire)) != LK_WRITER) {
            if (lk == LK_EXIT)
                return;
            else
                SPIN;  // _mm_pause() or empty
        }
        //_vec.push_back(__rdtsc());
        shared_queue = __rdtsc();
        shared_lock.store(LK_READER, ORDER);  // seq_cst or release
    }
}

void reader()
{
    uint64_t total = 0, counter = 0;
    while (1) {
        enum lockstates lk;
        while ((lk = shared_lock.load(std::memory_order_acquire)) != LK_READER) {
            SPIN;  // _mm_pause() or empty
        }
        const uint64_t latency = __rdtsc() - shared_queue;  // _vec[0];
        //_vec.clear();
        total += latency;
        ++counter;
        if (counter < LIMIT) {
            shared_lock.store(LK_WRITER, ORDER);
        } else {
            break;  // must avoid storing LK_WRITER right before LK_EXIT, otherwise the writer races and can overwrite it with LK_READER
        }
    }
    global_total = total;
    global_counter = counter;
    shared_lock.store(LK_EXIT, ORDER);
}
Full version on Godbolt. On my i7-6700k Skylake desktop (2-core turbo = 4200MHz, TSC = 4008MHz), compiled with clang++ 9.0.1 -O3. Data is pretty noisy, as expected; I did a bunch of runs and manually selected a low and high point, ignoring some real outlier highs that were probably due to warm-up effects.
On separate physical cores:
- -DSPIN='_mm_pause()' -DORDER=std::memory_order_release: ~180 to ~210 cycles / op, basically zero machine_clears.memory_ordering (like 19 total over 1000000 ops, thanks to pause in the spin-wait loop).
- -DSPIN='_mm_pause()' -DORDER=std::memory_order_seq_cst: ~195 to ~215 ref cycles / op, same near-zero machine clears.
- -DSPIN='' -DORDER=std::memory_order_release: ~195 to ~225 ref c/op, 9 to 10 M/sec machine clears without pause.
- -DSPIN='' -DORDER=std::memory_order_seq_cst: more variable and slower, ~250 to ~315 c/op, 8 to 10 M/sec machine clears without pause.
These timings are about 3x faster than your seq_cst "fast" original on my system. Using std::vector<> instead of a scalar might account for ~4 cycles of that; I think there was a slight effect when I replaced it. Maybe just random noise, though. 200 / 4.008GHz is about 50ns inter-core latency, which sounds about right for a quad-core "client" chip.
From the best version (mo_release, spinning on pause to avoid machine clears):
$ clang++ -Wall -g -DSPIN='_mm_pause()' -DORDER=std::memory_order_release -O3 inter-thread.cpp -pthread &&
perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r4 ./a.out
195 ref cycles per op. total ticks: 195973463 / 1000000 ops
189 ref cycles per op. total ticks: 189439761 / 1000000 ops
193 ref cycles per op. total ticks: 193271479 / 1000000 ops
198 ref cycles per op. total ticks: 198413469 / 1000000 ops
Performance counter stats for './a.out' (4 runs):
199.83 msec task-clock:u # 1.985 CPUs utilized ( +- 1.23% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
128 page-faults # 0.643 K/sec ( +- 0.39% )
825,876,682 cycles:u # 4.133 GHz ( +- 1.26% )
10,680,088 branches:u # 53.445 M/sec ( +- 0.66% )
44,754,875 instructions:u # 0.05 insn per cycle ( +- 0.54% )
106,208,704 uops_issued.any:u # 531.491 M/sec ( +- 1.07% )
78,593,440 uops_executed.thread:u # 393.298 M/sec ( +- 0.60% )
19 machine_clears.memory_ordering # 0.094 K/sec ( +- 3.36% )
0.10067 +- 0.00123 seconds time elapsed ( +- 1.22% )
And from the worst version (mo_seq_cst, no pause): the spin-wait loop spins faster so branches and uops issued/executed are much higher, but actual useful throughput is somewhat worse.
$ clang++ -Wall -g -DSPIN='' -DORDER=std::memory_order_seq_cst -O3 inter-thread.cpp -pthread &&
perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r4 ./a.out
280 ref cycles per op. total ticks: 280529403 / 1000000 ops
215 ref cycles per op. total ticks: 215763699 / 1000000 ops
282 ref cycles per op. total ticks: 282170615 / 1000000 ops
174 ref cycles per op. total ticks: 174261685 / 1000000 ops
Performance counter stats for './a.out' (4 runs):
207.82 msec task-clock:u # 1.985 CPUs utilized ( +- 4.42% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
130 page-faults # 0.623 K/sec ( +- 0.67% )
857,989,286 cycles:u # 4.129 GHz ( +- 4.57% )
236,364,970 branches:u # 1137.362 M/sec ( +- 2.50% )
630,960,629 instructions:u # 0.74 insn per cycle ( +- 2.75% )
812,986,840 uops_issued.any:u # 3912.003 M/sec ( +- 5.98% )
637,070,771 uops_executed.thread:u # 3065.514 M/sec ( +- 4.51% )
1,565,106 machine_clears.memory_ordering # 7.531 M/sec ( +- 20.07% )
0.10468 +- 0.00459 seconds time elapsed ( +- 4.38% )
Pinning both reader and writer to the two logical cores of one physical core speeds it up a lot: on my system, cores 3 and 7 are siblings, so Linux taskset -c 3,7 ./a.out stops the kernel from scheduling them anywhere else: 33 to 39 ref cycles per op, or 80 to 82 without pause.

(See What will be used for data exchange between threads are executing on one Core with HT?)
$ clang++ -Wall -g -DSPIN='_mm_pause()' -DORDER=std::memory_order_release -O3 inter-thread.cpp -pthread &&
taskset -c 3,7 perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r4 ./a.out
39 ref cycles per op. total ticks: 39085983 / 1000000 ops
37 ref cycles per op. total ticks: 37279590 / 1000000 ops
36 ref cycles per op. total ticks: 36663809 / 1000000 ops
33 ref cycles per op. total ticks: 33546524 / 1000000 ops
Performance counter stats for './a.out' (4 runs):
89.10 msec task-clock:u # 1.942 CPUs utilized ( +- 1.77% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
128 page-faults # 0.001 M/sec ( +- 0.45% )
365,711,339 cycles:u # 4.104 GHz ( +- 1.66% )
7,658,957 branches:u # 85.958 M/sec ( +- 0.67% )
34,693,352 instructions:u # 0.09 insn per cycle ( +- 0.53% )
84,261,390 uops_issued.any:u # 945.680 M/sec ( +- 0.45% )
71,114,444 uops_executed.thread:u # 798.130 M/sec ( +- 0.16% )
16 machine_clears.memory_ordering # 0.182 K/sec ( +- 1.54% )
0.04589 +- 0.00138 seconds time elapsed ( +- 3.01% )
On logical cores sharing the same physical core: best case ~5x lower latency than between cores, again for pause + mo_release, although the actual benchmark only completes in about 40% of the time, not 20%:
- -DSPIN='_mm_pause()' -DORDER=std::memory_order_release: ~33 to ~39 ref cycles / op, near-zero machine_clears.memory_ordering.
- -DSPIN='_mm_pause()' -DORDER=std::memory_order_seq_cst: ~111 to ~113 ref cycles / op, 19 total machine clears. Surprisingly the worst!
- -DSPIN='' -DORDER=std::memory_order_release: ~81 to ~84 ref cycles / op, ~12.5 M machine clears / sec.
- -DSPIN='' -DORDER=std::memory_order_seq_cst: ~94 to ~96 c/op, 5 M/sec machine clears without pause.
All of these tests are with clang++, which uses xchg for seq_cst stores. g++ uses mov+mfence, which is slower in the pause cases, and faster without pause and with fewer machine clears (for the hyperthread case). Usually pretty similar for the separate-cores case with pause, but faster in the separate-cores seq_cst without-pause case. (Again, on Skylake specifically, for this one test.)
More investigation of the original version:
Also worth checking perf counters for machine_clears.memory_ordering (see Why flush the pipeline for Memory Order Violation caused by other logical processors?).
I did check on my Skylake i7-6700k, and there wasn't a significant difference in the rate of machine_clears.memory_ordering per second (about 5M / sec for both the fast seq_cst and the slow release), at 4.2GHz.
The "cycles per op" result is surprisingly consistent for the seq_cst version (400 to 422). My CPU's TSC reference frequency is 4008MHz, actual core frequency 4200MHz at max turbo. I assume your CPU's max turbo is a higher relative to its reference frequency than mine if you got 340-380 cycle. And/or a different microarchitecture.
But I found wildly varying results for the mo_release version: with GCC 9.3.0 -O3 on Arch GNU/Linux, 5790 for one run, 2269 for another. With clang 9.0.1 -O3, 73346 and 7333 for two runs (yes, really a factor of 10). That's a surprise. Neither version is making system calls to free/allocate memory when emptying/pushing the vector, and I'm not seeing a lot of memory-ordering machine clears from the clang version. With your original LIMIT, two runs with clang showed 1394 and 22101 cycles per op.
With clang++, even the seq_cst times vary somewhat more than with GCC, and are higher, like 630 to 700. (g++ uses mov+mfence for seq_cst pure stores; clang++ uses xchg, like MSVC does.)
Other perf counters with mo_release show similar rates of instructions, branches, and uops per second, so I think that's an indication that the code is just spending more time spinning its wheels with the wrong thread in the critical section and the other stuck retrying.
Two perf runs, first is mo_release, second is mo_seq_cst.
$ clang++ -DORDER=std::memory_order_release -O3 inter-thread.cpp -pthread &&
perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r1 ./a.out
27989 cycles per op
Performance counter stats for './a.out':
16,350.66 msec task-clock:u # 2.000 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
231 page-faults # 0.014 K/sec
67,412,606,699 cycles:u # 4.123 GHz
697,024,141 branches:u # 42.630 M/sec
3,090,238,185 instructions:u # 0.05 insn per cycle
35,317,247,745 uops_issued.any:u # 2159.989 M/sec
17,580,390,316 uops_executed.thread:u # 1075.210 M/sec
125,365,500 machine_clears.memory_ordering # 7.667 M/sec
8.176141807 seconds time elapsed
16.342571000 seconds user
0.000000000 seconds sys
$ clang++ -DORDER=std::memory_order_seq_cst -O3 inter-thread.cpp -pthread &&
perf stat --all-user -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,machine_clears.memory_ordering -r1 ./a.out
779 cycles per op
Performance counter stats for './a.out':
875.59 msec task-clock:u # 1.996 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
137 page-faults # 0.156 K/sec
3,619,660,607 cycles:u # 4.134 GHz
28,100,896 branches:u # 32.094 M/sec
114,893,965 instructions:u # 0.03 insn per cycle
1,956,774,777 uops_issued.any:u # 2234.806 M/sec
1,030,510,882 uops_executed.thread:u # 1176.932 M/sec
8,869,793 machine_clears.memory_ordering # 10.130 M/sec
0.438589812 seconds time elapsed
0.875432000 seconds user
0.000000000 seconds sys
I modified your code with the memory order as a CPP macro, so you can compile with -DORDER=std::memory_order_release to get the slow version. acquire vs. seq_cst doesn't matter here; it compiles to the same asm on x86 for loads and atomic RMWs. Only pure stores need special asm for seq_cst.
Also, you left out stdint.h and intrin.h (MSVC) / x86intrin.h (everything else). The fixed version is on Godbolt with clang and MSVC. Earlier I bumped up LIMIT by a factor of 10 to make sure the CPU frequency had time to ramp up to max turbo for most of the timed region, but reverted that change so testing mo_release would only take seconds, not minutes.
Setting LIMIT to check for a certain total of TSC cycles might help it exit in a more consistent time. That still wouldn't count time where the writer is locked out, but on the whole it should make runs that take an extremely long time less likely.
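For example, a sketch of that idea (RUN_TICKS is a made-up tuning constant, and the loop body stands in for one handoff from the loops above):

#include <stdint.h>
#include <x86intrin.h>  // __rdtsc (use intrin.h on MSVC)

static const uint64_t RUN_TICKS = 4000000000ULL;  // ~1 second at a 4GHz TSC

uint64_t timed_reader_loop()
{
    const uint64_t start = __rdtsc();
    uint64_t counter = 0;
    while (__rdtsc() - start < RUN_TICKS) {
        // ... do one producer/consumer handoff and accumulate latency here ...
        ++counter;
    }
    return counter;  // ops completed in a roughly fixed wall-clock window
}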
You also have a lot of very over-complicated stuff going on if you're just trying to measure inter-thread latency. (See How does the communication between CPU happen?)
You have both threads reading a _total that the writer updates every time, instead of just storing a flag when it's all done. So the writer has potential memory-ordering machine clears from reading that variable written by another thread.

You also have an atomic RMW increment of _counter in the reader, even though that variable is private to the reader. It could be a plain non-atomic global that you read after reader.join(), or even better a local variable that you only store to a global after the loop. (A plain non-atomic global would probably still end up getting stored to memory every iteration instead of kept in a register, because of the release stores. And since this is a tiny program, all the globals are probably next to each other, and likely in the same cache line.)
std::vector is also unnecessary. __rdtsc() is not going to be zero unless it wraps around the 64-bit counter², so you can just use 0 as a sentinel value in a scalar uint64_t to mean empty. Or if you fix your locking so the reader can't re-enter the critical section without the writer having a turn, you can remove that check. A sketch of the sentinel version follows.
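Here the slot itself is the flag, so no separate lock variable is needed (again my own illustration, for the SPSC case):

#include <atomic>
#include <stdint.h>
#include <x86intrin.h>  // __rdtsc

std::atomic<uint64_t> slot{ 0 };  // 0 = empty, nonzero = a timestamp to consume

void produce_one()
{
    while (slot.load(std::memory_order_relaxed) != 0) {}  // wait until consumed
    slot.store(__rdtsc(), std::memory_order_release);
}

uint64_t consume_one()
{
    uint64_t ts;
    while ((ts = slot.load(std::memory_order_acquire)) == 0) {}  // wait for data
    slot.store(0, std::memory_order_release);  // mark empty for the producer
    return ts;
}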
Footnote 2: For a ~4GHz TSC reference frequency, that's 2^64 / (4·10^9) seconds, close enough to 2^32 seconds ≈ 136 years to wrap around the TSC. Note that the TSC reference frequency is not the current core clock frequency; it's fixed to some value for a given CPU, usually close to the rated "sticker" frequency, not max turbo.
Also, names with a leading _ are reserved at global scope in ISO C++. Don't use them for your own variables. (And generally not anywhere. You can use a trailing underscore instead if you really want.)
Source: https://stackoverflow.com/questions/61649951/c-latency-increases-when-memory-ordering-is-relaxed