Question
Another question about L2/L3 caches explained that L3 can be used for inter-process communication (IPC).
Are there other methods/pathways for this communication to happen?
The reason it seems there must be other pathways is that Intel nearly halved the amount of L3 cache per core in its newest processor lineup (1.375 MiB per core in SKL-X) versus previous generations (2.5 MiB per core in Broadwell-EP).
Per-core private L2 increased from 256 KiB to 1 MiB, though.
Answer 1:
There are inter-processor interrupts (IPIs), but that's not new, and not used directly by normal multi-threaded software. The kernel might use an IPI to wake another core from low-power sleep, or maybe to notify it that a high-priority task became runnable after a task on this CPU released an OS-assisted lock / mutex that other tasks were waiting for.
So really no, there are no other pathways.
Reduced size means you have to design your software to reuse data sooner if you want it to still be hot in L3 when a consumer thread gets to it. But note that it's unlikely that the only data in L3 is data that was written by one core and will next be read by another; most multi-threaded workloads involve plenty of private data, too. Also note that SKX L3 is not inclusive, so shared read-only data can stay hot in L2 of the core(s) using it even when it's been evicted from L3.
It would be really nice for developers if L3 were gigantic and fast, but it isn't. Besides the reduced size of L3, the bandwidth and latency are also significantly worse in SKX than in BDW. See @Mysticial's comments about y-cruncher performance:
The L3 cache mesh on Skylake X only has about half the bandwidth of the L3 cache on the previous generation Haswell/Broadwell-EP processors. The Skylake X L3 cache is so slow that it's barely faster than main memory in terms of bandwidth. So for all practical purposes, it's as good as non-existent.
He's not talking about communication between threads, just the amount of useful cache per core for independent threads. But AFAIK, a producer/consumer model should be pretty similar.
From the software optimization standpoint, the cache bottleneck brings a new set of difficulties. The L2 cache is fine. It is 4x larger than before and has doubled in bandwidth to keep up with the AVX512. But the L3 is useless. The net effect is that the usable cache per core is halved compared to the previous Haswell/Broadwell generations. Furthermore, doubling of the SIMD size with AVX512 makes the usable cache 4x smaller than before in terms of # of SIMD words that fit in cache.
Given all that, it may not make a huge difference whether producer/consumer threads hit in L3 or have to go to main memory. Fortunately, DRAM is pretty fast with high aggregate bandwidth if many threads are active. Single-thread max bandwidth is still lower than in Broadwell, though.
Inter-thread bandwidth benchmark numbers:
SiSoft has an inter-core bandwidth and latency benchmark. Description here.
For a 10-core (20 thread) SKX (i9-7900X CPU @ nominal 3.30GHz), the highest result is an overclocked submission with 4.82GHz cores and 3.2GHz memory, achieving an aggregate(?) bandwidth of 105.84GB/s and latency of 54.9ns.
One of the lowest results is with 4GHz/4.5GHz cores and a 2.4GHz IMC: 66.11GB/s bandwidth, 76.6ns latency. (Scroll to the bottom of the page to see other submissions for the same CPU.)
By comparison, a desktop Skylake i7-6700k (4C 8T 4.21GHz, 4.1GHz IMC) scores 35.51GB/s and 40.5ns. Some more overclocked results are 42.72GB/s and 36.3ns.
For a single pair of threads, I think SKL-desktop is faster than SKX; this benchmark appears to be measuring aggregate bandwidth across all 20 threads on the 10C/20T CPU.
This single-threaded benchmark shows only about 20GB/s for SKL-X for block sizes from 2MB to 8MB, pretty much exactly the same as main memory bandwidth. The Kaby Lake quad-core i7-7700k on the graph looks like maybe 60GB/s. It's not plausible that inter-thread bandwidth is higher than single-thread bandwidth for the SKX, unless SiSoft Sandra is counting loads + stores for the inter-thread case. (Single-thread bandwidth tends to suck on Intel many-core CPUs: see the "latency-bound platform" section of this answer. Higher L3 latency means bandwidth is limited by the number of outstanding L1 or L2 misses / prefetch requests.)
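(As a rough back-of-the-envelope illustration of that last point, with assumed numbers rather than anything from the benchmark: concurrency-limited bandwidth is about outstanding lines × line size / latency. With ~10 outstanding 64-byte lines, that's 640 B / 70 ns ≈ 9 GB/s, but 640 B / 36 ns ≈ 18 GB/s; halving the latency roughly doubles achievable single-thread bandwidth. Hardware prefetch into L2 raises the effective concurrency, so real numbers are higher, but the latency scaling is the point.)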
Another complication is that when running with hyperthreading enabled, some inter-thread communication may happen through L1D / L2 if the block size is small enough. See What will be used for data exchange between threads are executing on one Core with HT?, and also What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?.
I don't know how that benchmark pins threads to logical cores, and whether they try to avoid or maximize communication between logical cores of the same physical core.
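In your own experiments, pinning is easy to control, though. Here's a minimal Linux-only sketch using pthread_setaffinity_np (a real glibc API, but this example is mine, not the benchmark's). The logical-core numbers are placeholders: whether logical cores 0 and 1 are hyper-siblings of one physical core or sit on separate cores depends on the machine's numbering, so check /proc/cpuinfo or lstopo first.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for cpu_set_t, CPU_ZERO / CPU_SET, pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin the calling thread to a single logical core.
// (Error checking omitted for brevity.)
static void pin_to_logical_core(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

int main() {
    // Placeholder core numbers: on many Intel machines siblings are
    // (i, i + n_physical_cores), not (0, 1). Verify before measuring.
    std::thread producer([] { pin_to_logical_core(0); /* ... write data ... */ });
    std::thread consumer([] { pin_to_logical_core(1); /* ... read data ... */ });
    producer.join();
    consumer.join();
}
```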
When designing a multi-threaded application, aim for memory locality within each thread. Try to avoid passing huge blocks of memory between threads, because that's less efficient even in previous CPUs. SKL-AVX512 aka SKL-SP aka SKL-X aka SKX just makes it worse than before.
Synchronize between threads with flag variables or progress counters.
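For example, a single-producer / single-consumer hand-off can publish progress through one atomic counter: the producer stores with release ordering after writing a chunk, and the consumer loads with acquire ordering before reading it. Consuming chunk-by-chunk also means the data is more likely to still be hot in cache. This is a minimal sketch of that idea (my code, not from the answer); the sizes and the placeholder work functions are made up.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t N     = 1 << 20;  // total elements (example value)
constexpr std::size_t CHUNK = 1 << 12;  // elements per hand-off (example value)

static float produce_one(std::size_t i) { return float(i) * 0.5f; } // placeholder work
static void  consume_one(float) {}                                  // placeholder work

std::vector<float> buf(N);
std::atomic<std::size_t> produced{0};   // progress counter: elements ready

void producer() {
    for (std::size_t i = 0; i < N; ++i) {
        buf[i] = produce_one(i);
        if ((i + 1) % CHUNK == 0 || i + 1 == N)                // publish a chunk:
            produced.store(i + 1, std::memory_order_release);  // release pairs with
    }                                                          // the acquire below
}

void consumer() {
    std::size_t done = 0;
    while (done < N) {
        // Spin until more data is published; real code might back off or sleep.
        std::size_t avail = produced.load(std::memory_order_acquire);
        for (; done < avail; ++done)
            consume_one(buf[done]);
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}
```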
If memory bandwidth between threads is your biggest bottleneck, you should consider just doing the work in the producer thread (especially on the fly as the data is being written, rather than in separate passes), instead of using a separate thread at all; i.e. maybe one of the boundaries between threads is not in an ideal place in your design.
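As a toy illustration of "on the fly" versus separate passes (placeholder work, my sketch rather than anything from the answer): the two-pass version re-reads the whole buffer, paying cache/DRAM bandwidth again if it doesn't fit in cache, while the fused version consumes each value while it's still in a register.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Separate passes: write N floats, then re-read all N. With a buffer
// bigger than cache, the second pass pays memory bandwidth a second time.
float two_pass(std::vector<float>& buf) {
    for (std::size_t i = 0; i < buf.size(); ++i)
        buf[i] = float(i) * 0.5f;          // pass 1: produce
    float sum = 0.f;
    for (float x : buf) sum += x;          // pass 2: consume
    return sum;
}

// Fused: consume each value as it's produced; the buffer is touched once
// (and might not be needed at all if nothing else reads it later).
float fused(std::vector<float>& buf) {
    float sum = 0.f;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        float x = float(i) * 0.5f;         // produce
        buf[i] = x;
        sum += x;                          // consume on the fly
    }
    return sum;
}

int main() {
    std::vector<float> buf(1 << 20);       // example size
    std::printf("%f %f\n", two_pass(buf), fused(buf));
}
```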
Real life software design is complicated, and sometimes you end up having to choose between poor options.
Hardware design is complicated, too, with lots of tradeoffs. SKX's L3 cache + mesh appear to do worse than the old ring-bus setup for medium core-count chips; presumably it's a win for the biggest chips for some kinds of workloads. Hopefully future generations will have better single-core latency / bandwidth.
Source: https://stackoverflow.com/questions/46131651/how-does-the-communication-between-cpu-happen