We are trying to use the Intel CLFLUSH instruction to flush a process's cached data on Linux, from userspace.
We create a very simple C program that first acces
You may want to look at the new optimization guide for Skylake. Intel came out with another version of CLFLUSH, called CLFLUSHOPT, which is weakly ordered and should perform much better in your scenario.
See section 7.5.7 in here - http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
In general, CLFLUSHOPT throughput is higher than that of CLFLUSH, because CLFLUSHOPT orders itself with respect to a smaller set of memory traffic as described above and in Section 7.5.6. The throughput of CLFLUSHOPT will also vary. When using CLFLUSHOPT, flushing modified cache lines will experience a higher cost than flushing cache lines in non-modified states. CLFLUSHOPT will provide a performance benefit over CLFLUSH for cache lines in any coherence states. CLFLUSHOPT is more suitable to flush large buffers (e.g. greater than many KBytes), compared to CLFLUSH. In single-threaded applications, flushing buffers using CLFLUSHOPT may be up to 9X better than using CLFLUSH with Skylake microarchitecture.
The section also explains that flushing modified data is slower, which obviously comes from the writeback penalty.
As for the increasing latency, are you measuring the overall time it takes to go over the address range and clflush each line? In that case you're linearly dependent on the array size, even once it passes the LLC size. Even if the lines aren't there, the clflush still has to be processed by the execution engine and memory unit, and look up the entire cache hierarchy for each line, even when the line isn't present.
This doesn't explain the knee in the read-only graph, but does explain why it doesn't plateau.
I didn't get around to testing locally to look into the difference between the hot and cold cache case, but I did come across a performance number for clflush:

This AIDA64 instruction latency/throughput benchmark repository lists a single-socket Haswell-E CPU (i7-5820K) as having a clflush throughput of one per ~99.08 cycles. It doesn't say whether that's for the same address repeatedly, or what.
So clflush isn't anywhere near free even when it doesn't have to do any work. It's still a microcoded instruction, not heavily optimized, because it's usually not a big part of the CPU's workload.
Skylake is getting ready for that to change, with support for persistent memory connected to the memory controller. On Skylake (i5-6400T), measured throughput was:

clflush: one per ~66.42 cycles
clflushopt: one per ~56.33 cycles

Perhaps clflushopt is more of a win when some of the lines are actually dirty cache that needs flushing, or maybe when L3 is busy from other cores doing the same thing. Or maybe they just want to get software using the weakly-ordered version ASAP, before making even bigger improvements to throughput. It's ~15% faster in this case, which is not bad.