We are trying to use the Intel CLFLUSH instruction to flush a process's cache contents from userspace on Linux.
We create a very simple C program that first acces
This doesn't explain the knee in the read-only graph, but does explain why it doesn't plateau.
I didn't get around to testing locally to look into the difference between the hot and cold cache cases, but I did come across a performance number for clflush:
This AIDA64 instruction latency/throughput benchmark repository lists a single-socket Haswell-E CPU (i7-5820K) as having a clflush throughput of one per ~99.08 cycles. It doesn't say whether that's for the same address repeatedly, or what.
So clflush isn't anywhere near free even when it doesn't have to do any work. It's still a microcoded instruction, not heavily optimized because it's usually not a big part of the CPU's workload.
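If you want to check a number like that on your own machine, here is a minimal sketch of how it could be measured. It is not the AIDA64 method (which isn't documented here), just one reasonable approach. The names flush_range and cycles_per_clflush are made up for this example, and it assumes GCC or Clang on x86-64 with 64-byte cache lines. Note that rdtsc counts reference cycles, not core clock cycles, so the result only matches "cycles" when the core is running at the TSC frequency.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence, _mm_lfence */

/* Assumption: 64-byte cache lines, as on current x86 CPUs. */
#define CACHE_LINE 64

/* Flush every cache line overlapping [buf, buf + len).
 * Returns the number of clflush instructions issued. */
static size_t flush_range(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t n = 0;

    for (; p < end; p += CACHE_LINE, n++)
        _mm_clflush((const void *)p);
    _mm_mfence();        /* wait for the flushes before later memory ops */
    return n;
}

/* Average TSC ticks per clflush over `reps` passes across the buffer.
 * After the first pass the lines are no longer cached, so this mostly
 * measures the "nothing to do" cost of the instruction. */
static double cycles_per_clflush(const void *buf, size_t len, unsigned reps)
{
    size_t lines = 0;

    _mm_lfence();        /* keep earlier work out of the timed region */
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    for (unsigned i = 0; i < reps; i++)
        lines += flush_range(buf, len);
    _mm_lfence();
    uint64_t t1 = __rdtsc();

    return lines ? (double)(t1 - t0) / (double)lines : 0.0;
}
```

Calling cycles_per_clflush(buf, 4096, 1000) on a 64-byte-aligned buffer times 64,000 flushes. To look at the hot-cache case instead, you would touch each line again between passes, outside the timed region.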
Skylake is getting ready for that to change, with support for persistent memory connected to the memory controller. On Skylake (i5-6400T), measured throughput was:
clflush: one per ~66.42 cycles
clflushopt: one per ~56.33 cycles
Perhaps clflushopt is more of a win when some of the lines are actually dirty cache that needs flushing, maybe when L3 is busy from other cores doing the same thing. Or maybe they just want to get software using the weakly-ordered version ASAP, before making even bigger improvements to throughput. It's ~15% faster in this case, which is not bad.
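For anyone who wants to try the weakly-ordered version, here is a rough sketch of using clflushopt with a fallback to clflush. The function names (cpu_has_clflushopt, flush_range_weak, and the two helpers) are hypothetical, and it assumes GCC or Clang on x86-64 with 64-byte lines. Because clflushopt isn't ordered against other flushes or against stores to other lines, the loop is followed by an sfence before anything relies on the flushes having happened.

```c
#include <stddef.h>
#include <stdint.h>
#include <cpuid.h>       /* __get_cpuid_count (GCC/Clang helper) */
#include <x86intrin.h>   /* _mm_clflush, _mm_clflushopt, _mm_sfence */

#define LINE_BYTES 64    /* assumption: 64-byte cache lines */

/* CPUID.(EAX=7,ECX=0):EBX bit 23 reports CLFLUSHOPT support. */
static int cpu_has_clflushopt(void)
{
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ebx >> 23) & 1u);
}

/* Only this function is compiled with CLFLUSHOPT enabled, so the rest
 * of the program still runs on CPUs that lack the instruction. */
__attribute__((target("clflushopt")))
static size_t flush_lines_opt(uintptr_t p, uintptr_t end)
{
    size_t n = 0;

    for (; p < end; p += LINE_BYTES, n++)
        _mm_clflushopt((void *)p);
    return n;
}

/* Fallback for CPUs without CLFLUSHOPT. */
static size_t flush_lines_legacy(uintptr_t p, uintptr_t end)
{
    size_t n = 0;

    for (; p < end; p += LINE_BYTES, n++)
        _mm_clflush((const void *)p);
    return n;
}

/* Flush [buf, buf + len), preferring clflushopt when available.
 * Returns the number of cache lines flushed. */
static size_t flush_range_weak(void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(LINE_BYTES - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t n = cpu_has_clflushopt() ? flush_lines_opt(p, end)
                                    : flush_lines_legacy(p, end);

    _mm_sfence();        /* order the weakly-ordered flushes */
    return n;
}
```

The point of the weak ordering is that flushes of different lines can overlap, so any gain should show up on ranges spanning many lines rather than on a single line flushed repeatedly.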