We are trying to use the Intel CLFLUSH instruction to flush a process's cached data from userspace on Linux.
We create a very simple C program that first acces
Take a look at the new optimization guide for Skylake. Intel introduced another version of CLFLUSH, called CLFLUSHOPT, which is weakly ordered and should perform much better in your scenario.
See section 7.5.7 in here - http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
In general, CLFLUSHOPT throughput is higher than that of CLFLUSH, because CLFLUSHOPT orders itself with respect to a smaller set of memory traffic as described above and in Section 7.5.6. The throughput of CLFLUSHOPT will also vary. When using CLFLUSHOPT, flushing modified cache lines will experience a higher cost than flushing cache lines in non-modified states. CLFLUSHOPT will provide a performance benefit over CLFLUSH for cache lines in any coherence states. CLFLUSHOPT is more suitable to flush large buffers (e.g. greater than many KBytes), compared to CLFLUSH. In single-threaded applications, flushing buffers using CLFLUSHOPT may be up to 9X better than using CLFLUSH with Skylake microarchitecture.
The section also explains that flushing modified data is slower, which obviously comes from the writeback penalty.
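As a rough illustration of the above, here is a minimal sketch of a range flush that uses CLFLUSHOPT when the CPU reports it and falls back to CLFLUSH otherwise. It assumes GCC or Clang on x86-64 and a 64-byte cache line; the helper names (has_clflushopt, flush_range and so on) are made up for this example and are not from the Intel manual.

```c
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang) */
#include <immintrin.h>  /* _mm_clflush, _mm_clflushopt, fences */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte lines, true on current x86 parts */

/* CPUID.(EAX=7,ECX=0):EBX bit 23 reports CLFLUSHOPT support. */
static int has_clflushopt(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ebx >> 23) & 1u);
}

/* Weakly ordered flushes; the trailing SFENCE orders them with later stores. */
__attribute__((target("clflushopt")))
static void flush_range_opt(const void *p, size_t len)
{
    const char *end  = (const char *)p + len;
    const char *line = (const char *)((uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1));
    for (; line < end; line += CACHE_LINE)
        _mm_clflushopt((void *)line);
    _mm_sfence();
}

/* Fallback: plain CLFLUSH, one line at a time. */
static void flush_range_legacy(const void *p, size_t len)
{
    const char *end  = (const char *)p + len;
    const char *line = (const char *)((uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1));
    for (; line < end; line += CACHE_LINE)
        _mm_clflush(line);
    _mm_mfence();
}

/* Pick the variant at run time so the binary still runs on pre-Skylake CPUs. */
static void flush_range(const void *p, size_t len)
{
    if (len == 0)
        return;
    if (has_clflushopt())
        flush_range_opt(p, len);
    else
        flush_range_legacy(p, len);
}
```

Calling flush_range(buf, size) writes back any modified lines before evicting them, so the buffer contents are unchanged afterwards. Whether you actually see a speedup anywhere near the quoted 9X depends on the microarchitecture and on how many lines are dirty.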
As for the increasing latency, are you measuring the overall time it takes to go over the address range and clflush each line? In that case the total is linearly dependent on the array size, even once the array exceeds the LLC size. A clflush for a line that isn't cached still has to be processed by the execution engine and the memory unit, and it still has to look up the entire cache hierarchy for that line.
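If that is what you're measuring, a sketch like the following may help separate the total time from the per-line cost. It assumes GCC or Clang on x86-64, a 64-byte line, and a TSC that ticks at a constant rate; time_flush_cycles and cycles_per_line are hypothetical helper names, not something from your program.

```c
#include <x86intrin.h>  /* __rdtsc, _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Total TSC cycles spent issuing one CLFLUSH per line of buf[0..len). */
static uint64_t time_flush_cycles(const char *buf, size_t len)
{
    _mm_mfence();                        /* drain earlier memory traffic */
    uint64_t start = __rdtsc();
    for (size_t off = 0; off < len; off += CACHE_LINE)
        _mm_clflush(buf + off);
    _mm_mfence();                        /* wait for the flushes to finish */
    return __rdtsc() - start;
}

/* Average cost per line; compare this, not the total, across array sizes. */
static double cycles_per_line(const char *buf, size_t len)
{
    size_t lines = (len + CACHE_LINE - 1) / CACHE_LINE;
    if (lines == 0)
        return 0.0;
    return (double)time_flush_cycles(buf, len) / (double)lines;
}
```

Running time_flush_cycles twice on the same buffer is a quick check: the second pass flushes lines that are no longer cached, yet it still takes time proportional to the number of lines. The total should keep growing with the array size, while cycles_per_line gives you a number you can compare across sizes.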