How does CLFLUSH work for an address that is not in cache yet?

天命终不由人 2021-02-05 13:16

We are trying to use the Intel CLFLUSH instruction to flush a process's cached data from userspace on Linux.

We create a very simple C program that first acces

2 answers
  •  小鲜肉 (original poster)
    2021-02-05 14:06

    This doesn't explain the knee in the read-only graph, but does explain why it doesn't plateau.


    I didn't get around to testing locally to look into the difference between the hot-cache and cold-cache cases, but I did come across a performance number for clflush:

    This AIDA64 instruction latency/throughput benchmark repository lists a single-socket Haswell-E CPU (i7-5820K) as having a clflush throughput of one per ~99.08 cycles. It doesn't say whether that figure is for flushing the same address repeatedly or for some other access pattern.

    So clflush isn't anywhere near free even when it doesn't have to do any work. It's still a microcoded instruction, not heavily optimized because it's usually not a big part of the CPU's workload.

    Skylake is getting ready for that to change, with support for persistent memory connected to the memory controller. On Skylake (i5-6400T), measured throughput was:

    • clflush: one per ~66.42 cycles
    • clflushopt: one per ~56.33 cycles

    Perhaps clflushopt is more of a win when some of the lines are actually dirty and need writing back, or when L3 is busy with other cores doing the same thing. Or maybe they just want to get software using the weakly-ordered version ASAP, before making even bigger improvements to throughput. It takes ~15% fewer cycles per instruction in this case, which is not bad.
