How does CLFLUSH work for an address that is not in cache yet?

天命终不由人 2021-02-05 13:16

We are trying to use the Intel CLFLUSH instruction to flush a process's cache contents from userspace on Linux.

We create a very simple C program that first acces

2 answers

小鲜肉 (original poster)

2021-02-05 14:06
This doesn't explain the knee in the read-only graph, but does explain why it doesn't plateau.

I didn't get around to testing locally to look into the difference between the hot-cache and cold-cache cases, but I did come across a performance number for clflush:

This AIDA64 instruction latency/throughput benchmark repository lists a single-socket Haswell-E CPU (i7-5820K) as having a clflush throughput of one per ~99.08 cycles. It doesn't say whether that's for the same address repeatedly, or what.

So clflush isn't anywhere near free even when it doesn't have to do any work. It's still a microcoded instruction, not heavily optimized because it's usually not a big part of the CPU's workload.
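
A rough way to check the hot vs. cold difference locally is something like the sketch below. It is not a careful benchmark: `clflush_ticks_per_line` and `LINE_SIZE` are names made up for this example, 64-byte lines are assumed, and `__rdtsc` counts reference (TSC) ticks rather than core clock cycles, so the results are not directly comparable to the AIDA64 figure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>  /* _mm_clflush, _mm_mfence, __rdtsc: GCC/Clang on x86 only */

#define LINE_SIZE 64    /* assumption: 64-byte cache lines */

/* Hypothetical helper: average TSC ticks per clflush over `lines` distinct
 * cache lines. touch != 0 dirties every line right before timing (hot case);
 * touch == 0 leaves every line flushed out before timing (cold case). */
double clflush_ticks_per_line(size_t lines, int touch)
{
    assert(lines > 0);
    char *buf = aligned_alloc(LINE_SIZE, lines * LINE_SIZE);
    if (buf == NULL)
        return -1.0;
    volatile char *vbuf = buf;  /* volatile stores cannot be optimized away */

    /* Write one byte per line so every page is faulted in up front and
     * page faults stay out of the timed loop. */
    for (size_t i = 0; i < lines; i++)
        vbuf[i * LINE_SIZE] = 0;

    /* Evict everything, so the baseline state is "not in cache". */
    for (size_t i = 0; i < lines; i++)
        _mm_clflush(buf + i * LINE_SIZE);
    _mm_mfence();

    if (touch) {
        /* Bring each line back into cache in a dirty state. */
        for (size_t i = 0; i < lines; i++)
            vbuf[i * LINE_SIZE] = 1;
    }

    /* mfence orders the flushes; rdtsc itself is not serializing,
     * so the measurement is only approximate. */
    _mm_mfence();
    uint64_t start = __rdtsc();
    for (size_t i = 0; i < lines; i++)
        _mm_clflush(buf + i * LINE_SIZE);
    _mm_mfence();
    uint64_t stop = __rdtsc();

    free(buf);
    return (double)(stop - start) / (double)lines;
}
```

Calling `clflush_ticks_per_line(1 << 16, 0)` and `clflush_ticks_per_line(1 << 16, 1)` from `main` and printing both values gives the cold vs. hot comparison. Build with something like `gcc -O2`.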

Skylake is getting ready for that to change, with support for persistent memory connected to the memory controller. On Skylake (i5-6400T), measured throughput was:
- clflush: one per ~66.42 cycles
- clflushopt: one per ~56.33 cycles

Perhaps clflushopt is more of a win when some of the lines are actually dirty and need to be written back, or maybe when L3 is busy from other cores doing the same thing. Or maybe they just want to get software using the weakly-ordered version ASAP, before making even bigger improvements to throughput. It takes ~15% fewer cycles per instruction in this case, which is not bad.
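
To pin down what "~15%" means for the two Skylake numbers above, the arithmetic can be written out as a small sketch. The function names are made up for this example; the only inputs are the 66.42 and 56.33 cycle figures quoted above.

```c
#include <stdio.h>

/* Percent fewer cycles per instruction when going from `base` to `faster`. */
double pct_fewer_cycles(double base, double faster)
{
    return 100.0 * (base - faster) / base;
}

/* Percent more instructions per cycle for the same pair of costs. */
double pct_more_throughput(double base, double faster)
{
    return 100.0 * (base / faster - 1.0);
}

/* Print both views of the clflush vs. clflushopt numbers quoted above. */
void print_clflushopt_gain(void)
{
    const double clflush_cycles = 66.42;     /* Skylake i5-6400T, from the post */
    const double clflushopt_cycles = 56.33;  /* Skylake i5-6400T, from the post */

    printf("fewer cycles per flush: %.1f%%\n",
           pct_fewer_cycles(clflush_cycles, clflushopt_cycles));
    printf("more flushes per cycle: %.1f%%\n",
           pct_more_throughput(clflush_cycles, clflushopt_cycles));
}
```

So the ~15% figure is the reduction in cycles per flush; expressed as throughput, clflushopt does roughly 18% more flushes per cycle.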