Why are the user-mode L1 store miss events only counted when there is a store initialization loop?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-01 07:39:38

You didn't flag your OS, but let's assume you are using Linux. This stuff would be different on another OS (and perhaps even within various variants of the same OS).

On a read access to an unmapped page, the kernel page fault handler maps in a system-wide shared zero page, with read-only permissions.

This explains the LoadInit-1U|K columns: even though your init load is striding over a 64 MB virtual area performing loads, only a single physical 4K page filled with zeros is ever mapped, so you get approximately zero cache misses after the first 4 KB, which rounds to zero after your normalization.1

On a write access to an unmapped page, or to the read-only shared zero page, the kernel will map a new unique page on behalf of the process. This new page is guaranteed to be zeroed, so unless the kernel has some known-to-be-zero pages hanging around, this involves zeroing the page (effectively memset(new_page, 0, 4096)) prior to mapping it.

That largely explains the remaining columns except for StoreInit-2U|K. In those cases, even though it seems like the user program is doing all the stores, the kernel ends up doing all of the hard work (except for one store per page) since as the user process faults in each page, the kernel writes zeros to it, which has the side effect of bringing all the pages into the L1 cache. When the fault handler returns, the triggering store and all subsequent stores for that page will hit in the L1 cache.

It still doesn't fully explain StoreInit-2. As clarified in the comments, the K column actually includes the user counts, which explains that column (subtracting out the user counts leaves it at roughly zero for every event, as expected). The remaining confusion is why L2_RQSTS.ALL_RFO is not 1 but some smaller value like 0.53 or 0.68. Maybe the event is undercounting, or there is some micro-architectural effect that we're missing, like a type of prefetch that prevents the RFO (for example, if the line is loaded into the L1 by some type of load operation before the store, the RFO won't occur). You could try to include the other L2_RQSTS events to see if the missing events show up there.

Variations

It doesn't need to be like that on all systems. Certainly other OSes may have different strategies, but even Linux on x86 might behave differently based on various factors.

For example, rather than the 4K zero page, you might get allocated a 2 MiB huge zero page. That would change the benchmark since 2 MiB doesn't fit in L1, so the LoadInit tests will probably show misses in user-space on the first and second loops.

More generally, if you were using huge pages, the page fault granularity would be changed from 4 KiB to 2 MiB, meaning that only a small part of the zeroed page would remain in L1 and L2, so you'd get L1 and L2 misses, as you expected. If your kernel ever implements fault-around for anonymous mappings (or whatever mapping you are using), it could have a similar effect.

Another possibility is that the kernel may zero pages in the background and so have zero pages ready. This would remove the K counts from the tests, since the zeroing doesn't happen during the page fault, and would probably add the expected misses to the user counts. I'm not sure if the Linux kernel ever did this or has the option to do it, but there were patches floating around. Other OSes like BSD have done it.

RFO Prefetchers

About "RFO prefetchers": the RFO prefetchers are not really prefetchers in the usual sense, and they are unrelated to the L1D prefetchers that can be turned off. As far as I know, "RFO prefetching" from the L1D simply refers to sending an RFO request for stores in the store buffer that are approaching the head of the store buffer. Obviously when a store gets to the head of the buffer, it's time to send an RFO, and you wouldn't call that a prefetch - but why not also send requests for the second-from-the-head store, and so on? Those are the RFO prefetches, but they differ from a normal prefetch in that the core knows the address that has been requested: it is not a guess.

There is speculation in the sense that fetching lines other than the current head may be wasted work: if another core sends an RFO for such a line before this core has a chance to write to it, the request was useless and just increased coherency traffic. So there are predictors that may reduce this store buffer prefetch if it fails too often. There may also be speculation in the sense that store buffer prefetch may send requests for junior stores that haven't retired, at the cost of a useless request if the store ends up being on a bad path. I'm not actually sure if current implementations do that.


1 This behavior actually depends on the details of the L1 cache: current Intel VIPT implementations allow multiple virtual aliases of the same single line to live happily in L1. Current AMD Zen implementations use a different approach (micro-tags) which doesn't allow the L1 to logically contain multiple virtual aliases, so I would expect Zen to miss to L2 in this case.
