How can I prefetch infrequently used code?

Submitted by 拥有回忆 on 2019-11-30 23:12:45

The answer depends on your CPU architecture.

That said, if you are using gcc or clang, you can use the __builtin_prefetch built-in function to ask the compiler to emit a prefetch instruction. On Pentium 3 and later x86-type architectures, this generates a PREFETCHh instruction, which requests a load into the data cache hierarchy. Since these architectures have unified L2 and higher-level caches, it may help for code as well.

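The function looks like this (rw and locality are optional; rw is 0 for a read, which is the default, or 1 for a write):

void __builtin_prefetch(const void *address, int rw, int locality);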

The locality argument (the last one) must be a compile-time constant in the range 0 to 3. On x86, gcc and clang map it onto the h part of the PREFETCHh instruction in reverse order: 3 gives PREFETCHT0, 2 gives PREFETCHT1, 1 gives PREFETCHT2 and 0 gives PREFETCHNTA. You want to pass 1 or 2, which ask for the data to be loaded into the L2 and higher-level caches. See the PREFETCHh entry in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference, M-Z, page 4-277.
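As a rough illustration of the idea above, the sketch below issues read prefetches with locality 2 over the first bytes of a rarely used function. It uses the three-argument form documented by gcc (address, rw, locality). The names rare_error_path, prefetch_range and warm_rare_path are hypothetical, the 64-byte cache line is an assumption, and C gives no way to ask how long a function's code is, so the byte count is a guess the caller has to supply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes is common on x86 but not guaranteed. */
#define CACHE_LINE 64

/* Hypothetical rarely executed routine whose code we want warmed. */
int rare_error_path(int code)
{
    return code * 2 + 1;
}

/* Issue a read prefetch (rw = 0) with locality 2 for each assumed cache
 * line in [start, start + len). A prefetch is only a hint: the CPU may
 * ignore it. Returns the number of hints issued. */
size_t prefetch_range(const void *start, size_t len)
{
    assert(start != NULL);
    const char *p = (const char *)start;
    size_t issued = 0;
    for (size_t off = 0; off < len; off += CACHE_LINE) {
        __builtin_prefetch(p + off, 0, 2);
        issued++;
    }
    return issued;
}

/* Warm the first `bytes` bytes starting at rare_error_path's entry point.
 * Converting a function pointer to an object pointer via uintptr_t is
 * implementation-defined; gcc and clang accept it on mainstream targets. */
size_t warm_rare_path(size_t bytes)
{
    return prefetch_range((const void *)(uintptr_t)rare_error_path, bytes);
}
```

A caller would invoke something like warm_rare_path(256) some time before the rare path is expected to run. Whether this helps at all has to be measured, since the lines land in a unified cache level at best, not in the L1 instruction cache.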

If you're using another compiler that doesn't have __builtin_prefetch, see whether it provides the _mm_prefetch intrinsic. You may need to include a header file to get it. For example, on OS X, that function and the constants for its hint argument (such as _MM_HINT_T1) are declared in xmmintrin.h.
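One way to hide the difference between the two is a small wrapper, sketched below. The preprocessor checks are a heuristic assumption about which compilers ship xmmintrin.h, and prefetch_hint_t1, rarely_called and warm_function are hypothetical names introduced only for this example. The 64-byte line size is likewise assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Heuristic: SSE-capable x86 targets normally provide xmmintrin.h. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>  /* declares _mm_prefetch and the _MM_HINT_* constants */
#  define HAVE_MM_PREFETCH 1
#else
#  define HAVE_MM_PREFETCH 0
#endif

/* Ask for the line containing p to be brought into L2 or an outer cache. */
static inline void prefetch_hint_t1(const void *p)
{
#if HAVE_MM_PREFETCH
    _mm_prefetch((const char *)p, _MM_HINT_T1);
#elif defined(__GNUC__)
    __builtin_prefetch(p, 0, 2);  /* gcc/clang fallback on non-SSE targets */
#else
    (void)p;                      /* no known intrinsic: do nothing */
#endif
}

/* Hypothetical rarely called routine, used only for demonstration. */
int rarely_called(int x)
{
    return x + 41;
}

/* Hint `lines` consecutive assumed-64-byte lines starting at fn's entry
 * point. Returns the number of hints issued. */
size_t warm_function(int (*fn)(int), size_t lines)
{
    assert(fn != NULL);
    const char *p = (const char *)(uintptr_t)fn;
    for (size_t i = 0; i < lines; i++)
        prefetch_hint_t1(p + i * 64);
    return lines;
}
```

For example, warm_function(rarely_called, 4) would hint the first 256 bytes at the function's address. On a target with neither intrinsic the wrapper compiles to nothing, which keeps the call sites unchanged.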

There isn't any (official [1]) x86 instruction to prefetch code, only data. I find this a rather bizarre use case: the code path is known beforehand but executes rarely, and yet there is supposedly a significant benefit in prefetching it. It would be great to understand how you came to that conclusion, since it requires not only showing that the code is significantly slower when it has not been hit for a long time, but also determining that there are spare bus cycles to load the code before the processor fetches it by its normal mechanism for loading code.

You may be able to use the prefetch instructions that fetch into L2, which is typically shared between I- and D-cache.

[1] I know there are some "secret" instructions that allow the processor to manipulate cache content, but those would require a lot of extra work, even if you could use them in user-mode code (and I expect this is not kernel-mode code).
