I want to prefetch some code into the instruction cache. The code path is used infrequently but I need it to be in the instruction cache or at least in L2 for the rare cases that it is used. I have some advance notice of these rare cases. Does _mm_prefetch work for code? Is there a way to get this infrequently used code in cache? For this problem I don't care about portability so even asm would do.
The answer depends on your CPU architecture.
That said, if you are using gcc or clang, you can use the __builtin_prefetch built-in function to ask the compiler to emit a prefetch instruction. On Pentium 3 and later x86-type architectures, this generates a PREFETCHh instruction, which requests a load into the data cache hierarchy. Since these architectures have unified L2 and higher caches, it may help.
The function looks like this:
__builtin_prefetch(const void *address, int rw, int locality);
The rw argument is 0 to prepare for a read or 1 to prepare for a write, so pass 0 here. The locality argument should be in the range 0...3. Both must be compile-time constants. Assuming locality maps directly to the h part of the PREFETCHh instruction, you want to pass 1 or 2, which ask for the data to be loaded into the L2 and higher caches. See Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference, M-Z, page 4-277.
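As a rough sketch of how that could be applied to code rather than data, the snippet below walks the address range of a function and issues one hint per cache line. Everything named here is an assumption made for illustration: rare_path is a hypothetical stand-in for the infrequently used routine, the 64-byte line size and the 256-byte size guess are not taken from the question, and converting a function pointer to a data pointer is implementation-defined (it works on gcc and clang for x86). Whether the hint actually leaves the code in L2 depends on the CPU, so treat it as something to measure, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u        /* assumption: 64-byte cache lines */
#define RARE_PATH_SIZE_GUESS 256u  /* assumption: rough size of rare_path in bytes */

/* Hypothetical stand-in for the rarely executed code path. */
int rare_path(int x)
{
    return x * 2 + 1;
}

/*
 * Issue one prefetch hint per cache line overlapping [start, start + length).
 * Returns the number of hints issued. Prefetch hints do not fault, so an
 * over-estimated length is harmless apart from wasted bandwidth.
 */
size_t prefetch_code_range(const void *start, size_t length)
{
    /* The alignment mask below only works for a power-of-two line size. */
    assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1u)) == 0);

    if (length == 0) {
        return 0;
    }

    uintptr_t first = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE_SIZE - 1u);
    uintptr_t end = (uintptr_t)start + length;
    size_t lines = 0;

    for (uintptr_t p = first; p < end; p += CACHE_LINE_SIZE) {
        /* rw = 0 (read); locality = 2 is a guess at an L2-targeted hint. */
        __builtin_prefetch((const void *)p, 0, 2);
        lines++;
    }
    return lines;
}

/* Call this when the advance notice arrives, some time before rare_path runs. */
size_t prefetch_rare_path(void)
{
    /* Function-to-data pointer conversion: implementation-defined. */
    const void *code = (const void *)(uintptr_t)&rare_path;
    return prefetch_code_range(code, RARE_PATH_SIZE_GUESS);
}
```

The return value exists only so the loop can be checked; in real use you would call prefetch_rare_path() and ignore it. If the routine calls other cold functions, each of those would need its own range.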
If you're using another compiler that doesn't have __builtin_prefetch, see whether it has the _mm_prefetch function. You may need to include a header file to get that function. For example, on OS X, that function, and the constants for the locality argument, are declared in xmmintrin.h.
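For the _mm_prefetch route, a minimal sketch might look like the following. It assumes an x86 target where xmmintrin.h is available and uses the _MM_HINT_T1 constant, which requests the line in L2 and higher. The HINT_L2 macro, the hint_lines_t1 helper, and the 64-byte line size are illustrative names and values, not part of any library. On non-x86 targets the macro compiles to a no-op so the file still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  /* declares _mm_prefetch and the _MM_HINT_* constants */
#define HINT_L2(p) _mm_prefetch((const char *)(p), _MM_HINT_T1)
#else
#define HINT_L2(p) ((void)(p))  /* no-op fallback on non-x86 targets */
#endif

#define ASSUMED_LINE_BYTES 64u  /* assumption: 64-byte cache lines */

/*
 * Issue a T1 hint for line_count consecutive cache lines starting at start.
 * Returns the number of bytes covered by the hints.
 */
size_t hint_lines_t1(const void *start, size_t line_count)
{
    assert(start != NULL);

    /* Integer arithmetic avoids forming out-of-bounds object pointers. */
    uintptr_t base = (uintptr_t)start;

    for (size_t i = 0; i < line_count; i++) {
        HINT_L2(base + i * ASSUMED_LINE_BYTES);
    }
    return line_count * ASSUMED_LINE_BYTES;
}
```

To use it on code, pass the address of the rarely used function (converted to a data pointer, which is implementation-defined) and a line count that covers its estimated size.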
There isn't any (official [1] x86) instruction to prefetch code, only data. I find this a rather bizarre use case: the code path is known beforehand but executes rarely, and yet there is supposedly a significant benefit in prefetching the code. It would be great to understand how you came to the conclusion that preloading the code has a significant benefit in this special case. That would require not only showing that the code is significantly slower when it has not been executed for a long time, but also determining that there are spare bus cycles to load the code before the processor fetches it through its normal mechanism for loading code.
You may be able to use the prefetch instructions that fetch into L2, which is typically shared between I- and D-cache.
[1] I know there are some "secret" instructions that allow the processor to manipulate cache content, but those would require a lot of extra work even if you could use them in user-mode code (and I expect this is not kernel-mode code), so I have not considered them here.
Source: https://stackoverflow.com/questions/16218757/how-can-i-prefetch-infrequently-used-code