Let\'s say I have a function that I plan to execute as part of a benchmark. I want to bring this code into the L1 instruction cache prior to executing since I don\'t want to
One approach that could work for small functions would be to execute some code which appears on the same cache line(s) as your target function, which will bring in the entire cache line.
For example, you could organize your code as follows:
ALIGN 64
function_under_test:
; some code, less than 64 bytes
dummy:
ret
and then call the dummy
function prior to calling function_under_test
- if dummy
starts on the same cache line as the target function, it would bring the entire cache line into L1I. This works for functions of 63 bytes or less1.
This can probably be extended to functions up to ~126 bytes or so by using this trick both at before2 and after the target function. You could extend it to arbitrarily sized functions by inserting dummy functions on every cache line and having the target code jump over them, but this comes at a cost of inserting the otherwise-unnecessary jumps your code under test, and requires careful control over the code size so that the dummy functions are placed correctly.
You need fine control over function alignment and placement to achieve this: assembler is probably the easiest, but you can also probably do it with C or C++ in combination with compiler-specific attributes.
1 You could even reuse the ret
in the function_under_test
itself to support slightly longer functions (e.g., those whose ret
starts within 64 bytes of the start).
2 You'd have to be more careful about the dummy function appearing before the code under test: the processor might fetch instructions past the ret
and it might (?) even execute them. A ud2
after the dummy ret
is likely to block further fetch (but you might want fetch if populating the uop cache is important).