I'm trying to speed up a variable-bitwidth integer compression scheme and I'm interested in generating and executing assembly code on-the-fly. Currently a lot of time is spe
Very good question, but the answer is not so easy. The final word will probably belong to experiment, as is common now that there are so many different architectures.
Anyway, what you want to do is not exactly self-modifying code. The "decode_x" procedures will exist and will not be modified, so there should be no problems with the cache.
On the other hand, the memory for the generated code will probably be allocated dynamically from the heap, so its addresses will be far enough from the executable code of the program. You can allocate a new block every time you need to generate a new call sequence.
How far is far enough? I think not very far. The distance probably needs to be a multiple of the processor's cache line size, which is not large: 64 bytes (L1) on my machine. With dynamically allocated memory you will be many pages away.
The main problem with this approach, IMO, is that the code of the generated procedures will be executed only once. The program then loses the main advantage of the cached memory model: efficient execution of looping code.
And finally, the experiment does not look hard to do. Just write a test program in both variants and measure the performance. And if you publish the results, I will read them carefully. :)
I found some better documentation from Intel and this seemed like the best place to put it for future reference:
Software should avoid writing to a code page in the same 1-KByte
subpage that is being executed or fetching code in the same 2-KByte
subpage of that is being written.
Intel® 64 and IA-32 Architectures Optimization Reference Manual
It's only a partial answer to the questions (test, test, test) but firmer numbers than the other sources I had found.
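As a rough illustration of the 1-KByte and 2-KByte figures quoted above, here is a minimal C sketch that checks whether an address about to be written shares an aligned subpage with code that is executing. The helper names (same_aligned_block, write_risks_smc) are hypothetical, and using the larger 2-KByte window for both cases is my own conservative simplification, not something the manual prescribes.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when addresses a and b fall in the same aligned block of
 * block_size bytes. block_size must be a power of two. */
static inline bool same_aligned_block(uintptr_t a, uintptr_t b,
                                      uintptr_t block_size)
{
    return (a & ~(block_size - 1)) == (b & ~(block_size - 1));
}

/* Hypothetical helper: flags a write that lands in the same aligned
 * 2-KByte subpage as code that is being executed or fetched.
 * Assumption: the 2-KByte window is used for both cases, which also
 * covers the narrower 1-KByte rule. */
static inline bool write_risks_smc(const void *write_addr,
                                   const void *exec_addr)
{
    return same_aligned_block((uintptr_t)write_addr,
                              (uintptr_t)exec_addr, 2048);
}
```

A code generator could call write_risks_smc(next_write, currently_running_function) before emitting bytes and move to a fresh block when it returns true. Whether that actually helps on a given CPU is something only measurement will tell.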
3.6.9 Mixing Code and Data.
Self-modifying code works correctly, according to the Intel architecture processor requirements, but incurs a significant performance penalty. Avoid self-modifying code if possible.
• Placing writable data in the code segment might be impossible to distinguish from self-modifying code. Writable data in the code segment might suffer the same performance penalty as self-modifying code.
Assembly/Compiler Coding Rule 57. (M impact, L generality) If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its mostly likely target, and place the data after an unconditional branch.
Tuning Suggestion 1. In rare cases, a performance problem may be caused by executing data on a code page as instructions. This is very likely to happen when execution is following an indirect branch that is not resident in the trace cache. If this is clearly causing a performance problem, try moving the data elsewhere, or inserting an illegal opcode or a PAUSE instruction immediately after the indirect branch. Note that the latter two alternatives may degrade performance in some circumstances.
Assembly/Compiler Coding Rule 58. (H impact, L generality) Always put code and data on separate pages. Avoid self-modifying code wherever possible. If code is to be modified, try to do it all at once and make sure the code that performs the modifications and the code being modified are on separate 4-KByte pages or on separate aligned 1-KByte subpages.
3.6.9.1 Self-modifying Code.
Self-modifying code (SMC) that ran correctly on Pentium III processors and prior implementations will run correctly on subsequent implementations. SMC and cross-modifying code (when multiple processors in a multiprocessor system are writing to a code page) should be avoided when high performance is desired.
Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition. Dynamic code need not cause the SMC condition if the code written fills up a data page before that page is accessed as code.
Dynamically-modified code (for example, from target fix-ups) is likely to suffer from the SMC condition and should be avoided where possible. Avoid the condition by introducing indirect branches and using data tables on data pages (not code pages) using register-indirect calls.
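The last suggestion in the quoted text (indirect calls through a data table instead of patched code) can be sketched in plain C. This is only an illustration under assumptions: decode_8, decode_16 and run_sequence are hypothetical stand-ins for the "decode_x" procedures in the question, and their signatures are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder signature: read from `in`, store one value in
 * `*out`, return the advanced input pointer. */
typedef const uint8_t *(*decode_fn)(const uint8_t *in, uint32_t *out);

/* Hypothetical decoder for an 8-bit field. */
static const uint8_t *decode_8(const uint8_t *in, uint32_t *out)
{
    *out = in[0];
    return in + 1;
}

/* Hypothetical decoder for a 16-bit little-endian field. */
static const uint8_t *decode_16(const uint8_t *in, uint32_t *out)
{
    *out = (uint32_t)in[0] | ((uint32_t)in[1] << 8);
    return in + 2;
}

/* The "call sequence" is ordinary data: an array of function pointers
 * living on a data page. Changing the sequence means writing to this
 * array, never to a code page. */
static inline const uint8_t *run_sequence(const decode_fn *seq, size_t n,
                                          const uint8_t *in, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        in = seq[i](in, &out[i]);   /* register-indirect call */
    return in;
}
```

With seq = { decode_8, decode_16, decode_8 }, run_sequence consumes four input bytes and produces three values. The cost is one indirect call per decoder, so how it compares with generated code depends on how well the branch predictor copes with the sequence.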
This doesn't have to be self-modifying code at all - it can be dynamically created code instead, i.e. runtime-generated "trampolines".
Meaning you keep a (global) function pointer around that'll redirect to a writable/executable mapped section of memory - in which you then actively insert the function calls you wish to make.
The main difficulty with this is that call is IP-relative (as are most forms of jmp), so you have to calculate the offset between the memory location of your trampoline and the target functions. That in itself is simple enough, but in 64-bit code call can only encode displacements in the range of +-2GB, so it becomes more complex: you need to call through a linkage table.
So you'd essentially create code like (/me severely UN*X biased, hence AT&T assembly, and some references to ELF-isms):
.Lstart_of_modifyable_section:
callq 0f
callq 1f
callq 2f
callq 3f
callq 4f
....
ret
.align 32
0: jmpq tgt0
.align 32
1: jmpq tgt1
.align 32
2: jmpq tgt2
.align 32
3: jmpq tgt3
.align 32
4: jmpq tgt4
.align 32
...
This can be created at compile time (just make a writable text section), or dynamically at runtime.
You then, at runtime, patch the jump targets. That's similar to how the .plt ELF section (PLT = procedure linkage table) works, except that there it's the dynamic linker which patches the jmp slots, while in your case you do that yourself.
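To make the displacement calculation and the +-2GB limit mentioned earlier concrete, here is a minimal sketch that encodes a call rel32 instruction (opcode 0xE8) into a byte buffer. The function name encode_call_rel32 and its parameters are hypothetical. It only produces bytes; it does not make anything executable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Encode `call rel32` into buf[0..4].
 * site_addr   : address at which the 5-byte instruction will live
 * target_addr : address of the function to call
 * Returns false when the target is out of reach of a 32-bit
 * displacement, in which case a linkage-table slot would be needed. */
static inline bool encode_call_rel32(uint8_t buf[5], uint64_t site_addr,
                                     uint64_t target_addr)
{
    /* The displacement is relative to the *next* instruction. */
    int64_t disp = (int64_t)(target_addr - (site_addr + 5));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return false;

    uint32_t u = (uint32_t)(int32_t)disp;
    buf[0] = 0xE8;                  /* CALL rel32 */
    buf[1] = (uint8_t)(u);          /* displacement, little-endian */
    buf[2] = (uint8_t)(u >> 8);
    buf[3] = (uint8_t)(u >> 16);
    buf[4] = (uint8_t)(u >> 24);
    return true;
}
```

A false return is exactly the case where the jump table comes in: the call targets a nearby slot, and the slot holds the full 64-bit address.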
If you go for doing it all at runtime, then a table like the above is easy to create even from C/C++; start with data structures like:
#include <stdint.h>

/* NUM_CALL_SLOTS (number of call slots) is defined elsewhere. */

struct call_tbl_entry {
    uint8_t call_opcode;
    int32_t call_displacement;
} __attribute__((packed));          /* no padding after the opcode byte */

union jmp_tbl_entry {
    uint8_t cacheline[32];          /* pads every slot to 32 bytes */
    struct {
        uint8_t  jmp_opcode[2];     // 64bit absolute jump
        uint64_t jmp_tgtaddress;
    } __attribute__((packed)) tbl;
};

struct mytbl {
    struct call_tbl_entry calltbl[NUM_CALL_SLOTS];
    uint8_t ret_opcode;
    union jmp_tbl_entry jmptbl[NUM_CALL_SLOTS];
};
The only critical and somewhat system-dependent thing here is the "packed" nature of this, which one needs to tell the compiler about (i.e. not to pad the call array out), and that one should cacheline-align the jump table.
You need to set calltbl[i].call_displacement = (int32_t)((char *)&jmptbl[i] - (char *)&calltbl[i+1]), initialize the empty/unused jump table with memset(&jmptbl, 0xC3 /* RET */, sizeof(jmptbl)), and then just fill the fields with the jump opcode and target address as you need.
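The fill-in steps can be sketched over a plain byte buffer, which sidesteps packing attributes entirely. Everything named here (slot_offset, table_size, build_table, patch_slot, put_le) is hypothetical. One deliberate deviation from the structs above: for the absolute jump I use the 14-byte sequence FF 25 00 00 00 00 (jmp *0(%rip)) followed by the 64-bit address, so the opcode part is six bytes rather than two. The sketch only builds the bytes; mapping the buffer as executable and 32-byte aligning its base are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { CALL_SIZE = 5, SLOT_SIZE = 32 };

/* Byte offset of jump slot i in a table of n calls: slots start at the
 * first SLOT_SIZE boundary after the n calls and the final ret. */
static inline size_t slot_offset(size_t n, size_t i)
{
    size_t calls_end = n * CALL_SIZE + 1;
    size_t first = (calls_end + SLOT_SIZE - 1) / SLOT_SIZE * SLOT_SIZE;
    return first + i * SLOT_SIZE;
}

static inline size_t table_size(size_t n) { return slot_offset(n, n); }

/* Store the low nbytes of v in little-endian order. */
static inline void put_le(uint8_t *p, uint64_t v, size_t nbytes)
{
    for (size_t k = 0; k < nbytes; k++)
        p[k] = (uint8_t)(v >> (8 * k));
}

/* Fill buf (table_size(n) bytes) with n calls, a ret, and empty slots. */
static inline void build_table(uint8_t *buf, size_t n)
{
    memset(buf, 0xC3, table_size(n));   /* RET: unused slots just return */
    for (size_t i = 0; i < n; i++) {
        size_t next = (i + 1) * CALL_SIZE;  /* offset of next instruction */
        buf[i * CALL_SIZE] = 0xE8;          /* CALL rel32 */
        put_le(buf + i * CALL_SIZE + 1,
               (uint32_t)(int32_t)(slot_offset(n, i) - next), 4);
    }
}

/* Point slot i at target: jmp *0(%rip), then the 64-bit address. */
static inline void patch_slot(uint8_t *buf, size_t n, size_t i,
                              uint64_t target)
{
    static const uint8_t jmp_rip0[6] = {0xFF, 0x25, 0, 0, 0, 0};
    uint8_t *slot = buf + slot_offset(n, i);
    memcpy(slot, jmp_rip0, sizeof jmp_rip0);
    put_le(slot + 6, target, 8);
}
```

Because the calls and the slots sit in the same buffer, the displacements depend only on offsets inside it, so the table can be built once and copied to wherever the executable mapping ends up.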
This is less in the scope of SMC and more in that of dynamic binary optimization: you don't really manipulate the code you're running (as in writing new instructions); you just generate a different piece of code and reroute the appropriate call in your code to jump there instead. The only modification is at the entry point, and it's done only once, so you don't need to worry too much about the overhead. It usually means flushing all the pipelines to make sure the old instruction isn't still alive anywhere in the machine; I'd guess the penalty is a few hundred clock cycles, depending on how loaded the CPU is. That is only relevant if it occurs repeatedly.
In the same sense, you shouldn't worry too much about doing this far enough ahead of time. By the way, regarding your question: the CPU would only be able to start executing ahead as far as its ROB size, which in Haswell is 192 uops (not instructions, but close enough) according to http://www.realworldtech.com/haswell-cpu/3/ , and it would be able to see slightly further ahead thanks to the predictor and fetch units, so overall we're talking about, let's say, a few hundred instructions.
That said, let me reiterate what was said here before: experiment, experiment, experiment :)