I need inline assembly code like this:
Instead of moving the value into ecx inside the asm template, use the "c" constraint so the compiler puts the operand in ecx directly:
: : "c"(foo)
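As a minimal sketch of that constraint in use (the function name shift_left and its parameters are made up for illustration, not taken from the question), a variable-count shift needs its count in cl, so the "c" input constraint lets the compiler load ecx itself instead of the template doing a mov. The non-x86 branch is only there so the snippet builds elsewhere.

```c
#include <stdint.h>

/* Hypothetical helper: shift `value` left by `count` bits.
   x86 variable shifts take their count in CL, so the "c" constraint
   asks the compiler to place `count` in ECX before the asm runs. */
static inline uint32_t shift_left(uint32_t value, uint32_t count) {
#if defined(__i386__) || defined(__x86_64__)
    __asm__("shll %%cl, %0"
            : "+r"(value)   /* read-write operand, any general register */
            : "c"(count)    /* input pinned to ECX by the compiler */
            : "cc");        /* the shift modifies the flags */
    return value;
#else
    /* Plain C fallback; x86 masks the count to 5 bits, so do the same. */
    return value << (count & 31);
#endif
}
```

Because the compiler does the register setup, it can often compute the count directly into ecx and avoid an extra mov.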
The direct use of the stack pointer to reference local variables is probably caused by compiler optimizations. I think you could solve the issue in a couple of ways:
- force the compiler to keep a frame pointer (-fno-omit-frame-pointer in GCC);
- list esp in the clobbers so the compiler will be aware that its value is being modified (check your compiler for compatibility).

Modifying ESP inside inline asm should generally be avoided when you have any memory inputs / outputs, so you don't have to disable optimizations or force the compiler to make a stack frame with EBP some other way. One major advantage is that you (or the compiler) can then use EBP as an extra free register, which is potentially a significant speedup if you're already having to spill/reload stuff. If you're writing inline asm, presumably this is a hotspot, so it's worth spending the extra code-size to use ESP-relative addressing modes.
In x86-64 code, there's an added obstacle to using push/pop safely, because you can't tell the compiler you want to clobber the red-zone below RSP. (You can compile with -mno-red-zone, but there's no way to disable it from the C source.) You can get problems like this where you clobber the compiler's data on the stack. No 32-bit x86 ABI has a red-zone, though, so this only applies to x86-64 System V. (Or non-x86 ISAs with a red-zone.)
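One workaround that is not described in this answer, shown here only as a hedged sketch, is to move RSP down past the 128-byte red-zone before pushing and restore it afterwards. The function name roundtrip_via_stack is made up for illustration. The sketch uses only register operands, because any memory operand the compiler addresses relative to RSP would be off by 128 bytes while RSP is adjusted.

```c
#include <stdint.h>

/* Hypothetical helper: bounce a value through the stack with push/pop
   without overwriting anything the compiler keeps in the red-zone. */
static inline uint64_t roundtrip_via_stack(uint64_t v) {
#if defined(__x86_64__)
    uint64_t out;
    __asm__ volatile(
        "lea -128(%%rsp), %%rsp\n\t"  /* step over the 128-byte red-zone */
        "push %1\n\t"                 /* now writes below the red-zone */
        "pop %0\n\t"
        "lea 128(%%rsp), %%rsp\n\t"   /* restore RSP exactly */
        : "=r"(out)                   /* register operands only, see note above */
        : "r"(v)
        : "memory");
    return out;
#else
    /* No x86-64 red-zone to worry about here; plain C fallback. */
    return v;
#endif
}
```

LEA is used instead of add/sub so the flags are left untouched, which matters if the asm in between depends on them.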
You only need -fno-omit-frame-pointer for that function if you want to do asm-only stuff like using push as a stack data structure, so that the number of pushes is variable. Or maybe if you're optimizing for code-size.
You can always write a whole non-inline function in asm and put it in a separate file; then you have full control. But only do that if your function is large enough to be worth the call/ret overhead, e.g. if it includes a whole loop; don't make the compiler call a short non-looping function inside a C inner loop, destroying all the call-clobbered registers and having to make sure globals are in sync.
It seems you're using push / pop inside inline asm because you don't have enough registers, and need to save/reload something. You don't need to use push/pop for save/restore. Instead, use dummy output operands with "=m" constraints to get the compiler to allocate stack space for you, and use mov to/from those slots. (Of course you're not limited to mov; it can be a win to use a memory source operand for an ALU instruction if you only need the value once or twice.)
This may be slightly worse for code-size, but is usually not worse for performance (and can be better). If that's not good enough, write the whole function (or the whole loop) in asm so you don't have to wrestle with the compiler.
int foo(char *p, int a, int b) {
    int t1,t2;  // dummy output spill slots
    int r1,r2;  // dummy output tmp registers
    int res;
    asm ("# operands: %0 %1 %2 %3 %4 %5 %6 %7 %8\n\t"
         "imull $123, %[b], %[res]\n\t"
         "mov %[res], %[spill1]\n\t"
         "mov %[a], %%ecx\n\t"
         "mov %[b], %[tmp1]\n\t"  // let the compiler allocate tmp regs, unless you need specific regs e.g. for a shift count
         "mov %[spill1], %[res]\n\t"
         : [res] "=&r" (res),
           [tmp1] "=&r" (r1), [tmp2] "=&r" (r2),  // early-clobber
           [spill1] "=m" (t1), [spill2] "=&rm" (t2)  // allow spilling to a register if there are spare regs
           , [p] "+&r" (p)
           , "+m" (*(char (*)[]) p)  // dummy in/output instead of memory clobber
         : [a] "rmi" (a), [b] "rm" (b)  // a can be an immediate, but b can't
         : "ecx"
    );
    return res;
    // p unused in the rest of the function
    // so it's really just an input to the asm,
    // which the asm is allowed to destroy
}
This compiles to the following asm with gcc7.3 -O3 -m32 on the Godbolt compiler explorer. Note the asm-comment showing what the compiler picked for all the template operands: it picked 12(%esp) for %[spill1] and %edi for %[spill2] (because I used "=&rm" for that operand, so the compiler saved/restored %edi outside the asm, and gave it to us for that dummy operand).
foo(char*, int, int):
    pushl %ebp
    pushl %edi
    pushl %esi
    pushl %ebx
    subl $16, %esp
    movl 36(%esp), %edx
    movl %edx, %ebp
#APP
# 19 "/tmp/compiler-explorer-compiler118120-55-w92ge8.v797i/example.cpp" 1
    # operands: %eax %ebx %esi 12(%esp) %edi %ebp (%edx) 40(%esp) 44(%esp)
    imull $123, 44(%esp), %eax
    mov %eax, 12(%esp)
    mov 40(%esp), %ecx
    mov 44(%esp), %ebx
    mov 12(%esp), %eax
# 0 "" 2
#NO_APP
    addl $16, %esp
    popl %ebx
    popl %esi
    popl %edi
    popl %ebp
    ret
Hmm, the dummy memory operand to tell the compiler which memory we modify seems to have resulted in dedicating a register to that, I guess because the p operand is early-clobber so it can't use the same register. I guess you could risk leaving off the early-clobber if you're confident none of the other inputs will use the same register as p (i.e. that they don't have the same value).