Say I'm writing a routine in x86 assembly, like "add", which adds two numbers passed as arguments.
For the most part this is a very simple method:
This does not need any free registers to simulate ret, but it needs 4 bytes of memory (a dword). Uses indirect jmp. Edit: As noted by Ira Baxter, this code is not reentrant. Works fine in single-threaded code. Will crash if used in multithreaded code.
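push ebp
mov ebp, esp
mov eax, [ebp+8]
add eax, [ebp+12]
mov ebp, [ebp+4]          ; load the return address
mov [return_address], ebp ; stash it in memory
pop ebp                   ; restore the caller's EBP
add esp,4                 ; discard the return address from the stack
jmp [return_address]      ; simulated ret

.data
return_address dd 0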
To replace only the ret instruction, without changing the rest of the code. Not reentrant. Do not use in multithreaded code. Edit: fixed bug in below code.
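push ebp
mov ebp, esp
mov ebp, [ebp+4]          ; load the return address
mov [return_address], ebp ; stash it in memory
pop ebp                   ; restore the original EBP
add esp,4                 ; discard the return address from the stack
jmp [return_address]      ; simulated ret

.data
return_address dd 0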
Some other answers present ideas for avoiding registers entirely. This is slower and usually not needed.
(Much slower if you don't have a red-zone below ESP/RSP you can use, like the one the x86-64 System V ABI guarantees for user-space. No other x86/x86-64 ABI guarantees a red-zone, so space below ESP could be clobbered by a debugger evaluating print some_func(123) while stopped at a breakpoint, or by a Unix signal handler. See Is it valid to write below ESP? for more about the safety of data below ESP, especially on Windows.)
In typical 32-bit calling conventions, EAX, ECX, and EDX are all call-clobbered. (This holds for i386 System V and for all of the Windows conventions: cdecl, stdcall, fastcall, etc.)
The Irvine32 calling convention has no call-clobbered registers; that's the one case I know of where this won't work.
So unless you're using a custom calling convention that returns something in ECX, you can safely replace ret with pop ecx / jmp ecx and still produce "the exact same result" and fully obey the calling convention. (64-bit integers are returned in EDX:EAX, so in some functions you can't clobber EDX.)
add:
mov eax, [esp+4]
add eax, [esp+8]
;;ret
pop ecx
jmp ecx ; bad performance: misaligns the return address predictor stack
I also removed the stack-frame overhead / noise for readability.
ret is basically how you write pop eip (or IP / RIP) in x86, so popping into an architectural register and using a register-indirect jump is architecturally equivalent. (But much worse microarchitecturally because of call/ret special handling for branch prediction.)
To avoid registers, in a function with a stack arg, we can overwrite one of the args. In the standard calling conventions, functions own their incoming args and can use those arg-passing slots as scratch space, even if they're declared as foo(const int a, const int b).
add:
mov eax, [esp+4] ; arg1
add eax, [esp+8] ; arg2
;;ret
pop [esp] ; copy return address to arg1, and do ESP+=4
jmp [esp] ; ESP is pointing to arg1
This wouldn't work for a function with no args, or with only register args. (Except in Windows x64, where you could copy the retaddr into the 32-byte shadow space above the return address.)
Despite the pseudocode in the Operation section in Intel's ISA manual (https://www.felixcloutier.com/x86/pop) showing that DEST ← SS:ESP; happens before ESP += 4, the Description section says "If the ESP register is used as a base register for addressing a destination operand in memory, the POP instruction computes the effective address of the operand after it increments the ESP register." It also says that "POP ESP increments the stack pointer (ESP) before data at the old top of stack is written into the destination." So it's really tmp = pop; dst = tmp. AMD doesn't mention either corner-case at all.
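To check that ordering on paper, here is a small Python model of the pop [esp] / jmp [esp] sequence (a sketch, not real assembly). The stack addresses, the return address 0x401000, and the helper name pop_mem_esp are all made up for illustration; the only thing it assumes about the hardware is the "increment ESP first, then compute the destination address" rule quoted above.

```python
def pop_mem_esp(mem, esp):
    """Model `pop [esp]`: read the old top of stack, bump ESP,
    then compute the destination address using the *new* ESP."""
    tmp = mem[esp]   # tmp = dword at the old top of stack (the return address)
    esp += 4         # ESP is incremented first ...
    mem[esp] = tmp   # ... then [ESP] is evaluated and written
    return esp

# Hypothetical stack on entry to add(3, 4) with stack args.
RET_ADDR = 0x401000
mem = {
    0x1000: RET_ADDR,  # [esp]   return address
    0x1004: 3,         # [esp+4] arg1
    0x1008: 4,         # [esp+8] arg2
}
esp = 0x1000

eax = (mem[esp + 4] + mem[esp + 8]) & 0xFFFFFFFF  # mov eax,[esp+4] / add eax,[esp+8]
esp = pop_mem_esp(mem, esp)                       # pop [esp]
jump_target = mem[esp]                            # jmp [esp]
```

In this model ESP ends at the arg1 slot, which is where a plain ret would have left it, and the jump target is the original return address.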
If I'd left in the legacy stack-frame crap with EBP, I could have avoided an [ESP] destination pop, using EBP as a temporary before restoring it: mov ebp, [ebp+4] / mov [esp+8], ebp / pop ebp / add esp,4 / jmp [esp]. But that's hardly better or easier to follow. (The saved EBP value is below the return address, and you can't safely move ESP up past it either.) And this temporarily breaks legacy backtraces following a chain of EBP pointing to saved-EBP.
Or you could save / restore another register to use as a temporary for copying the return address over an arg. But that seems pointless vs. pop [esp] once you sort out exactly what that does.
Mismatched call/ret lead to bad performance for future ret instructions going back up the call-stack in parent functions. (Unless your caller also avoided call, manually pushing a return address.)
See Microbenchmarking Return Address Branch Prediction, and also Agner Fog's microarch and optimization guides. Specifically the part that's quoted and discussed in Return address prediction stack buffer vs stack-stored return address?
(Fun fact: most CPUs special case call +0, because it's not rare for code to use call next_instruction / pop ebx as part of position-independent 32-bit code to work around the lack of RIP-relative addressing. See the stuffedcow.net blog post.)
Note that a tailcall like jmp add instead of call add / ret is fine: that doesn't cause a mismatch because the first ret is returning to the most recent call (in the parent of the function that ended with a tailcall). You could look at it as making the body of the 2nd function "part of" the function that did the tailcall, as far as call / ret is concerned.
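The mismatch can be illustrated with a toy Python model of a return-address predictor stack. This is only a sketch under simplifying assumptions: call pushes a predicted return address, ret pops one and compares it with the real target, and jmp leaves the predictor alone. Real CPUs have a finite predictor stack and may have recovery mechanisms this ignores, and the function name count_ret_mispredicts and the addresses A, B, C are invented for the example.

```python
def count_ret_mispredicts(events):
    """Replay (op, addr) events and count ret instructions whose
    predicted return address differs from the actual one."""
    predictor_stack = []
    mispredicts = 0
    for op, addr in events:
        if op == "call":
            predictor_stack.append(addr)  # remember where this call should return
        elif op == "ret":
            predicted = predictor_stack.pop() if predictor_stack else None
            if predicted != addr:
                mispredicts += 1
        # "jmp" (pop ecx / jmp ecx, or a tailcall) does not touch the predictor
    return mispredicts

A, B, C = 0x100, 0x200, 0x300  # hypothetical return addresses, outermost first

# Three nested calls, all returning with ret.
matched = [("call", A), ("call", B), ("call", C),
           ("ret", C), ("ret", B), ("ret", A)]

# Innermost function returns with pop ecx / jmp ecx instead of ret.
fake_ret = [("call", A), ("call", B), ("call", C),
            ("jmp", C), ("ret", B), ("ret", A)]

# Second function ends with a tailcall (jmp) to a third, which uses ret.
tailcall = [("call", A), ("call", B),
            ("jmp", None), ("ret", B), ("ret", A)]

results = {name: count_ret_mispredicts(ev)
           for name, ev in [("matched", matched),
                            ("fake_ret", fake_ret),
                            ("tailcall", tailcall)]}
```

In this model the matched and tailcall sequences predict every ret correctly, while the simulated ret leaves a stale entry behind so both parent ret instructions are mispredicted. The model does not count any misprediction of the indirect jmp itself.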
Sure.
push ebp
mov ebp, esp
mov eax, [ebp+8]
add eax, [ebp+12]
mov esp, ebp
pop ebp
pop ecx ; these two instructions simulate "ret"
jmp ecx
This assumes you have a free register (e.g., ecx). Writing an equivalent that uses "no registers" is possible (after all the x86 is a Turing machine) but is likely to include a lot of convoluted register and stack shuffling.
Most current OSes offer thread-specific storage accessible by one of the segment registers. You could then simulate "ret" this way, safely:
pop gs:preallocated_tls_slot ; pick one
jmp gs:preallocated_tls_slot
Haven't tested, but you may be able to do a ret without using a GPR like this:
add esp,4
jmp dword ptr [esp-4]
It is possible to make return_address an array of dwords and let each thread access return_address at a unique index computed by a one-to-one (injective) function of its unique identifier.
This change makes nrz's accepted answer work for multithreaded code as well!
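A rough Python model of why the indexing helps, under a deliberately bad interleaving where both threads store their return address before either one jumps. The thread ids, return addresses, and the name run_interleaved are hypothetical; this only models the storage, not real threads or real assembly.

```python
def run_interleaved(slot_index):
    """Two threads each store a return address, then each jumps through
    its slot. Returns, per thread, whether it reached its own address."""
    return_addresses = {1: 0x1111, 2: 0x2222}  # hypothetical thread id -> return address
    slots = {}                                 # models the return_address storage
    # Worst-case interleaving: both stores happen before either jmp.
    for tid in (1, 2):
        slots[slot_index(tid)] = return_addresses[tid]
    return {tid: slots[slot_index(tid)] == return_addresses[tid]
            for tid in (1, 2)}

shared = run_interleaved(lambda tid: 0)        # one dword shared by all threads
per_thread = run_interleaved(lambda tid: tid)  # injective index, one slot per thread
```

With the single shared dword, thread 1 jumps to thread 2's return address; with an injective index each thread reads back its own.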