What is the x86 “ret” instruction equivalent to?

独厮守ぢ 2020-12-08 16:27

Say I'm writing a routine in x86 assembly, like "add", which adds two numbers passed as arguments.

For the most part this is a very simple method:



        
5 Answers
  • 2020-12-08 16:52

    This does not need any free registers to simulate ret, but it needs 4 bytes of memory (a dword) and uses an indirect jmp. Edit: as noted by Ira Baxter, this code is not reentrant. It works fine in single-threaded code, but it will crash in multithreaded code, because another thread can overwrite the shared return_address slot between the store and the jmp.

    push ebp
    mov  ebp, esp
    mov  eax, [ebp+8]
    add  eax, [ebp+12]
    mov  ebp, [ebp+4]
    mov  [return_address], ebp
    pop  ebp
    
    add  esp,4
    jmp  [return_address]
    
    .data
    return_address dd 0
    

    To replace only the ret instruction, without changing the rest of the code, use the sequence below. It is also not reentrant, so do not use it in multithreaded code. Edit: fixed a bug in the code below.

    push ebp
    mov  ebp, esp
    mov  ebp, [ebp+4]
    mov  [return_address], ebp
    pop  ebp
    
    add  esp,4
    jmp  [return_address]
    
    .data
    return_address dd 0
    
  • 2020-12-08 16:53

    Some other answers present ideas for avoiding registers entirely. This is slower and usually not needed.

    (Much slower if you don't have a red zone below ESP/RSP that you can use, as the x86-64 System V ABI guarantees for user space. No other x86/x86-64 ABI guarantees a red zone, so the space below ESP could be clobbered by a debugger evaluating print some_func(123) while stopped at a breakpoint, or by a Unix signal handler. See Is it valid to write below ESP? for more about the safety of data below ESP, especially on Windows.)


    In typical 32-bit calling conventions, EAX, ECX, and EDX are all call-clobbered (i386 System V, and all of Windows cdecl, stdcall, fastcall, etc.).

    The Irvine32 calling convention has no call-clobbered registers; that's the one case I know of where this won't work.

    So unless you're using a custom calling convention that returns something in ECX, you can safely replace ret with pop ecx/jmp ecx and still produce "the exact same result" and fully obey the calling convention. (64-bit integers are returned in EDX:EAX, so in some functions you can't clobber EDX).

    add:
        mov   eax, [esp+4]
        add   eax, [esp+8]
        ;;ret
        pop   ecx
        jmp   ecx           ; bad performance: misaligns the return address predictor stack
    

    I also removed the stack-frame overhead / noise for readability.

    ret is basically how you write pop eip (or IP / RIP) in x86, so popping into an architectural register and using a register-indirect jump is architecturally equivalent. (But much worse microarchitecturally because of call/ret special handling for branch prediction.)
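
    As a rough illustration of that equivalence, here is a toy Python model of the stack effects. It is not an emulator: the addresses, the 0x401234 return address, and the helper names are all made up for this sketch. It only shows that ret and pop ecx / jmp ecx leave EIP, ESP, and memory in the same state, differing in ECX alone.

```python
# Toy model of 32-bit stack semantics. All addresses and values are invented.

def make_state():
    # State right after `call add` with two stack args (cdecl-style layout).
    regs = {"esp": 0x1000, "eip": 0, "ecx": 0xDEAD}
    mem = {
        0x1000: 0x401234,  # return address pushed by call (hypothetical)
        0x1004: 2,         # arg1
        0x1008: 3,         # arg2
    }
    return regs, mem

def ret(regs, mem):
    # ret: EIP = [ESP]; ESP += 4
    regs["eip"] = mem[regs["esp"]]
    regs["esp"] += 4

def pop_reg(regs, mem, name):
    # pop r32: r32 = [ESP]; ESP += 4
    regs[name] = mem[regs["esp"]]
    regs["esp"] += 4

def jmp_reg(regs, name):
    # jmp r32: EIP = r32
    regs["eip"] = regs[name]

# Path A: a plain ret.
a_regs, a_mem = make_state()
ret(a_regs, a_mem)

# Path B: pop ecx / jmp ecx.
b_regs, b_mem = make_state()
pop_reg(b_regs, b_mem, "ecx")
jmp_reg(b_regs, "ecx")

print(hex(a_regs["eip"]), hex(a_regs["esp"]), hex(a_regs["ecx"]))
print(hex(b_regs["eip"]), hex(b_regs["esp"]), hex(b_regs["ecx"]))
```

    The only architectural difference in this model is that ECX now holds the return address, which is why the replacement is safe only when ECX is call-clobbered. The model says nothing about branch prediction.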


    To avoid registers, in a function with a stack arg, we can overwrite one of the args. In the standard calling conventions, functions own their incoming args and can use those arg-passing slots as scratch space, even if they're declared as foo(const int a, const int b).

    add:
        mov   eax, [esp+4]    ; arg1
        add   eax, [esp+8]    ; arg2
        ;;ret
        pop   [esp]           ; copy return address to arg1, and do ESP+=4
        jmp   [esp]           ; ESP is pointing to arg1
    

    This wouldn't work for a function with no args, or with only register args. (Except in Windows x64, where you could copy the retaddr into the 32-byte shadow space above the return address.)

    Although the pseudocode in the Operation section of Intel's ISA manual (https://www.felixcloutier.com/x86/pop) shows DEST ← SS:ESP; happening before ESP += 4, the Description section says "If the ESP register is used as a base register for addressing a destination operand in memory, the POP instruction computes the effective address of the operand after it increments the ESP register." It also says that "POP ESP increments the stack pointer (ESP) before data at the old top of stack is written into the destination." So it's really tmp = pop ; dst = tmp. AMD doesn't mention either corner case at all.
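
    To make the tmp = pop ; dst = tmp ordering concrete, here is a small Python model (a sketch only, with invented addresses and helper names). It compares the documented behaviour with a hypothetical misreading in which the destination address is computed from the old ESP, and shows why the pop [esp] / jmp [esp] trick depends on the documented ordering.

```python
# Toy model only: addresses and values are invented for illustration.
RET_ADDR = 0x401234  # hypothetical return address

def fresh():
    # Stack on entry to add(2, 3): [esp]=return address, [esp+4]=arg1, [esp+8]=arg2
    regs = {"esp": 0x1000, "eip": 0}
    mem = {0x1000: RET_ADDR, 0x1004: 2, 0x1008: 3}
    return regs, mem

def pop_to_esp_mem(regs, mem):
    # pop dword [esp], per the Description section:
    # load the old top of stack, increment ESP, then address the
    # destination using the new ESP.
    tmp = mem[regs["esp"]]
    regs["esp"] += 4
    mem[regs["esp"]] = tmp

def pop_to_esp_mem_naive(regs, mem):
    # Hypothetical misreading: destination addressed with the old ESP.
    dest = regs["esp"]
    mem[dest] = mem[regs["esp"]]
    regs["esp"] += 4

def jmp_esp_mem(regs, mem):
    # jmp dword [esp]: load the target from the stack; ESP is unchanged.
    regs["eip"] = mem[regs["esp"]]

regs, mem = fresh()
pop_to_esp_mem(regs, mem)
jmp_esp_mem(regs, mem)

bad_regs, bad_mem = fresh()
pop_to_esp_mem_naive(bad_regs, bad_mem)
jmp_esp_mem(bad_regs, bad_mem)

print(hex(regs["eip"]), hex(regs["esp"]))          # 0x401234 0x1004
print(hex(bad_regs["eip"]), hex(bad_regs["esp"]))  # 0x2 0x1004
```

    With the documented ordering, the return address lands in the arg1 slot and the jump goes to it, with ESP where a real ret would leave it. Under the misreading, the jump target would be the value of arg1.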

    If I'd left in the legacy stack-frame crap with EBP, I could have avoided an [ESP] destination pop, using EBP as a temporary before restoring it. mov ebp, [ebp+4] / mov [esp+8], ebp / pop ebp / add esp,4 / jmp [esp], but that's hardly better or easier to follow. (The saved EBP value is below the return address, and you can't safely move ESP up past it either.) And this temporarily breaks legacy backtraces following a chain of EBP pointing to saved-EBP.

    Or you could save / restore another register to use as a temporary for copying the return address over an arg. But that seems pointless vs. pop [esp] once you sort out exactly what that does.


    Avoiding RET is terrible for performance

    (Unless your caller also avoided call, manually pushing a return address.)

    Mismatched call/ret lead to bad performance for future ret instructions going back up the call-stack in parent functions.

    See Microbenchmarking Return Address Branch Prediction, and also Agner Fog's microarch and optimization guides. Specifically the part that's quoted and discussed in Return address prediction stack buffer vs stack-stored return address?
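
    The following Python sketch is a deliberately simplified model of a return-address predictor stack, not a description of any real CPU: the class, its fields, and the addresses A, B, C are assumptions made for illustration. It assumes call pushes a prediction, ret pops one and compares it with the real target, and an indirect jmp leaves the predictor stack untouched.

```python
class ReturnStackModel:
    """Toy return-address predictor: a private stack of predicted targets."""

    def __init__(self):
        self.ras = []
        self.ret_mispredicts = 0

    def call(self, return_address):
        # call pushes the address of the next instruction as a prediction.
        self.ras.append(return_address)

    def ret(self, actual_target):
        # ret pops a prediction and checks it against the real target.
        predicted = self.ras.pop() if self.ras else None
        if predicted != actual_target:
            self.ret_mispredicts += 1

    def pop_jmp(self, actual_target):
        # pop reg / jmp reg: in this model the predictor stack is not
        # consulted or popped, so a stale entry stays behind.
        pass

A, B, C = 0x1000, 0x2000, 0x3000  # hypothetical return addresses

# Three nested calls, each ended by a real ret.
matched = ReturnStackModel()
for addr in (A, B, C):
    matched.call(addr)
for addr in (C, B, A):
    matched.ret(addr)

# Same calls, but the innermost function returns with pop/jmp.
mismatched = ReturnStackModel()
for addr in (A, B, C):
    mismatched.call(addr)
mismatched.pop_jmp(C)
mismatched.ret(B)  # predictor offers C
mismatched.ret(A)  # predictor offers B

print(matched.ret_mispredicts, mismatched.ret_mispredicts)  # 0 2
```

    In this model, one skipped ret shifts every later prediction by one entry, so each parent's ret is mispredicted. Real predictors are more elaborate, and the model ignores any misprediction of the indirect jmp itself.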

    (Fun fact: most CPUs special-case call +0, because it's not rare for code to use call next_instruction / pop ebx as part of position-independent 32-bit code, to work around the lack of RIP-relative addressing. See the stuffedcow.net blog post.)

    Note that a tailcall like jmp add instead of call add / ret is fine: that doesn't cause a mismatch because the first ret is returning to the most recent call (in the parent of the function that ended with a tailcall). You could look at it as making the body of the 2nd function "part of" the function that did the tailcall, as far as call / ret is concerned.

  • 2020-12-08 16:58

    Sure.

    push ebp
    mov ebp, esp
    mov eax, [ebp+8]
    add eax, [ebp+12]
    mov esp, ebp
    pop ebp
    
    pop ecx  ; these two instructions simulate "ret"
    jmp ecx
    

    This assumes you have a free register (e.g., ecx). Writing an equivalent that uses "no registers" is possible (after all, the x86 is a Turing machine) but is likely to include a lot of convoluted register and stack shuffling.

    Most current OSes offer thread-specific storage accessible by one of the segment registers. You could then simulate "ret" this way, safely:

     pop   gs:preallocated_tls_slot  ; pick one
     jmp   gs:preallocated_tls_slot
    
  • 2020-12-08 17:01

    Haven't tested, but you may be able to do a ret without using a GPR like this:

    add esp,4
    jmp dword ptr [esp-4]
    
  • 2020-12-08 17:07

    It is possible to make return_address an array of dwords and let each thread access it at a unique index, computed by a one-to-one (injective) function of the thread's unique identifier.

    This change makes nrz's accepted answer work for multithreaded code as well!
