Inline assembly that clobbers the red zone

迷失自我 2020-12-03 10:19

I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like <

5 Answers
  • 2020-12-03 10:59

    Not sure but looking at GCC documentation for function attributes, I found the stdcall function attribute which might be of interest.

    I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could turn it into a macro or an inline function.

  • 2020-12-03 11:13

    Can't you just modify your assembly function to meet the requirements of a signal in the x86-64 ABI by shifting the stack pointer by 128 bytes on entry to your function?

    Or if you are referring to the return pointer itself, put the shift into your call macro (so sub %rsp; call...)

  • 2020-12-03 11:23

    What about creating a dummy function that is written in C and does nothing but call the inline assembly?

  • 2020-12-03 11:25

    The max-performance way might be to write the whole inner loop in asm (including the call instructions, if it's really worth it to unroll but not inline; that's certainly plausible if fully inlining is causing too many uop-cache misses elsewhere).

    Anyway, have C call an asm function containing your optimized loop.

    BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).

    Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):

    void testloop(long *p, long count) {
      for (long i = 0 ; i < count ; i++) {
        asm("  #    XXX  asm operand in %0"
        : "+r" (p[i])
        :
        : // "rax",
         "rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
          "r8", "r9", "r10", "r11", "r12","r13","r14","r15"
        );
      }
    }
    
    #gcc7.2 -O3 -march=haswell
    
        push registers and other function-intro stuff
        lea     rcx, [rdi+rsi*8]      ; end-pointer
        mov     rax, rdi
       
        mov     QWORD PTR [rsp-8], rcx    ; store the end-pointer
        mov     QWORD PTR [rsp-16], rdi   ; and the start-pointer
    
    .L6:
        # rax holds the current-position pointer on loop entry
        # also stored in [rsp-16]
        mov     rdx, QWORD PTR [rax]
        mov     rax, rdx                 # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx
    
             XXX  asm operand in rax
    
        mov     rbx, QWORD PTR [rsp-16]   # reload the pointer
        mov     QWORD PTR [rbx], rax
        mov     rax, rbx            # another weird missed-optimization (lea rax, [rbx+8])
        add     rax, 8
        mov     QWORD PTR [rsp-16], rax
        cmp     QWORD PTR [rsp-8], rax
        jne     .L6
    
      # cleanup omitted.
    

    clang counts a separate counter down towards zero. But it uses load / add -1 / store instead of a memory-destination add [mem], -1 / jnz.

    You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.

    Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.
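    As a rough illustration of the GP/XMM round trip described above, the sketch below uses the SSE2 intrinsics _mm_cvtsi64_si128 and _mm_cvtsi128_si64 (each normally compiles to a single movq). The helper names are made up for this example, and the compiler is free to keep the value in an integer register or spill it anyway, so treat this as a demonstration of the operations rather than a way to force register allocation.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stdint.h>

/* Hypothetical helper: move a 64-bit integer into the low half of an XMM register. */
static inline __m128i stash_in_xmm(uint64_t v) {
    return _mm_cvtsi64_si128((long long)v);
}

/* Hypothetical helper: move the low 64 bits of an XMM register back to a GP register. */
static inline uint64_t fetch_from_xmm(__m128i x) {
    return (uint64_t)_mm_cvtsi128_si64(x);
}

/* Scalar 64-bit add done in XMM registers (paddq; only the low lane matters here). */
static inline __m128i xmm_add_u64(__m128i a, __m128i b) {
    return _mm_add_epi64(a, b);
}
```

    In hand-written asm the same idea is just movq %rax, %xmm0 to stash and movq %xmm0, %rax to fetch.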

    But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.
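    To make the carry problem concrete, here is a plain C sketch of a two-limb add where the carry-out is recovered with a compare. That is the shape a SIMD version is forced into, because paddq sets no carry flag. The struct and function names are invented for illustration; with integer registers the same operation is just add / adc.

```c
#include <stdint.h>

/* Hypothetical 128-bit value stored as two 64-bit limbs. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_limbs;

/* Add two 128-bit values, deriving the carry with a compare instead of a flag. */
static u128_limbs add128_manual_carry(u128_limbs a, u128_limbs b) {
    u128_limbs r;
    r.lo = a.lo + b.lo;
    uint64_t carry = (r.lo < a.lo);   /* unsigned wraparound means a carry-out */
    r.hi = a.hi + b.hi + carry;       /* carry-in costs one more add */
    return r;
}
```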


    If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodable as an imm8 but +128 isn't.)
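    A minimal sketch of that add / call / sub sequence as a GNU C inline-asm wrapper follows. Everything named here is an assumption for the sake of a runnable example: demo_double is a hypothetical stand-in for the real routine, the code assumes x86-64 with AT&T syntax on an ELF target, and the clobber list is only illustrative and must match whatever the real routine modifies.

```c
#include <stdint.h>

/* Hypothetical stand-in for the hand-written routine: returns 2*x in rax and
 * uses no stack other than the return address pushed by call. */
__asm__(
    ".pushsection .text\n"
    "demo_double:\n"
    "    lea (%rdi,%rdi), %rax\n"
    "    ret\n"
    ".popsection\n");

/* Call demo_double without letting the pushed return address land in the
 * caller's red zone. */
static inline uint64_t call_skipping_red_zone(uint64_t x) {
    uint64_t result;
    __asm__ volatile(
        "add $-128, %%rsp\n\t"   /* step over the 128-byte red zone */
        "call demo_double\n\t"   /* return address is written below it */
        "sub $-128, %%rsp"       /* restore rsp */
        : "=a"(result), "+D"(x)
        :
        : "rcx", "rdx", "rsi", "r8", "r9", "r10", "r11", "cc", "memory");
    return result;
}
```

    Operands are passed in registers only, so nothing inside the asm statement addresses memory relative to the temporarily adjusted rsp.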

    Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.

    // a non-leaf function that still uses the red-zone with gcc
    void bar(void) {
      //cryptofunc(1);  // gcc/clang don't use the redzone after this (not future-proof)
    
      volatile int tmp = 1;
      (void)tmp;
      cryptofunc(1);  // but gcc will use the redzone before a tailcall
    }
    
    # gcc7.2 -O3 output
        mov     edi, 1
        mov     DWORD PTR [rsp-12], 1
        mov     eax, DWORD PTR [rsp-12]
        jmp     cryptofunc(long)
    

    If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.


    GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.

    You can use stuff like

    __attribute__(( target("sse4.1,arch=core2") ))
    void penryn_version(void) {
      ...
    }
    

    but not __attribute__(( target("mno-red-zone") )).

    There's a #pragma GCC optimize and an optimize function-attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can set -f options or -O.
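    For reference, a per-function optimization level looks like the sketch below. It assumes GCC (clang does not implement the optimize attribute, hence the fallback macro), and both the macro and the function are made-up examples. This only changes -O / -f settings for that one function; as noted above, it cannot switch off the red zone.

```c
/* GCC-only: request -O2 for one function even in an -O0 build.
 * Other compilers get a plain definition. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPTIMIZE_O2 __attribute__((optimize("O2")))
#else
#define OPTIMIZE_O2
#endif

/* Hypothetical hot function that should stay optimized in debug builds. */
OPTIMIZE_O2
long sum_longs(const long *v, long n) {
    long total = 0;
    for (long i = 0; i < n; i++)
        total += v[i];
    return total;
}
```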

    You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)

  • 2020-12-03 11:26

    From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:

    int global;
    void other(void);   /* defined elsewhere */

    void was_leaf(void)
    {
        if (global) other();   /* the possible call makes this a non-leaf */
    }
    

    GCC can't tell whether global will be true, so it can't optimize away the call to other(), and was_leaf() is no longer a leaf function. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp, while with the modification shown it did.

    I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:

        pushq   %rbp
        movq    %rsp, %rbp
        subq    $40, %rsp
        movb    $7, -155(%rbp)
    

    If I put the leaf-defeating code back in, that becomes subq $160, %rsp.
