Why does gcc reorder the local variables in a function?

Submitted by 懵懂的女人 on 2019-12-18 18:03:23

Question


I wrote a C program that just reads/writes a large array. I compiled the program with the command gcc -O0 program.c -o program. Out of curiosity, I disassembled the program with the objdump -S command.

The code and assembly of the read_array and write_array functions are attached at the end of this question.

I'm trying to interpret how gcc compiles the function. I used // to add my comments and questions.

Take this piece from the beginning of the assembly code of the write_array() function:

  4008c1:   48 89 7d e8             mov    %rdi,-0x18(%rbp) // this is the first parameter of the function
  4008c5:   48 89 75 e0             mov    %rsi,-0x20(%rbp) // this is the second parameter of the function
  4008c9:   c6 45 ff 01             movb   $0x1,-0x1(%rbp) // comparing with the source code, I think this is the `char tmp` variable 
  4008cd:   c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp) // this should be the `int i` variable.

What I don't understand is:

1) char tmp is obviously defined after int i in the write_array function. Why does gcc reorder the memory locations of these two local variables?

2) From the offset, int i is at -0x8(%rbp) and char tmp is at -0x1(%rbp), which indicates variable int i takes 7 bytes? This is quite weird because int i should be 4 bytes on x86-64 machine. Isn't it? My speculation is that gcc tries to do some alignment?

3) I find gcc's optimization choices quite interesting. Are there any good documents or books that explain how gcc works? (This third question may be off-topic; if you think so, please just ignore it. I'm just trying to see if there is a shortcut to learning the underlying mechanisms gcc uses for compilation. :-) )

Below is the piece of function code:

#define CACHE_LINE_SIZE 64
static inline void
read_array(char* array, long size)
{
    int i;
    char tmp;
    for ( i = 0; i < size; i+= CACHE_LINE_SIZE )
    {
        tmp = array[i];
    }
    return;
}

static inline void
write_array(char* array, long size)
{
    int i;
    char tmp = 1;
    for ( i = 0; i < size; i+= CACHE_LINE_SIZE )
    {
        array[i] = tmp;
    }
    return;
}

Below is the piece of disassembled code for write_array, from gcc -O0:

00000000004008bd <write_array>:
  4008bd:   55                      push   %rbp
  4008be:   48 89 e5                mov    %rsp,%rbp
  4008c1:   48 89 7d e8             mov    %rdi,-0x18(%rbp)
  4008c5:   48 89 75 e0             mov    %rsi,-0x20(%rbp)
  4008c9:   c6 45 ff 01             movb   $0x1,-0x1(%rbp)
  4008cd:   c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
  4008d4:   eb 13                   jmp    4008e9 <write_array+0x2c>
  4008d6:   8b 45 f8                mov    -0x8(%rbp),%eax
  4008d9:   48 98                   cltq
  4008db:   48 03 45 e8             add    -0x18(%rbp),%rax
  4008df:   0f b6 55 ff             movzbl -0x1(%rbp),%edx
  4008e3:   88 10                   mov    %dl,(%rax)
  4008e5:   83 45 f8 40             addl   $0x40,-0x8(%rbp)
  4008e9:   8b 45 f8                mov    -0x8(%rbp),%eax
  4008ec:   48 98                   cltq
  4008ee:   48 3b 45 e0             cmp    -0x20(%rbp),%rax
  4008f2:   7c e2                   jl     4008d6 <write_array+0x19>
  4008f4:   5d                      pop    %rbp
  4008f5:   c3                      retq

Answer 1:


Even at -O0, gcc doesn't emit definitions for static inline functions unless there's a caller. In that case, it doesn't actually inline: instead it emits a stand-alone definition. So I guess your disassembly is from that.


Are you using a really old gcc version? gcc 4.6.4 puts the vars in that order on the stack, but 4.7.3 and later use the other order:

    movb    $1, -5(%rbp)    #, tmp
    movl    $0, -4(%rbp)    #, i

In your asm, they're stored in order of initialization rather than declaration, but I think that's just by chance, since the order changed with gcc 4.7. Also, tacking on an initializer like int i=1; doesn't change the allocation order, so that completely torpedoes that theory.

Remember that gcc is designed around a series of transformations from source to asm, so -O0 doesn't mean "no optimization". You should think of -O0 as leaving out some things that -O3 normally does. There is no option that tries to make a literal-as-possible translation from source to asm.

Once gcc does decide which order to allocate space for them:

  • the char at rbp-1: That's the first available location that can hold a char. If there was another char that needed storing, it could go at rbp-2.

  • the int at rbp-8: Since the aligned 4-byte slot from rbp-4 to rbp-1 isn't entirely free (the char occupies rbp-1), the next available naturally-aligned location is rbp-8.

Or with gcc 4.7 and newer, -4 is the first available spot for an int, and -5 is the next byte below that.
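
The placement rule described above can be sketched as a toy model in C. This only illustrates "take the first naturally-aligned slot below what is already in use"; it is not gcc's actual allocator. The names place_below, layout_char_first, and layout_int_first are made up for this sketch, and the model deliberately ignores any grouping of same-sized variables.

```c
#include <assert.h>

/* Toy model (an assumption for illustration, not gcc's real allocator):
 * hand out stack slots downward from the frame pointer, one variable at a
 * time. *cursor is the lowest offset already in use (0 means nothing yet).
 * Returns the new variable's offset from %rbp, which is always negative. */
static long place_below(long *cursor, long size, long align) {
    assert(size > 0 && align > 0);
    long offset = *cursor - size;        /* first spot that is big enough    */
    long rem = (-offset) % align;        /* how far past an aligned boundary */
    if (rem != 0)
        offset -= align - rem;           /* move down to natural alignment   */
    *cursor = offset;
    return offset;
}

/* Char placed first, then int (the order seen in the question's asm). */
void layout_char_first(long *char_off, long *int_off) {
    long cursor = 0;
    *char_off = place_below(&cursor, 1, 1);
    *int_off  = place_below(&cursor, 4, 4);
}

/* Int placed first, then char (the order seen with gcc 4.7 and newer). */
void layout_int_first(long *int_off, long *char_off) {
    long cursor = 0;
    *int_off  = place_below(&cursor, 4, 4);
    *char_off = place_below(&cursor, 1, 1);
}
```

With the char placed first the model yields offsets -1 and -8; with the int placed first it yields -4 and -5, matching the two listings.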


RE: space saving:

It's true that putting the char at -5 makes the lowest touched address %rsp-5, instead of %rsp-8, but this doesn't save anything.

The stack pointer is 16B-aligned in the AMD64 SysV ABI. (Technically, %rsp+8 (the start of stack args) is aligned on function entry, before you push anything.) The only way for %rbp-8 to touch a new page or cache-line that %rbp-5 wouldn't is for the stack to be less than 4B-aligned. This is extremely unlikely, even in 32bit code.
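
The alignment argument can be checked by brute force. The sketch below uses made-up frame-pointer values and an assumed 64-byte line size; same_line and low_bytes_share_line are hypothetical helper names, not part of any library.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* assumed cache-line size, as in the question */

/* Returns 1 if both addresses fall in the same LINE_SIZE-byte block. */
int same_line(uintptr_t a, uintptr_t b) {
    return (a / LINE_SIZE) == (b / LINE_SIZE);
}

/* For every made-up rbp value that is a multiple of `align` within a few
 * cache lines, check whether rbp-5 and rbp-8 share a line.
 * Returns 1 if they always do, 0 as soon as one pair straddles a boundary. */
int low_bytes_share_line(uintptr_t align) {
    assert(align > 0);   /* a zero step would never terminate */
    for (uintptr_t rbp = 4096; rbp < 4096 + 4 * LINE_SIZE; rbp += align) {
        if (!same_line(rbp - 5, rbp - 8))
            return 0;
    }
    return 1;
}
```

For alignments of 4 or 16 the function returns 1. For an alignment of 2 it returns 0, because a frame pointer such as 4102 puts rbp-8 and rbp-5 on opposite sides of a line boundary. Since a page is a multiple of the line size, the same reasoning covers pages.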

As far as how much stack is "allocated" or "owned" by the function: In the AMD64 SysV ABI, the function "owns" the red zone of 128B below %rsp. (That size was chosen because a one-byte displacement can go down to -128.) Signal handlers and any other asynchronous users of the user-space stack will avoid clobbering the red zone, which is why the function can write to memory below %rsp without decrementing %rsp. So from that perspective, it doesn't matter how much of the red zone we use; the chances of a signal handler running out of stack are unaffected.

In 32-bit code, where there's no red zone, gcc reserves space on the stack with sub $16, %esp for either order. (Try with -m32 on godbolt.) So again, it doesn't matter whether we use 5 or 8 bytes, because we reserve in units of 16.
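
As a small arithmetic illustration of reserving in units of 16, assuming that granularity as stated above (reserved_bytes is a hypothetical helper, not something gcc exposes):

```c
#include <assert.h>
#include <limits.h>

/* Rounds a byte count up to the next multiple of 16, modelling the
 * 16-byte reservation granularity described in the text. */
unsigned reserved_bytes(unsigned used) {
    assert(used <= UINT_MAX - 15u);   /* keep the addition from wrapping */
    return (used + 15u) & ~15u;
}
```

Both 5 and 8 bytes of locals round up to the same 16-byte reservation, so neither ordering saves stack space.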

When there are many char and int variables, gcc packs the chars into 4B groups, instead of losing space to fragmentation, even when the declarations are mixed together:

void many_vars(void) {
  char tmp = 1;  int i=1;
  char t2 = 2;   int i2 = 2;
  char t3 = 3;   int i3 = 3;
  char t4 = 4;
}

gcc 4.6.4 -O0 -fverbose-asm helpfully labels which store is for which variable (which is why compiler asm output is preferable to disassembly), and produces:

    pushq   %rbp  #
    movq    %rsp, %rbp      #,
    movb    $1, -4(%rbp)    #, tmp
    movl    $1, -16(%rbp)   #, i
    movb    $2, -3(%rbp)    #, t2
    movl    $2, -12(%rbp)   #, i2
    movb    $3, -2(%rbp)    #, t3
    movl    $3, -8(%rbp)    #, i3
    movb    $4, -1(%rbp)    #, t4
    popq    %rbp    #
    ret

I think variables go in either forward or reverse order of declaration, depending on gcc version, at -O0.


I made a version of your read_array function that works with optimization on:

#include <stddef.h>   // for size_t

// assumes that size is non-zero.  Use a while() instead of do{}while() if you want extra code to check for that case.
void read_array_good(const char* array, size_t size) {
    const volatile char *vp = array;
    do {
      (void) *vp;    // this counts as accessing the volatile memory, with gcc/clang at least
      vp += CACHE_LINE_SIZE/sizeof(vp[0]);
    } while (vp < array+size);
}

Compiles to the following, with gcc 5.3 -O3 -march=haswell:

        addq    %rdi, %rsi      # array, D.2434
.L11:
        movzbl  (%rdi), %eax        # MEM[(const char *)array_1], D.2433
        addq    $64, %rdi       #, array
        cmpq    %rsi, %rdi      # D.2434, array
        jb      .L11        #,
        ret

Casting an expression to void is the canonical way to tell the compiler that a value is used. For example, to suppress unused-variable warnings, you can write (void)my_unused_var;.

For gcc and clang, doing that with a volatile pointer dereference does generate a memory access, with no need for a tmp variable. The C standard is very non-specific about what constitutes access to something that's volatile, so this probably isn't perfectly portable. Another way is to xor the values you read into an accumulator, and then store that to a global. As long as you don't use whole-program optimization, the compiler doesn't know that nothing reads the global, so it can't optimize away the calculation.

See the vmtouch source code for an example of this second technique. (It actually uses a global variable for the accumulator, which makes clunky code. Of course, that hardly matters since it's touching pages, not just cache lines, so it very quickly bottlenecks on TLB misses and page faults, even with a memory read-modify-write in the loop-carried dependency chain.)
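
A minimal sketch of the accumulator technique, assuming a 64-byte cache line as in the question. The names read_sink and read_array_xor are made up for illustration; this is not vmtouch's code, and it keeps the accumulator in a local and only stores to the global once after the loop.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64

/* Global sink: unless whole-program optimization can prove nothing reads it,
 * the compiler has to keep the store, and therefore the loads feeding it. */
unsigned char read_sink;

/* Reads one byte per cache line and XORs the bytes into an accumulator. */
void read_array_xor(const char *array, size_t size) {
    assert(array != NULL);
    unsigned char acc = 0;
    for (size_t i = 0; i < size; i += CACHE_LINE_SIZE)
        acc ^= (unsigned char)array[i];
    read_sink = acc;   /* single store, outside the loop */
}
```

Unlike the volatile version, this relies only on the store to a global being observable, at the cost of an extra xor per iteration.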


I tried and failed to write something that gcc or clang would compile to a function with no prologue (which assumes that size is initially non-zero). GCC always wants to add rsi,rdi for a cmp/jcc loop condition, even with -march=haswell where sub rsi,64/jae can macro-fuse just as well as cmp/jcc. But in general on AMD, what GCC emits has fewer uops inside the loop.

read_array_handtuned_haswell:
.L0:
    movzx   eax, byte [rdi]     ; overwrite the full RAX to avoid any partial-register false deps from writing AL
    add     rdi, 64
    sub     rsi, 64
    jae     .L0           ; or ja, depending on what semantics you want
    ret

Godbolt Compiler Explorer link with all my attempts and trial versions

I can get something similar if the loop-termination condition is je, in a loop like do { ... } while( size -= CL_SIZE ); But I can't seem to convince gcc to catch unsigned borrow when subtracting. It wants to subtract and then cmp -64/jb to detect underflow. It's not that hard to get compilers to check the carry flag after an add to detect carry :/
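
For reference, the count-down form might look like the sketch below. It assumes size is a non-zero multiple of CL_SIZE, since the loop only ends when size reaches exactly zero; read_array_countdown is a hypothetical name, and the returned count exists only so the behaviour can be checked (it would not be in a tuned version).

```c
#include <assert.h>
#include <stddef.h>

#define CL_SIZE 64

/* Touches one byte per cache line, counting size down to zero.
 * Returns the number of lines touched. */
size_t read_array_countdown(const char *array, size_t size) {
    /* Anything else would wrap around and run past the end of the array. */
    assert(size != 0 && size % CL_SIZE == 0);
    const volatile char *vp = array;
    size_t touched = 0;
    do {
        (void)*vp;          /* volatile read, as in read_array_good */
        vp += CL_SIZE;
        touched++;
    } while (size -= CL_SIZE);
    return touched;
}
```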

It's also easy to get compilers to make a 4-insn loop, but not without prologue. e.g. calculate an end pointer (array+size) and increment a pointer until it's greater or equal.

Fortunately this is not a big deal; the loop we do get is good.




Answer 2:


For local variables stored on the stack, the address order depends on the direction in which the stack grows. You can refer to "Does stack grow upward or downward?" for more information.

This is quite weird because int i should be 4 bytes on x86-64 machine. Isn't it?

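Yes, it is: int is 4 bytes on x86-64 (it is long and pointers that are 8 bytes under the System V ABI), so the gap between the two variables comes from alignment, not from the size of int. You can confirm the size by writing a test application that prints sizeof(int).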
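
Such a test application could be as small as the sketch below (print_type_sizes is a made-up name). The values depend on the target ABI; on the usual x86-64 Linux and Windows ABIs, sizeof(int) is reported as 4.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the sizes of a few basic types for the target this is compiled for. */
void print_type_sizes(FILE *out) {
    assert(out != NULL);
    fprintf(out, "sizeof(char)   = %zu\n", sizeof(char));
    fprintf(out, "sizeof(int)    = %zu\n", sizeof(int));
    fprintf(out, "sizeof(long)   = %zu\n", sizeof(long));
    fprintf(out, "sizeof(void *) = %zu\n", sizeof(void *));
}
```

Calling print_type_sizes(stdout) from main is enough to see the numbers for your own compiler and flags (for example with and without -m32).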



Source: https://stackoverflow.com/questions/36298567/why-does-gcc-reorder-the-local-variable-in-function
