Question
I wrote a C program that just reads and writes a large array. I compiled it with gcc -O0 program.c -o program.
Out of curiosity, I disassembled the program with the objdump -S command.
The code and assembly of the read_array and write_array functions are attached at the end of this question.
I'm trying to interpret how gcc compiles these functions; I used // to add my comments and questions.
Take one piece from the beginning of the assembly of the write_array() function:
4008c1: 48 89 7d e8 mov %rdi,-0x18(%rbp) // this is the first parameter of the function
4008c5: 48 89 75 e0 mov %rsi,-0x20(%rbp) // this is the second parameter of the function
4008c9: c6 45 ff 01 movb $0x1,-0x1(%rbp) // comparing with the source code, I think this is the `char tmp` variable
4008cd: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp) // this should be the `int i` variable
What I don't understand is:
1) char tmp is obviously declared after int i in the write_array function. Why does gcc reorder the memory locations of these two local variables?
2) From the offsets, int i is at -0x8(%rbp) and char tmp is at -0x1(%rbp), which would suggest that int i takes 7 bytes. This is quite weird, because int should be 4 bytes on an x86-64 machine, shouldn't it? My speculation is that gcc is doing some alignment?
3) I find gcc's optimization choices quite interesting. Are there good documents/books that explain how gcc works? (This third question may be off-topic; if you think so, please just ignore it. I'm just trying to see if there's a shortcut to learning the underlying mechanisms gcc uses for compilation. :-) )
Below is the relevant function code:
#define CACHE_LINE_SIZE 64
static inline void
read_array(char* array, long size)
{
int i;
char tmp;
for ( i = 0; i < size; i+= CACHE_LINE_SIZE )
{
tmp = array[i];
}
return;
}
static inline void
write_array(char* array, long size)
{
int i;
char tmp = 1;
for ( i = 0; i < size; i+= CACHE_LINE_SIZE )
{
array[i] = tmp;
}
return;
}
Below is the disassembled code for write_array, from gcc -O0:
00000000004008bd <write_array>:
4008bd: 55 push %rbp
4008be: 48 89 e5 mov %rsp,%rbp
4008c1: 48 89 7d e8 mov %rdi,-0x18(%rbp)
4008c5: 48 89 75 e0 mov %rsi,-0x20(%rbp)
4008c9: c6 45 ff 01 movb $0x1,-0x1(%rbp)
4008cd: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp)
4008d4: eb 13 jmp 4008e9 <write_array+0x2c>
4008d6: 8b 45 f8 mov -0x8(%rbp),%eax
4008d9: 48 98 cltq
4008db: 48 03 45 e8 add -0x18(%rbp),%rax
4008df: 0f b6 55 ff movzbl -0x1(%rbp),%edx
4008e3: 88 10 mov %dl,(%rax)
4008e5: 83 45 f8 40 addl $0x40,-0x8(%rbp)
4008e9: 8b 45 f8 mov -0x8(%rbp),%eax
4008ec: 48 98 cltq
4008ee: 48 3b 45 e0 cmp -0x20(%rbp),%rax
4008f2: 7c e2 jl 4008d6 <write_array+0x19>
4008f4: 5d pop %rbp
4008f5: c3 retq
Answer 1:
Even at -O0, gcc doesn't emit definitions for static inline functions unless there's a caller. In that case, it doesn't actually inline: instead it emits a stand-alone definition. So I guess your disassembly is from that.
Are you using a really old gcc version? gcc 4.6.4 puts the vars in that order on the stack, but 4.7.3 and later use the other order:
movb $1, -5(%rbp) #, tmp
movl $0, -4(%rbp) #, i
In your asm they're stored in order of initialization rather than declaration, but I think that's just by chance, since the order changed with gcc 4.7. Also, tacking an initializer onto the int (int i = 1;) doesn't change the allocation order, so that completely torpedoes that theory.
Remember that gcc is designed around a series of transformations from source to asm, so -O0 doesn't mean "no optimization". You should think of -O0 as leaving out some things that -O3 normally does. There is no option that tries to make a literal-as-possible translation from source to asm.
Once gcc does decide which order to allocate space for them:
- the char at %rbp-1: that's the first available location that can hold a char. If there were another char that needed storing, it could go at %rbp-2.
- the int at %rbp-8: since the 4 bytes from %rbp-1 to %rbp-4 aren't free, the next available naturally-aligned location is %rbp-8.
Or with gcc 4.7 and newer, -4 is the first available spot for an int, and -5 is the next byte below that.
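The layout is easy to observe directly. This little sketch (mine, not from the question) prints the addresses the compiler chose for the two locals; the C standard says nothing about their relative placement, so the result varies with compiler version and flags:

```c
#include <stdio.h>
#include <stdint.h>

// Returns the byte distance from tmp to i in this build's stack frame.
// The sign tells you which variable the compiler placed lower; expect
// different answers from gcc 4.6 vs 4.7+, and at -O0 vs -O3.
long local_layout(void) {
    int i = 0;
    char tmp = 1;
    printf("&i   = %p\n", (void *)&i);
    printf("&tmp = %p\n", (void *)&tmp);
    return (long)((uintptr_t)&i - (uintptr_t)&tmp);
}
```

Compile with -O0 and the addresses line up with the offsets in the disassembly above; with optimization on, taking the addresses is what forces the variables onto the stack at all.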
RE: space saving: it's true that putting the char at -5 makes the lowest touched address %rsp-5 instead of %rsp-8, but this doesn't save anything.
The stack pointer is 16B-aligned in the AMD64 SysV ABI. (Technically, %rsp+8, the start of stack args, is aligned on function entry, before you push anything.) The only way for %rbp-8 to touch a new page or cache line that %rbp-5 wouldn't is for the stack to be less than 4B-aligned, which is extremely unlikely, even in 32-bit code.
As far as how much stack is "allocated" or "owned" by the function: in the AMD64 SysV ABI, the function "owns" the red zone of 128B below %rsp. (That size was chosen because a one-byte displacement can go down to -128.) Signal handlers and any other asynchronous users of the user-space stack will avoid clobbering the red zone, which is why the function can write to memory below %rsp without decrementing %rsp. So from that perspective, it doesn't matter how much of the red zone we use; the chances of a signal handler running out of stack are unaffected.
In 32-bit code, where there's no red zone, gcc reserves space on the stack with sub $16, %esp for either order (try with -m32 on Godbolt). So again, it doesn't matter whether we use 5 or 8 bytes, because we reserve in units of 16.
When there are many char and int variables, gcc packs the chars into 4B groups instead of losing space to fragmentation, even when the declarations are interleaved:
void many_vars(void) {
char tmp = 1; int i=1;
char t2 = 2; int i2 = 2;
char t3 = 3; int i3 = 3;
char t4 = 4;
}
with gcc 4.6.4 -O0 -fverbose-asm, which helpfully labels which store is for which variable (this is why compiler asm output is preferable to disassembly):
pushq %rbp #
movq %rsp, %rbp #,
movb $1, -4(%rbp) #, tmp
movl $1, -16(%rbp) #, i
movb $2, -3(%rbp) #, t2
movl $2, -12(%rbp) #, i2
movb $3, -2(%rbp) #, t3
movl $3, -8(%rbp) #, i3
movb $4, -1(%rbp) #, t4
popq %rbp #
ret
I think variables go in either forward or reverse order of declaration, depending on gcc version, at -O0.
I made a version of your read_array function that works with optimization on:
// assumes that size is non-zero. Use a while() instead of do{}while() if you want extra code to check for that case.
void read_array_good(const char* array, size_t size) {
const volatile char *vp = array;
do {
(void) *vp; // this counts as accessing the volatile memory, with gcc/clang at least
vp += CACHE_LINE_SIZE/sizeof(vp[0]);
} while (vp < array+size);
}
Compiles to the following, with gcc 5.3 -O3 -march=haswell:
addq %rdi, %rsi # array, D.2434
.L11:
movzbl (%rdi), %eax # MEM[(const char *)array_1], D.2433
addq $64, %rdi #, array
cmpq %rsi, %rdi # D.2434, array
jb .L11 #,
ret
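For completeness, a minimal harness for exercising read_array_good (the function is repeated here so the snippet compiles standalone; the 1 MiB buffer size is an arbitrary choice for the example):

```c
#include <stdlib.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64

// read_array_good from above, reproduced verbatim.
void read_array_good(const char *array, size_t size) {
    const volatile char *vp = array;
    do {
        (void)*vp;                               // volatile read survives -O3
        vp += CACHE_LINE_SIZE / sizeof(vp[0]);
    } while (vp < array + size);
}

// Touch one byte per cache line of a freshly allocated buffer.
int touch_buffer(void) {
    size_t size = 1u << 20;                      // 1 MiB, arbitrary
    char *buf = malloc(size);
    if (!buf)
        return 1;
    read_array_good(buf, size);
    free(buf);
    return 0;
}
```

Remember the do{}while structure means the caller must guarantee size is non-zero, as noted in the comment above the function.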
Casting an expression to void is the canonical way to tell the compiler that a value is used, e.g. to suppress unused-variable warnings you can write (void)my_unused_var;.
For gcc and clang, doing that with a volatile pointer dereference does generate a memory access, with no need for a tmp variable. The C standard is very non-specific about what constitutes an access to something that's volatile, so this probably isn't perfectly portable. Another way is to xor the values you read into an accumulator, and then store that to a global. As long as you don't use whole-program optimization, the compiler doesn't know that nothing reads the global, so it can't optimize away the calculation.
See the vmtouch source code for an example of this second technique. (It actually uses a global variable for the accumulator, which makes clunky code. Of course, that hardly matters since it's touching pages, not just cache lines, so it very quickly bottlenecks on TLB misses and page faults, even with a memory read-modify-write in the loop-carried dependency chain.)
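A sketch of that xor-into-a-global technique (the name read_sink is made up for this example; vmtouch's actual global differs):

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64

// A global with external linkage: without whole-program optimization the
// compiler must assume some other translation unit reads it, so the loop
// below can't be optimized away. (The name read_sink is hypothetical.)
unsigned char read_sink;

void read_array_xor(const char *array, size_t size) {
    unsigned char acc = 0;
    for (size_t i = 0; i < size; i += CACHE_LINE_SIZE)
        acc ^= (unsigned char)array[i];   // touch one byte per cache line
    read_sink = acc;                      // one store keeps all the reads live
}
```

Note the accumulator stays in a register inside the loop; only the final store hits memory, which avoids the clunky global-accumulator code the vmtouch aside mentions.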
I tried and failed to write something that gcc or clang would compile to a function with no prologue (which assumes that size is initially non-zero). GCC always wants to add rsi,rdi for a cmp/jcc loop condition, even with -march=haswell where sub rsi,64/jae can macro-fuse just as well as cmp/jcc. But in general on AMD, what GCC does gives fewer uops inside the loop, since only cmp/test can macro-fuse with jcc there.
read_array_handtuned_haswell:
.L0
movzx eax, byte [rdi] ; overwrite the full RAX to avoid any partial-register false deps from writing AL
add rdi, 64
sub rsi, 64
jae .L0 ; or ja, depending on what semantics you want
ret
Godbolt Compiler Explorer link with all my attempts and trial versions
I can get something similar if the loop-termination condition is je, in a loop like do { ... } while( size -= CL_SIZE );. But I can't seem to convince gcc to catch the unsigned borrow when subtracting: it wants to subtract and then cmp against -64/jb to detect the wraparound. It's not that hard to get compilers to check the carry flag after an add to detect carry :/
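That count-down idiom, written out in C (my reconstruction of the loop sketched above; it only works when size is a non-zero multiple of CACHE_LINE_SIZE, otherwise the unsigned subtraction wraps and the loop reads far past the buffer):

```c
#include <stddef.h>

#define CACHE_LINE_SIZE 64

// Loop exit is "size reached zero", which compilers can turn into sub/je.
// Precondition: size is a non-zero multiple of CACHE_LINE_SIZE.
void read_array_countdown(const char *array, size_t size) {
    const volatile char *vp = array;
    do {
        (void)*vp;                  // volatile read, not optimized away
        vp += CACHE_LINE_SIZE;
    } while (size -= CACHE_LINE_SIZE);
}
```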
It's also easy to get compilers to make a 4-insn loop, but not without prologue. e.g. calculate an end pointer (array+size) and increment a pointer until it's greater or equal.
Fortunately this is not a big deal; the loop we do get is good.
Answer 2:
For local variables saved on the stack, the address order depends on the direction the stack grows. You can refer to Does stack grow upward or downward? for more information.
This is quite weird because int i should be 4 bytes on x86-64 machine. Isn't it?
Actually, on a typical x86-64 system (the LP64 data model used by Linux and macOS), the size of int is 4 bytes; it's long and pointers that are 8 bytes. You can confirm it by writing a test application that prints sizeof(int).
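A quick way to check (the expected values in the comments assume an LP64 platform such as x86-64 Linux or macOS; 64-bit Windows uses LLP64, where long is also 4 bytes):

```c
#include <stdio.h>

// Print the basic type sizes; under LP64 expect 4 / 8 / 8.
void print_sizes(void) {
    printf("sizeof(int)   = %zu\n", sizeof(int));    // 4 under LP64
    printf("sizeof(long)  = %zu\n", sizeof(long));   // 8 under LP64
    printf("sizeof(void*) = %zu\n", sizeof(void *)); // 8 on any x86-64 ABI
}
```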
Source: https://stackoverflow.com/questions/36298567/why-does-gcc-reorder-the-local-variable-in-function