AMD has an ABI specification that describes the calling convention to use on x86-64. All OSes follow it, except for Windows, which has its own x86-64 calling convention. Why?
IDK why Windows did what they did. See the end of this answer for a guess. I was curious about how the SysV calling convention was decided on, so I dug into the mailing list archive and found some neat stuff.
It's interesting reading some of those old threads on the AMD64 mailing list, since AMD architects were active on it. For example, choosing register names was one of the hard parts: AMD considered renaming the original 8 registers r0-r7, or calling the new registers stuff like UAX.
Also, feedback from kernel devs identified things that made the original design of syscall and swapgs unusable. In response, AMD updated the instructions to get this sorted out before releasing any actual chips. It's also interesting that in late 2000, the assumption was that Intel probably wouldn't adopt AMD64.
The SysV (Linux) calling convention, and the decision on how many registers should be callee-preserved vs. caller-save, was made initially in Nov 2000, by Jan Hubicka (a gcc developer). He compiled SPEC2000 and looked at code size and number of instructions. That discussion thread bounces around some of the same ideas as answers and comments on this SO question. In a 2nd thread, he proposed the current sequence as optimal and hopefully final, generating smaller code than some alternatives.
He's using the term "global" to mean call-preserved registers, which have to be pushed/popped if used.
The choice of rdi, rsi, rdx as the first three args was motivated by:
- functions that call memset or another C string function on their args (where gcc inlines a rep string operation?)
- rbx is call-preserved because having two call-preserved regs accessible without REX prefixes (rbx and rbp) is a win. Presumably chosen because it's the only other reg that isn't implicitly used by any instruction. (rep string, shift count, and mul/div outputs/inputs touch everything else).
- "We are trying to avoid RCX early in the sequence, since it is register used commonly for special purposes, like EAX, so it has same purpose to be missing in the sequence. Also it can't be used for syscalls and we would like to make syscall sequence to match function call sequence as much as possible."
(background: syscall / sysret unavoidably destroy rcx (with rip) and r11 (with RFLAGS), so the kernel can't see what was originally in rcx when syscall ran.)
The kernel system-call ABI was chosen to match the function call ABI, except for r10 instead of rcx, so libc wrapper functions like mmap(2) can just mov %rcx, %r10 / mov $0x9, %eax / syscall.
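To make the r10-vs-rcx point concrete, here's a minimal sketch in C with GNU inline asm, assuming gcc or clang on x86-64 Linux. raw_syscall6 and map_one_page are hypothetical names made up for illustration, not libc functions, and this is not how any particular libc actually implements its wrappers. On other targets the sketch just falls back to libc's syscall(2) so it still builds.

```c
#define _GNU_SOURCE
#include <sys/mman.h>      /* PROT_*, MAP_* */
#include <sys/syscall.h>   /* SYS_mmap, SYS_munmap, SYS_getpid */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: issue a 6-argument Linux system call.
 * Returns the raw kernel result on x86-64 (negative errno on failure). */
long raw_syscall6(long nr, long a1, long a2, long a3,
                  long a4, long a5, long a6)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* A normal function call would pass the 4th arg in rcx.
     * The syscall instruction overwrites rcx, so the kernel ABI uses r10. */
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;

    __asm__ volatile ("syscall"
                      : "=a"(ret)                      /* result in rax */
                      : "a"(nr),                       /* call number in rax */
                        "D"(a1), "S"(a2), "d"(a3),     /* rdi, rsi, rdx */
                        "r"(r10), "r"(r8), "r"(r9)
                      : "rcx", "r11", "memory");       /* clobbered by syscall */
    return ret;
#else
    /* Assumption: on other targets, let libc pick the registers. */
    return syscall(nr, a1, a2, a3, a4, a5, a6);
#endif
}

/* Map one anonymous read/write page; the flags argument travels in r10. */
long map_one_page(void)
{
    return raw_syscall6(SYS_mmap, 0, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

The first three args and the last two sit in the same registers as for a function call (rdi, rsi, rdx, r8, r9), so only the 4th one has to move. The rcx and r11 clobbers tell the compiler about the registers that syscall destroys.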
Note that the SysV calling convention used by i386 Linux sucks compared to Windows' 32-bit __vectorcall. It passes everything on the stack, and only returns in edx:eax for int64, not for small structs. It's no surprise little effort was made to maintain compatibility with it. When there was no reason not to, they did things like keeping rbx call-preserved, since they decided that having another call-preserved register among the original 8 (which don't need a REX prefix) was good.
Making the ABI optimal is much more important long-term than any other consideration. I think they did a pretty good job. I'm not totally sure about returning structs packed into registers, instead of different fields in different regs. I guess code that passes them around by value without actually operating on the fields wins this way, but the extra work of unpacking seems silly. They could have had more integer return registers, more than just rdx:rax, so returning a struct with 4 members could return them in rdi, rsi, rdx, rax or something.
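To illustrate the unpacking work, here's a rough C model of what "packed into registers" means for a struct of four 32-bit ints. It doesn't inspect real registers: as_return_regs is a hypothetical helper that just copies the 16-byte struct into two 64-bit values standing in for rax and rdx, and it assumes a little-endian target like x86-64. The names quad, make_quad, low_half and high_half are also made up for this sketch.

```c
#include <stdint.h>
#include <string.h>

/* 16 bytes of integer fields: small enough to come back in two
 * 64-bit registers under the x86-64 SysV convention. */
struct quad { int32_t a, b, c, d; };

struct quad make_quad(int32_t a, int32_t b, int32_t c, int32_t d)
{
    struct quad q = { a, b, c, d };
    return q;
}

/* Model only: regs[0] stands in for rax (fields a, b),
 * regs[1] stands in for rdx (fields c, d). */
void as_return_regs(const struct quad *q, uint64_t regs[2])
{
    memcpy(regs, q, sizeof *q);
}

/* The extra work a caller does to get at individual fields:
 * the low field is a plain 32-bit move, the high field needs a shift. */
int32_t low_half(uint64_t reg)  { return (int32_t)(uint32_t)reg; }
int32_t high_half(uint64_t reg) { return (int32_t)(uint32_t)(reg >> 32); }
```

With one field per register (the alternative mused about above), the caller would need neither the shift nor the truncation, at the cost of tying up more return registers.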
They considered passing integers in vector regs, because SSE2 can operate on integers. Fortunately they didn't do that. Integers are used as pointer offsets very often, and a round-trip to stack memory is pretty cheap. Also SSE2 instructions take more code bytes than integer instructions.
I suspect Windows ABI designers might have been aiming to minimize differences between 32 and 64bit for the benefit of people who have to port asm from one to the other, or who can use a couple of #ifdefs in some asm so the same source can more easily build a 32 or 64bit version of a function.
Minimizing changes in the toolchain seems unlikely. An x86-64 compiler needs a separate table of which register is used for what, and what the calling convention is. Having a small overlap with 32bit is unlikely to produce significant savings in toolchain code size / complexity.