I am trying to set arguments using assembly code that are used in a generic function. The arguments of this generic function - that is resident in a dll - are not known duri
The x86-64 Windows calling convention is fairly simple, and makes it possible to write a wrapper function that doesn't know the types of anything. Just load the first 32 bytes of args into registers, and copy the rest to the stack.
You definitely need to make the function call from asm; It can't possibly work reliably to make a bunch of function calls like fill_register_xmm0
and hope that the compiler doesn't clobber any of those registers. The C compiler emits instructions that use the registers, as part of its normal job, including passing args to functions like fill_register_xmm0
.
The only alternative would be to write a C statement with a function call with all the args having the correct type, to get the compiler to emit code to make a function call normally. If there are only a few possible different combinations of args, putting those in if()
blocks might be good.
And BTW, movsd xmm0, variable
probably assembles to movsd xmm0, xmm0
, because the first function arg is passed in XMM0 if it's FP.
In C, prepare a buffer with the args (like in the 32-bit case).
Each one needs to be padded to 8 bytes if it's narrower. See MS's docs for x86-64 __fastcall. (Note that x86-64 __vectorcall
passes __m128
args by value in registers, but for __fastcall
it's strictly true that the args form an array of 8-byte values, after the register args. And storing those into the shadow space creates a full array of all the args.)
Any argument that doesn’t fit in 8 bytes, or is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers.
But the key thing that makes variadic functions easy in the Windows calling convention also works here: The register used for the 2nd arg doesn't depend on the type of the first. i.e. if an FP arg is the first arg, then that uses up an integer register arg-passing slot. So you can only have up to 4 register args, not 4 integer and 4 FP.
If the 4th arg is integer, it goes in R9
, even if it's the first integer arg. Unlike in the x86-64 System V calling convention, where the first integer arg goes in rdi
, regardless of how many earlier FP args are in registers and/or on the stack.
So the asm wrapper that calls the function can load the first 8 bytes into both integer and FP registers! (Variadic functions already require this, so a callee doesn't have to know whether to store the integer or FP register to form that arg array. MS optimized the calling convention for simplicity of variadic callee functions at the expense of efficiency for functions with a mix of integer and FP args.)
The C side that puts all the args into a buffer can look like this:
#include <stdalign.h>
int asmwrapper(const char *argbuf, size_t argp-argbuf, void (*funcpointer)(...));
void somefunc() {
alignas(16) uint64_t argbuf[256/8]; // or char argbuf[256]. But if you choose not to use alignas, then uint64_t will still give 8-byte alignment
char *argp = (char*)argbuf;
for( ; argp < &argbuf[256] ; argp += 8) {
if (figure_out_an_arg()) {
int foo = get_int_arg();
memcpy(argp, &foo, sizeof(foo));
} else if(bar) {
double foo = get_double_arg();
memcpy(argp, &foo, sizeof(foo));
} else
... memcpy whatever size
// or allocate space to pass by ref and memcpy a pointer
}
if (argp == &argbuf[256]) {
// error, ran out of space for args
}
asmwrapper(argbuf, argp-argbuf, funcpointer);
}
Unfortunately I don't think we can directly use argbuf
on the stack as the args + shadow space for a function call. We have no way of stopping the compiler from putting something valuable below argbuf
which would let us just set rsp
to the bottom of it (and save the return address somewhere, maybe at the top of argbuf
by reserving some space for use by the asm).
Anyway, just copying the whole buffer will work. Or actually, load the first 32 bytes into registers (both integer and FP), and only copy the rest. The shadow space doesn't need to be initialized.
argbuf
could be a VLA if you knew ahead of time how big it needed to be, but 256 bytes is pretty small. It's not like reading past the end of it can be a problem, it can't be at the end of a page with unmapped memory later, because our parent function's stack frame definitely takes some space.
;; NASM syntax. For MASM just rename the local labels and add whatever PROC / ENDPROC is needed.
;; UNTESTED
;; rcx: argbuf
;; rdx: length in bytes of the args. 0..256, zero-extended to 64 bits
;; r8 : function pointer
;; reserve rdx bytes of space for arg passing
;; load first 32 bytes of argbuf into integer and FP arg-passing registers
;; copy the rest as stack-args above the shadow space
global asmwrapper
asmwrapper:
push rbp
mov rbp, rsp ; so we can efficiently restore the stack later
mov r10, r8 ; move function pointer to a volatile but non-arg-passing register
; load *both* xmm0-3 and rcx,rdx,r8,r9 from the first 32 bytes of argbuf
; regardless of types or whether there were that many arg bytes
; All bytes are loaded into registers early, some reg->reg transfers are done later
; when we're done with more registers.
; movsd xmm0, [rcx]
; movsd xmm1, [rcx+8]
movaps xmm0, [rcx] ; 16-byte alignment required for argbuf. Use movups to allow misalignment if you want
movhlps xmm1, xmm0 ; use some ALU instructions instead of just loads
; rcx,rdx can't be set yet, still in use for wrapper args
movaps xmm2, [rcx+16] ; it's ok to leave garbage in the high 64-bits of an XMM passing a float or double.
;movhlps xmm3, xmm2 ; the copyloop uses xmm3: do this later
movq r8, xmm2
mov r9, [rcx+24]
mov eax, 32
cmp edx, eax
jbe .small_args ; no copying needed, just shadow space
sub rsp, rdx
and rsp, -16 ; reserve extra space, realigning the stack by 16
; rax=32 on entry, start copying just above shadow space (which doesn't need to be copied)
.copyloop: ; do {
movaps xmm3, [rcx+rax]
movaps [rsp+rax], xmm3 ; indexed addressing modes aren't always optimal, but this loop only runs a couple times.
add eax, 16
cmp eax, edx
jb .copyloop ; } while(bytes_copied < arg_bytes);
.done_arg_copying:
; xmm0,xmm1 have the first 2 qwords of args
movq rcx, xmm0 ; RCX NO LONGER POINTS AT argbuf
movq rdx, xmm1
; xmm2 still has the 2nd 16 bytes of args
;movhlps xmm3, xmm2 ; don't use: false dependency on old value and we just used it.
pshufd xmm3, xmm2, 0xee ; xmm3 = high 64 bits of xmm2. (0xee = _MM_SHUFFLE(3,2,3,2))
; movq xmm3, r9 ; nah, can be multiple uops on AMD
; r8,r9 set earlier
call r10
leave ; restore RSP to its value on entry
ret
; could handle this branchlessly, but copy loop still needs to run zero times
; unless we bump up the min arg_bytes to 48 and sometimes copy an unnecessary 16 bytes
; As much work as possible is before the first branch, so it can happen while a mispredict recovers
.small_args:
sub rsp, rax ; reserve shadow space
;rsp still aligned by 16 after push rbp
jmp .done_arg_copying
;byte count. This wrapper is 82 bytes; would be nice to fit it in 80 so we don't waste 14 bytes before the next function.
;e.g. maybe mov rcx, [rcx] instead of movq rcx, xmm0
;mov eax, $-asmwrapper
align 16
This does assemble (on Godbolt with NASM), but I haven't tested it.
It should perform pretty well, but if you get mispredicts around the cutoff from <= 32 bytes to > 32 bytes, change the branching so it always copies an extra 16 bytes. (Uncomment the cmp
/cmovb
in the version on Godbolt, but the copy loop still needs to start at 32 bytes into each buffer.)
If you often pass very few args, the 16-byte loads might hit a store-forwarding stall from two narrow stores to one wide reload, causing about an extra 8 cycles of latency. This isn't normally a throughput problem, but it can increase the latency before the called function can access its args. If out-of-order execution can't hide that, then it's worth using more load uops to load each 8-byte arg separately. (Especially into integer registers, and then from there to XMM, if the args are mostly integer. That will have lower latency than mem -> xmm -> integer.)
If you have more than a couple args, though, hopefully the first few have committed to L1d and no longer need store forwarding by the time the asm wrapper runs. Or there's enough copying of later args that the first 2 args finish their load + ALU chain early enough not to delay the critical path inside the called function.
Of course, if performance was a huge issue, you'd write the code that figures out the args in asm so you didn't need this copy stuff, or use a library interface with a fixed function signature that a C compiler can call directly. I did try to make this suck as little as possible on modern Intel / AMD mainstream CPUs (http://agner.org/optimize/), but I didn't benchmark it or tune it, so probably it could be improved with some time spent profiling it, especially for some real use-case.
If you know that FP args aren't a possibility for the first 4, you can simplify by just loading integer regs.
So you need to call a function (in a DLL) but only at run-time can you figure out the number and type of parameters. Then you need to perpare the parameters, either on the stack or in registers, depending on the Application Binary Interface/calling convention.
I would use the following approach: some component of your program figures out the number and type of parameters. Let's assume it creates a list of {type, value}, {type, value}, ...
You then pass this list to a function to prepare the ABI call. This will be an assembler function. For a stack-based ABI (32 bit), it just pushes the parameters on to the stack. For a register based ABI, it can prepare the register values and save them as local variables (add sp,nnn
) and once all parameters have been prepared (possibly using registers needed for the call, hence first saving them), loads the registers (a series of mov
instructions) and performs the call
instruction.