Inlining of vararg functions

Question

While playing about with optimisation settings, I noticed an interesting phenomenon: functions taking a variable number of arguments (...) never seemed to get inlined. (Obviously this behavior is compiler-specific, but I've tested on a couple of different systems.)

For example, compiling the following small program:

#include <stdarg.h>
#include <stdio.h>

static inline void test(const char *format, ...)
{
  va_list ap;
  va_start(ap, format);
  vprintf(format, ap);
  va_end(ap);
}

int main()
{
  test("Hello %s\n", "world");
  return 0;
}

will seemingly always result in a (possibly mangled) test symbol appearing in the resulting executable (tested with Clang and GCC in both C and C++ modes on MacOS and Linux). If one modifies the signature of test() to take a plain string which is passed to printf(), the function is inlined from -O1 upwards by both compilers as you'd expect.
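For comparison, the plain-string variant mentioned above might look like the following minimal sketch. The name test_plain and the int return value are assumptions added for illustration; the description only says that the function takes a plain string and passes it to printf().

```c
#include <stdio.h>

/* Hypothetical non-variadic counterpart of test(): the parameter list is
 * fixed, so no va_list has to be built inside the function. */
static inline int test_plain(const char *name)
{
    /* Returns printf's result, i.e. the number of characters written. */
    return printf("Hello %s\n", name);
}
```

Calling test_plain("world") prints the same text as the variadic version.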

I suspect this is to do with the voodoo magic used to implement varargs, but how exactly this is usually done is a mystery to me. Can anybody enlighten me as to how compilers typically implement vararg functions, and why this seemingly prevents inlining?

Answer 1:

At least on x86-64, passing variable arguments is quite complex, because arguments are passed in registers. Other architectures may be less complex, but it is rarely trivial. In particular, having a stack frame or frame pointer to refer to when fetching each argument may be required. These sorts of rules may well stop the compiler from inlining the function.

The code for x86-64 includes saving all the integer argument registers, and eight SSE registers, onto the stack (the SSE registers only when %al is non-zero, as the testb/je pair in the listings shows).

This is the function from the original code compiled with Clang:

test:                                   # @test
    subq    $200, %rsp
    testb   %al, %al
    je  .LBB1_2
# BB#1:                                 # %entry
    movaps  %xmm0, 48(%rsp)
    movaps  %xmm1, 64(%rsp)
    movaps  %xmm2, 80(%rsp)
    movaps  %xmm3, 96(%rsp)
    movaps  %xmm4, 112(%rsp)
    movaps  %xmm5, 128(%rsp)
    movaps  %xmm6, 144(%rsp)
    movaps  %xmm7, 160(%rsp)
.LBB1_2:                                # %entry
    movq    %r9, 40(%rsp)
    movq    %r8, 32(%rsp)
    movq    %rcx, 24(%rsp)
    movq    %rdx, 16(%rsp)
    movq    %rsi, 8(%rsp)
    leaq    (%rsp), %rax
    movq    %rax, 192(%rsp)
    leaq    208(%rsp), %rax
    movq    %rax, 184(%rsp)
    movl    $48, 180(%rsp)
    movl    $8, 176(%rsp)
    movq    stdout(%rip), %rdi
    leaq    176(%rsp), %rdx
    movl    $.L.str, %esi
    callq   vfprintf
    addq    $200, %rsp
    retq

and from gcc:

test.constprop.0:
    .cfi_startproc
    subq    $216, %rsp
    .cfi_def_cfa_offset 224
    testb   %al, %al
    movq    %rsi, 40(%rsp)
    movq    %rdx, 48(%rsp)
    movq    %rcx, 56(%rsp)
    movq    %r8, 64(%rsp)
    movq    %r9, 72(%rsp)
    je  .L2
    movaps  %xmm0, 80(%rsp)
    movaps  %xmm1, 96(%rsp)
    movaps  %xmm2, 112(%rsp)
    movaps  %xmm3, 128(%rsp)
    movaps  %xmm4, 144(%rsp)
    movaps  %xmm5, 160(%rsp)
    movaps  %xmm6, 176(%rsp)
    movaps  %xmm7, 192(%rsp)
.L2:
    leaq    224(%rsp), %rax
    leaq    8(%rsp), %rdx
    movl    $.LC0, %esi
    movq    stdout(%rip), %rdi
    movq    %rax, 16(%rsp)
    leaq    32(%rsp), %rax
    movl    $8, 8(%rsp)
    movl    $48, 12(%rsp)
    movq    %rax, 24(%rsp)
    call    vfprintf
    addq    $216, %rsp
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
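The constants 8 and 48 stored by both listings above are the initial gp_offset and fp_offset fields of the va_list that the System V x86-64 ABI defines: the named format parameter has already used the first 8-byte integer slot, and the SSE slots start after the six 8-byte integer slots. The sketch below models that structure and how an integer argument is fetched from it. The type and function names are made up for illustration, and it assumes 8-byte long long; real compilers generate this logic inline from va_arg rather than calling a helper.

```c
#include <string.h>

/* One va_list element as laid out by the System V x86-64 ABI.
 * Field names follow the ABI; the type name is hypothetical. */
typedef struct {
    unsigned int gp_offset;   /* offset of the next unread integer register slot */
    unsigned int fp_offset;   /* offset of the next unread SSE register slot */
    void *overflow_arg_area;  /* next argument that was passed on the stack */
    void *reg_save_area;      /* block where the prologue spilled the registers */
} sketch_va_list;

/* Hypothetical model of va_arg(ap, long long), for illustration only. */
static long long sketch_next_integer(sketch_va_list *ap)
{
    long long value;

    if (ap->gp_offset < 48) {
        /* Offsets 0..47 hold the six integer registers, 8 bytes each. */
        memcpy(&value, (char *)ap->reg_save_area + ap->gp_offset, sizeof value);
        ap->gp_offset += 8;
    } else {
        /* Registers exhausted: continue in the caller's stack area. */
        memcpy(&value, ap->overflow_arg_area, sizeof value);
        ap->overflow_arg_area = (char *)ap->overflow_arg_area + 8;
    }
    return value;
}
```

This two-source lookup (register save area first, then the overflow area) is why the prologue has to spill the argument registers to memory before vfprintf can walk them.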

With Clang targeting 32-bit x86, it is much simpler:

test:                                   # @test
    subl    $28, %esp
    leal    36(%esp), %eax
    movl    %eax, 24(%esp)
    movl    stdout, %ecx
    movl    %eax, 8(%esp)
    movl    %ecx, (%esp)
    movl    $.L.str, 4(%esp)
    calll   vfprintf
    addl    $28, %esp
    retl
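In the 32-bit listing, all arguments are already on the stack, so the va_list is just the address computed by leal 36(%esp), %eax, which points at the first variadic argument. A rough model of such a pointer-style va_list is sketched below. The names are hypothetical, and an ordinary int array stands in for the caller's stack slots.

```c
#include <string.h>

/* Hypothetical stand-in for a pointer-style va_list. */
typedef const char *sketch_arg_ptr;

/* Model of va_arg(ap, int): read one slot, then step past it. */
static int sketch_next_int(sketch_arg_ptr *ap)
{
    int value;

    memcpy(&value, *ap, sizeof value);
    *ap += sizeof(int);   /* one stack slot (4 bytes on 32-bit x86) */
    return value;
}
```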

There's nothing really stopping any of the above code from being inlined as such, so it would appear that it is simply a policy decision by the compiler writers. Of course, for a call to something like printf, it's pretty pointless to optimise away a call/return pair at the cost of the code expansion. After all, printf is NOT a small, short function.

(A decent part of my work for most of the past year has been to implement printf in an OpenCL environment, so I know far more than most people will ever even look up about format specifiers and various other tricky parts of printf)

Edit: The OpenCL compiler we use WILL inline calls to varargs functions, so it is possible to implement such a thing. It won't do it for calls to printf, because that bloats the code a great deal. By default, our compiler inlines EVERYTHING, all the time, no matter what it is. That does work, but we found that having 2-3 copies of printf in the code makes it REALLY huge (with all sorts of other drawbacks, including final code generation taking a lot longer due to some bad choices of algorithms in the compiler backend), so we had to add code to STOP the compiler doing that.

Answer 2:

Variable-argument implementations generally follow this algorithm: take the first address on the stack after the format string and, while parsing the format string, read the value at that position as the datatype the format requires. Then advance the argument pointer by the size of that datatype, advance in the format string, and read the value at the new position as the next required datatype, and so on.
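The format-driven walk described above can be sketched with the standard stdarg.h macros. The helper mini_format is hypothetical and deliberately tiny: it understands only %d and %s, copies every other character through, and is not how any real printf is written.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: handles only %d and %s.
 * Returns the number of characters stored, excluding the terminator. */
static size_t mini_format(char *out, size_t cap, const char *format, ...)
{
    size_t used = 0;
    va_list ap;

    if (cap == 0)
        return 0;
    va_start(ap, format);
    for (const char *p = format; *p != '\0' && used + 1 < cap; ++p) {
        if (p[0] == '%' && (p[1] == 'd' || p[1] == 's')) {
            int n;
            /* The conversion character decides which type is fetched next. */
            if (p[1] == 'd')
                n = snprintf(out + used, cap - used, "%d", va_arg(ap, int));
            else
                n = snprintf(out + used, cap - used, "%s", va_arg(ap, char *));
            if (n < 0)
                break;
            /* On truncation snprintf filled the buffer up to cap - 1. */
            used = ((size_t)n < cap - used) ? used + (size_t)n : cap - 1;
            ++p;   /* skip the conversion character */
        } else {
            out[used++] = *p;
        }
    }
    va_end(ap);
    out[used] = '\0';
    return used;
}
```

For instance, mini_format(buf, sizeof buf, "%s has %d items", "cart", 3) fetches a char * first and an int second, purely because the format string says so.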

Some values are automatically converted (i.e. promoted) to "larger" types: char or short is promoted to int, and float to double. These default argument promotions are required by the language standard, although the sizes of the resulting types are implementation-dependent.
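A small sketch of what promotion means for the callee: the caller passes a char and a float, but they have to be fetched as int and double. The function name and its unused first parameter are assumptions made purely for illustration.

```c
#include <stdarg.h>

/* Hypothetical example: expects exactly two variadic arguments,
 * originally a char and a float at the call site. */
static double sum_char_and_float(int unused, ...)
{
    va_list ap;
    double total;

    (void)unused;
    va_start(ap, unused);
    total = va_arg(ap, int);       /* the char arrives as int */
    total += va_arg(ap, double);   /* the float arrives as double */
    va_end(ap);
    return total;
}
```

Fetching with va_arg(ap, char) or va_arg(ap, float) instead would be undefined behaviour, because no argument of those types is ever actually passed.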

Of course, a format string is not strictly required, but without one you need some other way to know the types of the arguments passed in (such as: all ints, or all doubles, or the first 3 ints, then 3 more doubles ...).
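As a minimal sketch of the "all ints" convention, a leading count can replace the format string. The function name sum_ints is hypothetical.

```c
#include <stdarg.h>

/* Hypothetical example: the caller promises that exactly `count`
 * int arguments follow the count itself. */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; ++i)
        total += va_arg(ap, int);   /* type is fixed by convention */
    va_end(ap);
    return total;
}
```

Nothing checks the promise: if the caller passes fewer arguments than count, or arguments of another type, the behaviour is undefined.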

So this is the short theory.

Now, to the practice: as the comment from n.m. above shows, gcc does not inline functions which have variable argument handling. Possibly the operations involved in handling the variable arguments are complex enough that inlining would increase the code size beyond what is worthwhile, so it is simply not worth inlining these functions.

EDIT:

After a quick test with VS2012, I don't seem to be able to convince the compiler to inline the function with the variable arguments. Regardless of the combination of flags in the "Optimization" tab of the project, there is always a call to test and there is always a test function. And indeed:

http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx

says that

Even with __forceinline, the compiler cannot inline code in all circumstances. The compiler cannot inline a function if: ...

The function has a variable argument list.

Answer 3:

The point of inlining is that it reduces function call overhead.

But for varargs, there is very little to be gained in general.
Consider this code in the body of that function:

if (blah)
{
    printf("%d", va_arg(vl, int));
}
else
{
    printf("%s", va_arg(vl, char *));
}

How is the compiler supposed to inline it? Doing that requires the compiler to push everything on the stack in the correct order anyway, even though there isn't any function being called. The only thing that's optimized away is a call/ret instruction pair (and maybe pushing/popping ebp and whatnot). The memory manipulations cannot be optimized away, and the parameters cannot be passed in registers. So it's unlikely that you'll gain anything notable by inlining varargs.

Answer 4:

I do not expect that it would ever be possible to inline a varargs function, except in the most trivial case.

A varargs function that had no arguments, or that did not access any of its arguments, or that accessed only the fixed arguments preceding the variable ones could be inlined by rewriting it as an equivalent function that did not use varargs. This is the trivial case.
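A sketch of this trivial case, with hypothetical names: the first function is variadic in signature only, so it behaves exactly like the fixed-arity function beneath it. Whether a particular compiler actually performs this rewrite is not something claimed here.

```c
/* Variadic in signature only: the body never touches the extra arguments. */
static int scale_variadic(int x, ...)
{
    return x * 2;
}

/* Equivalent fixed-arity function that could stand in for it when inlining. */
static int scale_fixed(int x)
{
    return x * 2;
}
```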

A varargs function that accesses its variadic arguments does so by executing code generated by the va_start and va_arg macros, which rely on the arguments being laid out in memory in some way. A compiler that performed inlining simply to remove the overhead of a function call would still need to create the data structure to support those macros. A compiler that attempted to remove all the machinery of function call would have to analyse and optimise away those macros as well. And it would still fail if the variadic function made a call to another function passing va_list as an argument.
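The last point, passing the va_list on to another function, is the same pattern the test() function in the question uses with vprintf. A minimal sketch with hypothetical helper names follows; vsnprintf is the standard library call that does the real work.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical worker: receives an already-started va_list. */
static int vformat_into(char *buf, size_t cap, const char *format, va_list ap)
{
    return vsnprintf(buf, cap, format, ap);
}

/* Hypothetical variadic front end that only builds and forwards the va_list. */
static int format_into(char *buf, size_t cap, const char *format, ...)
{
    va_list ap;
    int written;

    va_start(ap, format);
    written = vformat_into(buf, cap, format, ap);
    va_end(ap);
    return written;
}
```

Because the worker may live in another translation unit or library, the va_list has to exist as a real object in the platform's layout at the point of the call, so it cannot simply be optimised away.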

I do not see a feasible path for this second case.

Source: https://stackoverflow.com/questions/25482031/inlining-of-vararg-functions

Tags

c++

variadic-functions

inline-functions