Inlining of vararg functions

不知归路 2021-01-17 08:38

While playing about with optimisation settings, I noticed an interesting phenomenon: functions taking a variable number of arguments (...) never seemed to get i

4 answers
  • 2021-01-17 08:48

    A variable-argument implementation generally works as follows: take the first address on the stack after the format string and, while parsing the format string, read the value at that position as the required data type. Then advance the argument pointer by the size of that data type, move on in the format string, and read the value at the new position as the next required data type, and so on.

    Some values are automatically converted (i.e. promoted) to "larger" types: char or short is promoted to int, and float to double. These default argument promotions are specified by the C standard; what is implementation-dependent is the size of the resulting types.

    Certainly, you do not need a format string, but without one you need some other way to know the types of the arguments passed in (for example: all ints, all doubles, or three ints followed by three doubles).
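
    A sketch of the format-less variant, assuming the convention "a count followed by that many ints" (sum_ints is a hypothetical name):

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical helper: the caller promises that exactly `count` int
   arguments follow; nothing in the call itself can verify that. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);
    va_start(ap, count);
    for (int i = 0; i < count; ++i)
        total += va_arg(ap, int);   /* every argument is read as int */
    va_end(ap);
    return total;
}
```

    Here the type knowledge lives in the contract between caller and callee rather than in a string, so a call like sum_ints(3, 1, 2, 3) works only because both sides agree on it.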

    So this is the short theory.

    Now to the practice: as the comment from n.m. above shows, gcc does not inline functions that handle variable arguments. Possibly the operations involved in handling the variable arguments are complex enough that inlining would increase the code size to a suboptimal degree, so inlining these functions is simply not worthwhile.

    EDIT:

    After a quick test with VS2012, I don't seem to be able to convince the compiler to inline the function with the variable arguments. Regardless of the combination of flags in the "Optimization" tab of the project, there is always a call to test and there is always a test function. And indeed:

    http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx

    says that

    Even with __forceinline, the compiler cannot inline code in all circumstances. The compiler cannot inline a function if: ...

    • The function has a variable argument list.
  • 2021-01-17 08:51

    The point of inlining is that it reduces function call overhead.

    But for varargs, there is very little to be gained in general.
    Consider this code in the body of that function:

    if (blah)
    {
        printf("%d", va_arg(vl, int));
    }
    else
    {
        printf("%s", va_arg(vl, char *));
    }
    

    How is the compiler supposed to inline it? Doing that requires the compiler to push everything on the stack in the correct order anyway, even though there isn't any function being called. The only thing that's optimized away is a call/ret instruction pair (and maybe pushing/popping ebp and whatnot). The memory manipulations cannot be optimized away, and the parameters cannot be passed in registers. So it's unlikely that you'll gain anything notable by inlining varargs.
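
    To make the fragment concrete, here is one way it could sit inside a complete function. This is only a sketch: render is a hypothetical name, and it writes into a caller-supplied buffer with snprintf instead of printing, so that the result can be inspected.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper around the fragment above: `blah` decides at run
   time whether the next variadic argument is read as int or char *. */
int render(char *out, size_t size, int blah, ...)
{
    va_list vl;
    int written;

    assert(out != NULL && size > 0);
    va_start(vl, blah);
    if (blah)
        written = snprintf(out, size, "%d", va_arg(vl, int));
    else
        written = snprintf(out, size, "%s", va_arg(vl, char *));
    va_end(vl);
    return written;
}
```

    The type requested from va_arg depends on a value that is only known when the function runs, which is what makes the argument handling hard to fold away at the call site.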

  • 2021-01-17 08:52

    At least on x86-64, the passing of varargs is quite complex (because arguments are passed in registers). Other architectures may not be quite so complex, but it is rarely trivial. In particular, having a stack frame or frame pointer to refer to when getting each argument may be required. These sorts of rules may well stop the compiler from inlining the function.

    The code for x86-64 includes storing all the integer argument registers and eight SSE registers onto the stack.
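
    For orientation, the bookkeeping that goes with that register save area can be sketched in C. The layout follows the va_list record described by the System V x86-64 ABI; the names va_list_sketch and sketch_va_start are made up here, and real code should only ever use <stdarg.h>. The values 8 and 48 stored by the listings that follow correspond to gp_offset and fp_offset for a function with one named pointer parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Mirror of the va_list record described by the System V x86-64 ABI.
   The struct name is made up for illustration. */
struct va_list_sketch {
    unsigned int gp_offset;    /* offset of the next unread integer register slot */
    unsigned int fp_offset;    /* offset of the next unread SSE register slot */
    void *overflow_arg_area;   /* next argument that was passed on the stack */
    void *reg_save_area;       /* where the prologue stored the registers */
};

#define GP_REGS 6u   /* rdi, rsi, rdx, rcx, r8, r9 */
#define GP_SLOT 8u   /* bytes per saved integer register */
#define FP_REGS 8u   /* xmm0 to xmm7 */
#define FP_SLOT 16u  /* bytes per saved SSE register */

/* What va_start records for a function with the given numbers of named
   integer and floating-point parameters (hypothetical helper). */
struct va_list_sketch sketch_va_start(unsigned named_gp, unsigned named_fp,
                                      void *stack_args, void *save_area)
{
    struct va_list_sketch v;

    assert(named_gp <= GP_REGS && named_fp <= FP_REGS);
    v.gp_offset = named_gp * GP_SLOT;
    v.fp_offset = GP_REGS * GP_SLOT + named_fp * FP_SLOT;
    v.overflow_arg_area = stack_args;
    v.reg_save_area = save_area;
    return v;
}
```

    With one named integer parameter and no named floating-point parameters, this gives gp_offset 8 and fp_offset 48, and the save area spans 6 * 8 + 8 * 16 = 176 bytes.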

    This is the function from the original code compiled with Clang:

    test:                                   # @test
        subq    $200, %rsp
        testb   %al, %al
        je  .LBB1_2
    # BB#1:                                 # %entry
        movaps  %xmm0, 48(%rsp)
        movaps  %xmm1, 64(%rsp)
        movaps  %xmm2, 80(%rsp)
        movaps  %xmm3, 96(%rsp)
        movaps  %xmm4, 112(%rsp)
        movaps  %xmm5, 128(%rsp)
        movaps  %xmm6, 144(%rsp)
        movaps  %xmm7, 160(%rsp)
    .LBB1_2:                                # %entry
        movq    %r9, 40(%rsp)
        movq    %r8, 32(%rsp)
        movq    %rcx, 24(%rsp)
        movq    %rdx, 16(%rsp)
        movq    %rsi, 8(%rsp)
        leaq    (%rsp), %rax
        movq    %rax, 192(%rsp)
        leaq    208(%rsp), %rax
        movq    %rax, 184(%rsp)
        movl    $48, 180(%rsp)
        movl    $8, 176(%rsp)
        movq    stdout(%rip), %rdi
        leaq    176(%rsp), %rdx
        movl    $.L.str, %esi
        callq   vfprintf
        addq    $200, %rsp
        retq
    

    and from gcc:

    test.constprop.0:
        .cfi_startproc
        subq    $216, %rsp
        .cfi_def_cfa_offset 224
        testb   %al, %al
        movq    %rsi, 40(%rsp)
        movq    %rdx, 48(%rsp)
        movq    %rcx, 56(%rsp)
        movq    %r8, 64(%rsp)
        movq    %r9, 72(%rsp)
        je  .L2
        movaps  %xmm0, 80(%rsp)
        movaps  %xmm1, 96(%rsp)
        movaps  %xmm2, 112(%rsp)
        movaps  %xmm3, 128(%rsp)
        movaps  %xmm4, 144(%rsp)
        movaps  %xmm5, 160(%rsp)
        movaps  %xmm6, 176(%rsp)
        movaps  %xmm7, 192(%rsp)
    .L2:
        leaq    224(%rsp), %rax
        leaq    8(%rsp), %rdx
        movl    $.LC0, %esi
        movq    stdout(%rip), %rdi
        movq    %rax, 16(%rsp)
        leaq    32(%rsp), %rax
        movl    $8, 8(%rsp)
        movl    $48, 12(%rsp)
        movq    %rax, 24(%rsp)
        call    vfprintf
        addq    $216, %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
    

    In clang for x86, it is much simpler:

    test:                                   # @test
        subl    $28, %esp
        leal    36(%esp), %eax
        movl    %eax, 24(%esp)
        movl    stdout, %ecx
        movl    %eax, 8(%esp)
        movl    %ecx, (%esp)
        movl    $.L.str, 4(%esp)
        calll   vfprintf
        addl    $28, %esp
        retl
    

    There's nothing really stopping any of the above code from being inlined as such, so it would appear that it is simply a policy decision by the compiler writers. Of course, for a call to something like printf, it's pretty meaningless to optimise away a call/return pair at the cost of the code expansion. After all, printf is NOT a small, short function.

    (A decent part of my work for most of the past year has been to implement printf in an OpenCL environment, so I know far more than most people will ever even look up about format specifiers and various other tricky parts of printf)

    Edit: The OpenCL compiler we use WILL inline calls to varargs functions, so it is possible to implement such a thing. It won't do it for calls to printf, because that bloats the code considerably, but by default our compiler inlines EVERYTHING, all the time, no matter what it is... And it does work, but we found that having 2-3 copies of printf in the code makes it REALLY huge (with all sorts of other drawbacks, including final code generation taking a lot longer due to some bad choices of algorithms in the compiler backend), so we had to add code to STOP the compiler doing that...

  • 2021-01-17 09:15

    I do not expect that it would ever be possible to inline a varargs function, except in the most trivial case.

    A varargs function that had no arguments, or that did not access any of its arguments, or that accessed only the fixed arguments preceding the variable ones could be inlined by rewriting it as an equivalent function that did not use varargs. This is the trivial case.
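
    A small sketch of this trivial case (scale, scale_fixed and check_equivalence are hypothetical names):

```c
#include <assert.h>

/* Hypothetical variadic function that never calls va_start, so the extra
   arguments are evaluated by the caller and then ignored. */
int scale(int factor, ...)
{
    return factor * 2;
}

/* The non-variadic equivalent that could be substituted and then inlined. */
int scale_fixed(int factor)
{
    return factor * 2;
}

/* Usage: both forms give the same result for the same fixed argument. */
void check_equivalence(void)
{
    assert(scale(21, "ignored", 3.0) == scale_fixed(21));
}
```

    Because the body never touches the variadic part, the rewrite changes nothing observable except that the ignored arguments no longer need to be passed.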

    A varargs function that accesses its variadic arguments does so by executing code generated by the va_start and va_arg macros, which rely on the arguments being laid out in memory in some way. A compiler that performed inlining simply to remove the overhead of a function call would still need to create the data structure to support those macros. A compiler that attempted to remove all the machinery of the function call would have to analyse and optimise away those macros as well. And it would still fail if the variadic function called another function, passing the va_list as an argument.
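
    The last situation is very common in practice. A minimal sketch, assuming a wrapper of this shape (format_into is a hypothetical name; vsnprintf is the standard library function):

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper: it never calls va_arg itself, it only hands the
   va_list on to vsnprintf, which does the actual argument walking. */
int format_into(char *out, size_t size, const char *fmt, ...)
{
    va_list ap;
    int written;

    assert(out != NULL && fmt != NULL);
    va_start(ap, fmt);
    written = vsnprintf(out, size, fmt, ap);
    va_end(ap);
    return written;
}
```

    Even if the wrapper's body were copied into its caller, the va_list would still have to exist as a real object in memory, because vsnprintf receives it as an ordinary argument.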

    I do not see a feasible path for this second case.
