Hoisting the dynamic type out of a loop (a.k.a. doing Java the C++ way)

清歌不尽 2020-12-06 05:23

I was discussing the merits of "modern" languages compared to C++ with some friends recently, when the following came up (I think inspired by Java):

Does any C++ c

4 answers
  • 2020-12-06 05:46

    If you're interested in this kind of thing, check out Agner Fog's excellent Software Optimization Manuals. This question is tangentially addressed in the first of the five, Optimizing C++ (pdf) (the others are all about assembly - he's kind of old-school).

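    If f() has no side effects and its return value when called on p is guaranteed not to change, the call can be pulled out of the loop and calculated only once (see "Loop Invariant Code Motion", page 70). Marking f() const is not enough by itself, since a const member function can still read state that changes between calls. Most compilers will do this when they can prove it is safe (see "Comparison of Different Compilers", page 74).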
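
    To make the idea concrete, here is a minimal sketch of that hoisting done by hand. It assumes f() has no side effects and always returns the same value for a given object. The classes mirror the question's A and B, the virtual destructor is my addition, and acc_hoisted is a hypothetical name.

```cpp
struct A {
    virtual ~A() = default;
    virtual int f() const { return 0; }
};

struct B : A {
    int f() const override { return 1; }
};

// Straightforward version: one virtual call per iteration.
int acc(const A* p, unsigned int n) {
    int result = 0;
    for (unsigned int i = 0; i != n; ++i)
        result += p->f();
    return result;
}

// Hand-hoisted version. ASSUMPTION: f() is side-effect free and returns
// the same value on every call, so a single virtual call is enough.
int acc_hoisted(const A* p, unsigned int n) {
    if (n == 0)
        return 0;              // the original loop would not call f() at all
    const int value = p->f();  // the only virtual call
    int result = 0;
    for (unsigned int i = 0; i != n; ++i)
        result += value;
    return result;
}
```

    With a B object and n = 22, both functions return 22. If f() had side effects, or could return different values, the two versions would no longer be equivalent, which is why a compiler has to prove those properties before doing this itself.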

    If that can't be done, it might still be possible to devirtualize the call. This can't be done inside a standalone, callable copy of acc(), because that copy has to work for any dynamic type and so must use a virtual lookup for the sake of correctness. But if acc() is inlined and the type of p is known in the calling scope, it could be done. The calling code would have to look something like this:

    A* aptr = new A(42); // <- The compiler knows exactly what type aptr points to
    acc(aptr, 100);      // <- This would have to be inlined!
    

    But according to that table (page 74), only the GCC compilers make this optimization.

    Finally, the closest optimization (I think) to what you're asking. Could the compiler perform the virtual lookup once, store a function pointer, and then use that function pointer to avoid the virtual lookup inside the loop? I don't see why not. But I don't know if any compilers do so - it's an obscure enough optimization that it's not even mentioned in Agner Fog's compulsively detailed C++ manual.

    For what it's worth, here's what he has to say about function pointers (page 38):

    Calling a function through a function pointer typically takes a few clock cycles more than calling the function directly if the target address can be predicted. The target address is predicted if the value of the function pointer is the same as last time the statement was executed. If the value of the function pointer has changed then the target address is likely to be mispredicted, which causes a long delay. See page 44 about branch prediction. A Pentium M processor may be able to predict the target if the changes of the function pointer follows a simple regular pattern, while Pentium 4 and AMD processors are sure to make a misprediction every time the function pointer has changed.

    And an excerpt about virtual member functions (page 54):

    The time it takes to call a virtual member function is a few clock cycles more than it takes to call a non-virtual member function, provided that the function call statement always calls the same version of the virtual function. If the version changes then you may get a misprediction penalty of 10 - 20 clock cycles. The rules for prediction and misprediction of virtual function calls is the same as for switch statements, as explained on page 45.

    The dispatching mechanism can be bypassed when the virtual function is called on an object of known type, but you cannot always rely on the compiler bypassing the dispatch mechanism even when it would be obvious to do so. See page 73.

    You know the function pointer wouldn't change in your example, so you wouldn't get the misprediction penalty, but he never compares function pointer performance to virtual function performance directly. Both just take "a few" more clock cycles than a regular function call. Maybe it's the same mechanism - if so, that "optimization" would just be adding an extra lookup.

    So it's hard to say, really. The best way to get an answer might just be to have your favourite compiler spit out some optimized assembly and dig through it (unpleasant, but conclusive!).
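
    If you don't want to depend on the compiler, the dynamic type can be hoisted by hand in standard C++. The following is only a sketch under assumptions: it handles just the two classes from the question (with a virtual destructor added), and acc_static is a hypothetical helper name. The type is inspected once with typeid, and the loop then uses a qualified call, which is bound at compile time.

```cpp
#include <typeinfo>

struct A {
    virtual ~A() = default;
    virtual int f() const { return 0; }
};

struct B : A {
    int f() const override { return 1; }
};

// Instantiated once per concrete type. The qualified call p->T::f()
// names T's implementation directly, so there is no virtual lookup
// inside the loop and the call is a candidate for inlining.
template <class T>
int acc_static(const T* p, unsigned int n) {
    int result = 0;
    for (unsigned int i = 0; i != n; ++i)
        result += p->T::f();
    return result;
}

// Inspect the dynamic type once, outside the loop.
int acc(const A* p, unsigned int n) {
    if (typeid(*p) == typeid(B))
        return acc_static(static_cast<const B*>(p), n);
    if (typeid(*p) == typeid(A))
        return acc_static(p, n);

    // Unknown derived type: fall back to ordinary virtual dispatch.
    int result = 0;
    for (unsigned int i = 0; i != n; ++i)
        result += p->f();
    return result;
}
```

    The obvious cost is that acc() has to list every type it wants a fast path for, so this only pays off when the set of hot types is small and known.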

    Hope this helps!

  • 2020-12-06 05:51

    I compiled the above code:

    The only change I made was to make the methods const, because the parameter 'p' to acc() is also const.
    When I compiled it (on a MacBook) using g++ 4.2.1 and -O3, I got the following code (this looks like the loop in acc()).

    It does not look like it is chaining through a series of lookup tables.
    It is a simple load through a register that already holds the object pointer, followed by an indirect call through the vtable.

    L9:
        movq    (%r12), %rax   // Load the vtable pointer from the object (r12 holds p)
        movq    %r12, %rdi     // Set up rdi register as `this` (first argument of the call)
        call    *(%rax)        // Call f(); its address is read from the memory rax points at
        addl    %eax, %r14d
        incl    %ebx
        cmpl    %r13d, %ebx
        jne L9
    

    If I remove the virtual specifiers from the methods, the same loop becomes:

    L16:
        movq    %r14, %rdi     // Set up rdi register as `this` (first argument of the call)
        call    __ZNK1A1fEv    // Call f() directly by address
        addl    %eax, %r13d
        incl    %ebx
        cmpl    %r12d, %ebx
        jne L16
    

    So the difference in the above code is really:

    movq    (%r12), %rax     This loads the vtable pointer from the object that r12
                             points at, so it is a memory read, not a register to
                             register copy. After the first iteration the object is in
                             cache, so the cost is practically nothing and you would
                             struggle to detect it, however many times you call the
                             function.

    call    *(%rax)          Here we have to look up the address to call by reading
                             the vtable slot from memory. Now this could be expensive.

                             But in reality it is not. The first time this is called the
                             vtable entry is pulled into an on-chip memory cache (if it
                             is not there you get a processor stall while it is loaded
                             from memory, or from the next cache up), but after that
                             it will be really fast.

                             It is still not quite as fast as calling a fixed address
                             (the non-virtual version). But the difference is small, and
                             other factors in the code, or plain measurement noise, will
                             usually drown it out.
    

    So, to answer the question: no, the address of the function is not cached for re-use. It is looked up each time through the loop.
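
    To show what the two loads in the virtual loop correspond to, here is a hand-rolled model of the dispatch in plain C++. It is only an illustration: Obj, VTable, Fn, a_f, b_f and both acc_* functions are hypothetical names, and a real compiler's vtable layout has more in it than this single slot.

```cpp
struct Obj;
using Fn = int (*)(const Obj*);

struct VTable { Fn f; };             // one slot, standing in for f()
struct Obj { const VTable* vptr; };  // every object carries a vtable pointer

int a_f(const Obj*) { return 0; }    // plays the role of A::f
int b_f(const Obj*) { return 1; }    // plays the role of B::f

const VTable a_vtable{&a_f};
const VTable b_vtable{&b_f};

// Mirrors the compiled loop above: on every iteration, load the vtable
// pointer from the object, load the slot, then call through it.
int acc_lookup_each_time(const Obj* p, unsigned int n) {
    int result = 0;
    for (unsigned int i = 0; i != n; ++i)
        result += p->vptr->f(p);
    return result;
}

// What caching the address would look like: look it up once, reuse it.
int acc_cached(const Obj* p, unsigned int n) {
    const Fn f = p->vptr->f;         // the only lookup
    int result = 0;
    for (unsigned int i = 0; i != n; ++i)
        result += f(p);              // still an indirect call, but no loads
    return result;
}
```

    For an Obj whose vptr points at b_vtable, both functions return 22 when n is 22. The cached version is only equivalent if nothing changes the object's vptr while the loop runs, which is the kind of fact a compiler would have to establish before making this transformation on real virtual calls.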

    Source that was compiled:

    #include <iostream>
    
    struct A { virtual int f() const { return 0; } };
    
    struct B : A { virtual int f() const { return 1; }};
    
    int acc(const A * p, unsigned int N)
    {
        int result = 0;
    
        for (unsigned int i = 0; i != N; ++i)
            result += p->f();  // #1
    
        return result;
    }
    
    int main()
    {
        A       a;
        B       b;
        std::cout << acc(&a, 20) << "\n";
        std::cout << acc(&b, 22) << "\n";
    }
    

    Full assembly:

      1     .mod_init_func
      2     .align 3
      3     .quad   __GLOBAL__I__Z3accPK1Aj
      4     .section __TEXT,__textcoal_nt,coalesced,pure_instructions
      5     .align 1
      6     .align 4
      7 .globl __ZNK1A1fEv
      8     .weak_definition __ZNK1A1fEv
      9 __ZNK1A1fEv:
     10 LFB1477:
     11     pushq   %rbp
     12 LCFI0:
     13     movq    %rsp, %rbp
     14 LCFI1:
     15     xorl    %eax, %eax
     16     leave
     17     ret
     18 LFE1477:
     19     .align 1
     20     .align 4
     21 .globl __ZNK1B1fEv
     22     .weak_definition __ZNK1B1fEv
     23 __ZNK1B1fEv:
     24 LFB1478:
     25     pushq   %rbp
     26 LCFI2:
     27     movq    %rsp, %rbp
     28 LCFI3:
     29     movl    $1, %eax
     30     leave
     31     ret
     32 LFE1478:
     33     .text
     34     .align 4,0x90
     35 .globl __Z3accPK1Aj
     36 __Z3accPK1Aj:
     37 LFB1479:
     38     pushq   %rbp
     39 LCFI4:
     40     movq    %rsp, %rbp
     41 LCFI5:
     42     pushq   %r14
     43 LCFI6:
     44     pushq   %r13
     45 LCFI7:
     46     pushq   %r12
     47 LCFI8:
     48     pushq   %rbx
     49 LCFI9:
     50     movq    %rdi, %r12
     51     movl    %esi, %r13d
     52     xorl    %r14d, %r14d
     53     testl   %esi, %esi
     54     je  L8
     55     xorl    %ebx, %ebx
     56     .align 4,0x90
     57 L9:
     58     movq    (%r12), %rax
     59     movq    %r12, %rdi
     60     call    *(%rax)
     61     addl    %eax, %r14d
     62     incl    %ebx
     63     cmpl    %r13d, %ebx
     64     jne L9
     65 L8:
     66     movl    %r14d, %eax
     67     popq    %rbx
     68     popq    %r12
     69     popq    %r13
     70     popq    %r14
     71     leave
     72     ret
     73 LFE1479:
     74     .section __TEXT,__StaticInit,regular,pure_instructions
     75     .align 4
     76 __Z41__static_initialization_and_destruction_0ii:
     77 LFB1649:
     78     pushq   %rbp
     79 LCFI10:
     80     movq    %rsp, %rbp
     81 LCFI11:
     82     decl    %edi
     83     je  L18
     84 L17:
     85     leave
     86     ret
     87     .align 4
     88 L18:
     89     cmpl    $65535, %esi
     90     jne L17
     91     leaq    __ZStL8__ioinit(%rip), %rdi
     92     call    __ZNSt8ios_base4InitC1Ev
     93     movq    ___dso_handle@GOTPCREL(%rip), %rdx
     94     xorl    %esi, %esi
     95     leaq    ___tcf_0(%rip), %rdi
     96     leave
     97     jmp ___cxa_atexit
     98 LFE1649:
     99     .align 4
    100 __GLOBAL__I__Z3accPK1Aj:
    101 LFB1651:
    102     pushq   %rbp
    103 LCFI12:
    104     movq    %rsp, %rbp
    105 LCFI13:
    106     movl    $65535, %esi
    107     movl    $1, %edi
    108     leave
    109     jmp __Z41__static_initialization_and_destruction_0ii
    110 LFE1651:
    111     .text
    112     .align 4,0x90
    113 ___tcf_0:
    114 LFB1650:
    115     pushq   %rbp
    116 LCFI14:
    117     movq    %rsp, %rbp
    118 LCFI15:
    119     leaq    __ZStL8__ioinit(%rip), %rdi
    120     leave
    121     jmp __ZNSt8ios_base4InitD1Ev
    122 LFE1650:
    123     .cstring
    124 LC0:
    125     .ascii "\12\0"
    126     .text
    127     .align 4,0x90
    128 .globl _main
    129 _main:
    130 LFB1480:
    131     pushq   %rbp
    132 LCFI16:
    133     movq    %rsp, %rbp
    134 LCFI17:
    135     pushq   %r14
    136 LCFI18:
    137     pushq   %r13
    138 LCFI19:
    139     pushq   %r12
    140 LCFI20:
    141     pushq   %rbx
    142 LCFI21:
    143     subq    $32, %rsp
    144 LCFI22:
    145     movq    __ZTV1A@GOTPCREL(%rip), %rax
    146     addq    $16, %rax
    147     movq    %rax, -48(%rbp)
    148     movq    __ZTV1B@GOTPCREL(%rip), %rax
    149     addq    $16, %rax
    150     movq    %rax, -64(%rbp)
    151     leaq    -48(%rbp), %r13
    152     movq    %r13, %rdi
    153     call    __ZNK1A1fEv
    154     movl    %eax, %ebx
    155     movl    $1, %r12d
    156     .align 4,0x90
    157 L24:
    158     movq    %r13, %rdi
    159     call    __ZNK1A1fEv
    160     addl    %eax, %ebx
    161     incl    %r12d
    162     cmpl    $20, %r12d
    163     jne L24
    164     movl    %ebx, %esi
    165     movq    __ZSt4cout@GOTPCREL(%rip), %r14
    166     movq    %r14, %rdi
    167     call    __ZNSolsEi
    168     movq    %rax, %rdi
    169     movl    $1, %edx
    170     leaq    LC0(%rip), %rsi
    171     call    __ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
    172     leaq    -64(%rbp), %r13
    173     movq    %r13, %rdi
    174     movq    -64(%rbp), %rax
    175     call    *(%rax)
    176     movl    %eax, %ebx
    177     movb    $1, %r12b
    178     .align 4,0x90
    179 L26:
    180     movq    %r13, %rdi
    181     movq    -64(%rbp), %rax
    182     call    *(%rax)
    183     addl    %eax, %ebx
    184     incl    %r12d
    185     cmpl    $22, %r12d
    186     jne L26
    187     movl    %ebx, %esi
    188     movq    %r14, %rdi
    189     call    __ZNSolsEi
    190     movq    %rax, %rdi
    191     movl    $1, %edx
    192     leaq    LC0(%rip), %rsi
    193     call    __ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l
    194     xorl    %eax, %eax
    195     addq    $32, %rsp
    196     popq    %rbx
    197     popq    %r12
    198     popq    %r13
    199     popq    %r14
    200     leave
    201     ret
    202 LFE1480:
    203 .lcomm __ZStL8__ioinit,1,0
    204 .globl __ZTV1A
    205     .weak_definition __ZTV1A
    206     .section __DATA,__const_coal,coalesced
    207     .align 4
    208 __ZTV1A:
    209     .quad   0
    210     .quad   __ZTI1A
    211     .quad   __ZNK1A1fEv
    212 .globl __ZTI1A
    213     .weak_definition __ZTI1A
    214     .align 4
    215 __ZTI1A:
    216     .quad   __ZTVN10__cxxabiv117__class_type_infoE+16
    217     .quad   __ZTS1A
    218 .globl __ZTS1A
    219     .weak_definition __ZTS1A
    220     .section __TEXT,__const_coal,coalesced
    221 __ZTS1A:
    222     .ascii "1A\0"
    223 .globl __ZTV1B
    224     .weak_definition __ZTV1B
    225     .section __DATA,__const_coal,coalesced
    226     .align 4
    227 __ZTV1B:
    228     .quad   0
    229     .quad   __ZTI1B
    230     .quad   __ZNK1B1fEv
    231 .globl __ZTI1B
    232     .weak_definition __ZTI1B
    233     .align 4
    234 __ZTI1B:
    235     .quad   __ZTVN10__cxxabiv120__si_class_type_infoE+16
    236     .quad   __ZTS1B
    237     .quad   __ZTI1A
    238 .globl __ZTS1B
    239     .weak_definition __ZTS1B
    240     .section __TEXT,__const_coal,coalesced
    241 __ZTS1B:
    242     .ascii "1B\0"
    243     .section __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
    244 EH_frame1:
    245     .set L$set$0,LECIE1-LSCIE1
    246     .long L$set$0
    247 LSCIE1:
    248     .long   0x0
    249     .byte   0x1
    250     .ascii "zPR\0"
    251     .byte   0x1
    252     .byte   0x78
    253     .byte   0x10
    254     .byte   0x6
    255     .byte   0x9b
    256     .long   ___gxx_personality_v0+4@GOTPCREL
    257     .byte   0x10
    258     .byte   0xc
    259     .byte   0x7
    260     .byte   0x8
    261     .byte   0x90
    262     .byte   0x1
    263     .align 3
    264 LECIE1:
    265 .globl __ZNK1A1fEv.eh
    266     .weak_definition __ZNK1A1fEv.eh
    267 __ZNK1A1fEv.eh:
    268 LSFDE1:
    269     .set L$set$1,LEFDE1-LASFDE1
    270     .long L$set$1
    271 LASFDE1:
    272     .long   LASFDE1-EH_frame1
    273     .quad   LFB1477-.
    274     .set L$set$2,LFE1477-LFB1477
    275     .quad L$set$2
    276     .byte   0x0
    277     .byte   0x4
    278     .set L$set$3,LCFI0-LFB1477
    279     .long L$set$3
    280     .byte   0xe
    281     .byte   0x10
    282     .byte   0x86
    283     .byte   0x2
    284     .byte   0x4
    285     .set L$set$4,LCFI1-LCFI0
    286     .long L$set$4
    287     .byte   0xd
    288     .byte   0x6
    289     .align 3
    290 LEFDE1:
    291 .globl __ZNK1B1fEv.eh
    292     .weak_definition __ZNK1B1fEv.eh
    293 __ZNK1B1fEv.eh:
    294 LSFDE3:
    295     .set L$set$5,LEFDE3-LASFDE3
    296     .long L$set$5
    297 LASFDE3:
    298     .long   LASFDE3-EH_frame1
    299     .quad   LFB1478-.
    300     .set L$set$6,LFE1478-LFB1478
    301     .quad L$set$6
    302     .byte   0x0
    303     .byte   0x4
    304     .set L$set$7,LCFI2-LFB1478
    305     .long L$set$7
    306     .byte   0xe
    307     .byte   0x10
    308     .byte   0x86
    309     .byte   0x2
    310     .byte   0x4
    311     .set L$set$8,LCFI3-LCFI2
    312     .long L$set$8
    313     .byte   0xd
    314     .byte   0x6
    315     .align 3
    316 LEFDE3:
    317 .globl __Z3accPK1Aj.eh
    318 __Z3accPK1Aj.eh:
    319 LSFDE5:
    320     .set L$set$9,LEFDE5-LASFDE5
    321     .long L$set$9
    322 LASFDE5:
    323     .long   LASFDE5-EH_frame1
    324     .quad   LFB1479-.
    325     .set L$set$10,LFE1479-LFB1479
    326     .quad L$set$10
    327     .byte   0x0
    328     .byte   0x4
    329     .set L$set$11,LCFI4-LFB1479
    330     .long L$set$11
    331     .byte   0xe
    332     .byte   0x10
    333     .byte   0x86
    334     .byte   0x2
    335     .byte   0x4
    336     .set L$set$12,LCFI5-LCFI4
    337     .long L$set$12
    338     .byte   0xd
    339     .byte   0x6
    340     .byte   0x4
    341     .set L$set$13,LCFI9-LCFI5
    342     .long L$set$13
    343     .byte   0x83
    344     .byte   0x6
    345     .byte   0x8c
    346     .byte   0x5
    347     .byte   0x8d
    348     .byte   0x4
    349     .byte   0x8e
    350     .byte   0x3
    351     .align 3
    352 LEFDE5:
    353 __Z41__static_initialization_and_destruction_0ii.eh:
    354 LSFDE7:
    355     .set L$set$14,LEFDE7-LASFDE7
    356     .long L$set$14
    357 LASFDE7:
    358     .long   LASFDE7-EH_frame1
    359     .quad   LFB1649-.
    360     .set L$set$15,LFE1649-LFB1649
    361     .quad L$set$15
    362     .byte   0x0
    363     .byte   0x4
    364     .set L$set$16,LCFI10-LFB1649
    365     .long L$set$16
    366     .byte   0xe
    367     .byte   0x10
    368     .byte   0x86
    369     .byte   0x2
    370     .byte   0x4
    371     .set L$set$17,LCFI11-LCFI10
    372     .long L$set$17
    373     .byte   0xd
    374     .byte   0x6
    375     .align 3
    376 LEFDE7:
    377 __GLOBAL__I__Z3accPK1Aj.eh:
    378 LSFDE9:
    379     .set L$set$18,LEFDE9-LASFDE9
    380     .long L$set$18
    381 LASFDE9:
    382     .long   LASFDE9-EH_frame1
    383     .quad   LFB1651-.
    384     .set L$set$19,LFE1651-LFB1651
    385     .quad L$set$19
    386     .byte   0x0
    387     .byte   0x4
    388     .set L$set$20,LCFI12-LFB1651
    389     .long L$set$20
    390     .byte   0xe
    391     .byte   0x10
    392     .byte   0x86
    393     .byte   0x2
    394     .byte   0x4
    395     .set L$set$21,LCFI13-LCFI12
    396     .long L$set$21
    397     .byte   0xd
    398     .byte   0x6
    399     .align 3
    400 LEFDE9:
    401 ___tcf_0.eh:
    402 LSFDE11:
    403     .set L$set$22,LEFDE11-LASFDE11
    404     .long L$set$22
    405 LASFDE11:
    406     .long   LASFDE11-EH_frame1
    407     .quad   LFB1650-.
    408     .set L$set$23,LFE1650-LFB1650
    409     .quad L$set$23
    410     .byte   0x0
    411     .byte   0x4
    412     .set L$set$24,LCFI14-LFB1650
    413     .long L$set$24
    414     .byte   0xe
    415     .byte   0x10
    416     .byte   0x86
    417     .byte   0x2
    418     .byte   0x4
    419     .set L$set$25,LCFI15-LCFI14
    420     .long L$set$25
    421     .byte   0xd
    422     .byte   0x6
    423     .align 3
    424 LEFDE11:
    425 .globl _main.eh
    426 _main.eh:
    427 LSFDE13:
    428     .set L$set$26,LEFDE13-LASFDE13
    429     .long L$set$26
    430 LASFDE13:
    431     .long   LASFDE13-EH_frame1
    432     .quad   LFB1480-.
    433     .set L$set$27,LFE1480-LFB1480
    434     .quad L$set$27
    435     .byte   0x0
    436     .byte   0x4
    437     .set L$set$28,LCFI16-LFB1480
    438     .long L$set$28
    439     .byte   0xe
    440     .byte   0x10
    441     .byte   0x86
    442     .byte   0x2
    443     .byte   0x4
    444     .set L$set$29,LCFI17-LCFI16
    445     .long L$set$29
    446     .byte   0xd
    447     .byte   0x6
    448     .byte   0x4
    449     .set L$set$30,LCFI22-LCFI17
    450     .long L$set$30
    451     .byte   0x83
    452     .byte   0x6
    453     .byte   0x8c
    454     .byte   0x5
    455     .byte   0x8d
    456     .byte   0x4
    457     .byte   0x8e
    458     .byte   0x3
    459     .align 3
    460 LEFDE13:
    461     .constructor
    462     .destructor
    463     .align 1
    464     .subsections_via_symbols
    
  • 2020-12-06 05:57

    Here's the required template version:

    struct A { int f() const { return 0; } };
    template<class T>
    struct B { B(T &t) : t(t) { } int f() const { return t.f()+1; } T &t; };
    
    template<class T>
    int acc(const T *p, unsigned int N)
    {
       int result = 0;
    
       for(unsigned int i = 0; i != N; ++i)
         result += p->f();
       return result;
    }
    

    And usage is:

    int main() {
       A a;
       B<A> obj(a);
       int result = acc(&obj, 10);
    }
    
  • 2020-12-06 06:01

    It has been pointed out to me that GCC has an extension, called "bound member functions", that does indeed allow you to store the actual function pointer. Demo:

    struct Foo
    {
        virtual ~Foo() { }
        virtual int f(int, int) = 0;
    };
    
    void f(Foo & x)
    {
        using gcc_func_type = int (*)(Foo *, int, int);
    
        gcc_func_type fp = (gcc_func_type)(x.*&Foo::f);  // !
    
        for ( /* ... */ )
        {
            int result = fp(&x, 10, 20);   // no virtual dispatch!
        }
    }
    

    The syntax requires that you go through a pointer-to-member indirection (i.e. you cannot just write (x.f)), and the cast must be a C-style cast. The resulting function pointer has the type of a pointer to a free function, with the instance argument taken as the first parameter. GCC's documentation says to compile with -Wno-pmf-conversions when using this extension.
