Why is std::pair faster than std::tuple

温柔的废话 2020-12-12 23:50

Here is the code for testing.

Tuple test:

#include <vector>
#include <tuple>

using namespace std;

int main(){

    vector<tuple<int,int>> v;

    // The loop bound and body were cut off in the post; presumably the
    // benchmark fills the vector with a large number of elements like this:
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_tuple(var, var));
    }
}
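
Pair test (this snippet was also cut off in the post; presumably it is the analogous program using std::pair, sketched under the same assumptions as above):

#include <vector>
#include <utility>

using namespace std;

int main(){

    vector<pair<int,int>> v;

    // Same assumed loop count as in the tuple test above.
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_pair(var, var));
    }
}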


        
2 Answers
  •  囚心锁ツ
    2020-12-13 00:39

    You are missing some crucial information: What compiler do you use? What do you use to measure the performance of the microbenchmark? What standard library implementation do you use?

    My system:

    g++ (GCC) 4.9.1 20140903 (prerelease)
    GLIBCXX_3.4.20
    

    Anyhow, I ran your examples, but reserved the proper size of the vectors first to get rid of the memory allocation overhead.
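
    The change amounts to roughly one extra line before the fill loop of both test programs (a sketch; N stands for whatever iteration count the benchmark uses):

    v.reserve(N);   // N = assumed element count; allocate the storage once instead of growing it on every push_back

    With that, funnily enough, I observe the opposite of what you see: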

    g++ -std=c++11 -O2 pair.cpp -o pair
    perf stat -r 10 -d ./pair
    Performance counter stats for './pair' (10 runs):
    
          1647.045151      task-clock:HG (msec)      #    0.993 CPUs utilized            ( +-  1.94% )
                  346      context-switches:HG       #    0.210 K/sec                    ( +- 40.13% )
                    7      cpu-migrations:HG         #    0.004 K/sec                    ( +- 22.01% )
              182,978      page-faults:HG            #    0.111 M/sec                    ( +-  0.04% )
        3,394,685,602      cycles:HG                 #    2.061 GHz                      ( +-  2.24% ) [44.38%]
        2,478,474,676      stalled-cycles-frontend:HG #   73.01% frontend cycles idle     ( +-  1.24% ) [44.55%]
        1,550,747,174      stalled-cycles-backend:HG #   45.68% backend  cycles idle     ( +-  1.60% ) [44.66%]
        2,837,484,461      instructions:HG           #    0.84  insns per cycle        
                                                      #    0.87  stalled cycles per insn  ( +-  4.86% ) [55.78%]
          526,077,681      branches:HG               #  319.407 M/sec                    ( +-  4.52% ) [55.82%]
              829,623      branch-misses:HG          #    0.16% of all branches          ( +-  4.42% ) [55.74%]
          594,396,822      L1-dcache-loads:HG        #  360.887 M/sec                    ( +-  4.74% ) [55.59%]
            20,842,113      L1-dcache-load-misses:HG  #    3.51% of all L1-dcache hits    ( +-  0.68% ) [55.46%]
            5,474,166      LLC-loads:HG              #    3.324 M/sec                    ( +-  1.81% ) [44.23%]
            LLC-load-misses:HG       
    
          1.658671368 seconds time elapsed                                          ( +-  1.82% )
    

    versus:

    g++ -std=c++11 -O2 tuple.cpp -o tuple
    perf stat -r 10 -d ./tuple
    Performance counter stats for './tuple' (10 runs):
    
            996.090514      task-clock:HG (msec)      #    0.996 CPUs utilized            ( +-  2.41% )
                  102      context-switches:HG       #    0.102 K/sec                    ( +- 64.61% )
                    4      cpu-migrations:HG         #    0.004 K/sec                    ( +- 32.24% )
              181,701      page-faults:HG            #    0.182 M/sec                    ( +-  0.06% )
        2,052,505,223      cycles:HG                 #    2.061 GHz                      ( +-  2.22% ) [44.45%]
        1,212,930,513      stalled-cycles-frontend:HG #   59.10% frontend cycles idle     ( +-  2.94% ) [44.56%]
          621,104,447      stalled-cycles-backend:HG #   30.26% backend  cycles idle     ( +-  3.48% ) [44.69%]
        2,700,410,991      instructions:HG           #    1.32  insns per cycle        
                                                      #    0.45  stalled cycles per insn  ( +-  1.66% ) [55.94%]
          486,476,408      branches:HG               #  488.386 M/sec                    ( +-  1.70% ) [55.96%]
              959,651      branch-misses:HG          #    0.20% of all branches          ( +-  4.78% ) [55.82%]
          547,000,119      L1-dcache-loads:HG        #  549.147 M/sec                    ( +-  2.19% ) [55.67%]
            21,540,926      L1-dcache-load-misses:HG  #    3.94% of all L1-dcache hits    ( +-  2.73% ) [55.43%]
            5,751,650      LLC-loads:HG              #    5.774 M/sec                    ( +-  3.60% ) [44.21%]
            LLC-load-misses:HG       
    
          1.000126894 seconds time elapsed                                          ( +-  2.47% )
    

    As you can see, in my case the reason is the much higher number of stalled cycles, both in the frontend and in the backend.

    Now where does this come from? I bet it comes down to some failed inlining, similar to what is explained here: std::vector performance regression when enabling C++11
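
    One rough way to check such a guess is to dump the generated assembly for both programs and compare how many out-of-line calls survive, for example:

    g++ -std=c++11 -O2 -S pair.cpp -o pair.s
    g++ -std=c++11 -O2 -S tuple.cpp -o tuple.s
    grep -c 'call' pair.s tuple.s    # crude proxy: more call instructions means less inlining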

    Indeed, enabling -flto equalizes the results for me:
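
    The earlier command lines, presumably, with just -flto added:

    g++ -std=c++11 -O2 -flto pair.cpp -o pair
    g++ -std=c++11 -O2 -flto tuple.cpp -o tuple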

    Performance counter stats for './pair' (10 runs):
    
          1021.922944      task-clock:HG (msec)      #    0.997 CPUs utilized            ( +-  1.15% )
                    63      context-switches:HG       #    0.062 K/sec                    ( +- 77.23% )
                    5      cpu-migrations:HG         #    0.005 K/sec                    ( +- 34.21% )
              195,396      page-faults:HG            #    0.191 M/sec                    ( +-  0.00% )
        2,109,877,147      cycles:HG                 #    2.065 GHz                      ( +-  0.92% ) [44.33%]
        1,098,031,078      stalled-cycles-frontend:HG #   52.04% frontend cycles idle     ( +-  0.93% ) [44.46%]
          701,553,535      stalled-cycles-backend:HG #   33.25% backend  cycles idle     ( +-  1.09% ) [44.68%]
        3,288,420,630      instructions:HG           #    1.56  insns per cycle        
                                                      #    0.33  stalled cycles per insn  ( +-  0.88% ) [55.89%]
          672,941,736      branches:HG               #  658.505 M/sec                    ( +-  0.80% ) [56.00%]
              660,278      branch-misses:HG          #    0.10% of all branches          ( +-  2.05% ) [55.93%]
          474,314,267      L1-dcache-loads:HG        #  464.139 M/sec                    ( +-  1.32% ) [55.73%]
            19,481,787      L1-dcache-load-misses:HG  #    4.11% of all L1-dcache hits    ( +-  0.80% ) [55.51%]
            5,155,678      LLC-loads:HG              #    5.045 M/sec                    ( +-  1.69% ) [44.21%]
            LLC-load-misses:HG       
    
          1.025083895 seconds time elapsed                                          ( +-  1.03% )
    

    and for tuple:

    Performance counter stats for './tuple' (10 runs):
    
          1018.980969      task-clock:HG (msec)      #    0.999 CPUs utilized            ( +-  0.47% )
                    8      context-switches:HG       #    0.008 K/sec                    ( +- 29.74% )
                    3      cpu-migrations:HG         #    0.003 K/sec                    ( +- 42.64% )
              195,396      page-faults:HG            #    0.192 M/sec                    ( +-  0.00% )
        2,103,574,740      cycles:HG                 #    2.064 GHz                      ( +-  0.30% ) [44.28%]
        1,088,827,212      stalled-cycles-frontend:HG #   51.76% frontend cycles idle     ( +-  0.47% ) [44.56%]
          697,438,071      stalled-cycles-backend:HG #   33.15% backend  cycles idle     ( +-  0.41% ) [44.76%]
        3,305,631,646      instructions:HG           #    1.57  insns per cycle        
                                                      #    0.33  stalled cycles per insn  ( +-  0.21% ) [55.94%]
          675,175,757      branches:HG               #  662.599 M/sec                    ( +-  0.16% ) [56.02%]
              656,205      branch-misses:HG          #    0.10% of all branches          ( +-  0.98% ) [55.93%]
          475,532,976      L1-dcache-loads:HG        #  466.675 M/sec                    ( +-  0.13% ) [55.69%]
            19,430,992      L1-dcache-load-misses:HG  #    4.09% of all L1-dcache hits    ( +-  0.20% ) [55.49%]
            5,161,624      LLC-loads:HG              #    5.065 M/sec                    ( +-  0.47% ) [44.14%]
            LLC-load-misses:HG       
    
          1.020225388 seconds time elapsed                                          ( +-  0.48% )
    

    So remember: -flto is your friend, and failed inlining can have extreme effects on heavily templated code. Use perf stat to find out what's happening.
