why is c++ std::max_element so slow?

前端 未结 3 2135
失恋的感觉
失恋的感觉 2021-01-31 02:30

I need to find the max element in the vector so I\'m using std::max_element, but I\'ve found that it\'s a very slow function, so I wrote my own version and manage t

相关标签:
3条回答
  • 2021-01-31 03:25

    You are probably running your test in 64-bit mode, where sizeof(int) == 4, but sizeof(std::vector<>::iterator) == 8, so that assignment in the loop to int (what my_max_element does) is faster than to std::vector<>::iterator (this is what std::max_element does).

    If you change std::vector<int> to std::vector<long> results change in favour to std::max_element:

    MaxIter = 1000000012
    MaxArray = 1000000012
    Total CPU time iterator = 0.00429082
    Total CPU time array = 0.00572205
    iter/array ratio: = 0.749875
    

    One important note: when benchmarking disable the CPU frequency scaling, so that the CPU does not switch gears in the middle of the benchmark.


    But I think something else is at play here, since just changing the loop variable from int to long does not change the results...

    0 讨论(0)
  • 2021-01-31 03:26

    Before voting on this answer, please test (and verify) this on your machine and comment/add the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes but only one posted results, and these results did not show the effect described below (though obtained with a different test code, see comments).


    There seems to be an optimizer bug/artifact. Compare the times of:

    template<typename _ForwardIterator, typename _Compare>
    _ForwardIterator
    my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
    _Compare __comp)
    {
      if (__first == __last) return __first;
      _ForwardIterator __result = __first;
    
      while(++__first != __last)
        if (__comp(__result, __first))
          __result = __first;
    
      return __result;
    }
    
    template<typename _ForwardIterator, typename _Compare>
    _ForwardIterator
    my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
    _Compare __comp)
    {
      if (__first == __last) return __first;
      _ForwardIterator __result = __first;
      ++__first;
    
      for(; __first != __last; ++__first)
        if (__comp(__result, __first))
          __result = __first;
    
      return __result;
    }
    

    The first is the original libstdc++ implementation, the second one should be a transformation without any changes in behaviour or requirements. Clang++ produces very similar run times for those two functions, whereas g++4.8.2 is four times faster with the second version.


    Following Maxim's proposal, changing the vector from int to int64_t, the changed version is not 4, but only 1.7 times faster than the original version (g++4.8.2).


    The difference is in predictive commoning of *result, that is, storing the value of the current max element so that it does not have to be reloaded from memory each time. This gives a far cleaner cache access pattern:

    w/o commoning     with commoning
    *                 *
    **                 *
     **                 *
      **                 *
      * *                 *
      *  *                 *
      *   *                 *
    

    Here's the asm for comparison (rdi/rsi contain the first/last iterators respectively):

    With the while loop (2.88743 ms; gist):

        movq    %rdi, %rax
        jmp .L49
    .L51:
        movl    (%rdi), %edx
        cmpl    %edx, (%rax)
        cmovl   %rdi, %rax
    .L49:
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        jne .L51
    

    With the for loop (1235.55 μs):

        leaq    4(%rdi), %rdx
        movq    %rdi, %rax
        cmpq    %rsi, %rdx
        je  .L53
        movl    (%rdi), %ecx
    .L54:
        movl    (%rdx), %r8d
        cmpl    %r8d, %ecx
        cmovl   %rdx, %rax
        cmovl   %r8d, %ecx
        addq    $4, %rdx
        cmpq    %rdx, %rsi
        jne .L54
    .L53:
    

    If I force commoning by explicitly storing *result into a variable prev at the start and whenever result is updated, and using prev instead of *result in the comparison, I get an even faster loop (377.601 μs):

        movl    (%rdi), %ecx
        movq    %rdi, %rax
    .L57:
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        je  .L60
    .L59:
        movl    (%rdi), %edx
        cmpl    %edx, %ecx
        jge .L57
        movq    %rdi, %rax
        addq    $4, %rdi
        movl    %edx, %ecx
        cmpq    %rsi, %rdi
        jne .L59
    .L60:
    

    The reason this is faster than the for loop is that the conditional moves (cmovl) in the above are a pessimisation as they are executed so rarely (Linus says that cmov is only a good idea if the branch is unpredictable). Note that for randomly distributed data the branch is expected to be taken Hn times, which is a negligible proportion (Hn grows logarithmically, so Hn/n rapidly approaches 0). The conditional-move code will only be better on pathological data e.g. [1, 0, 3, 2, 5, 4, ...].

    0 讨论(0)
  • 2021-01-31 03:30

    It's a simple issue of cache. To wit, the first time you load memory, in this case the contents of the vector, it's always considerably slower than if it's been recently accessed. I copied and pasted your code with GCC 4.9.

    When the functions are reversed, the ratio is 1. When they're in the original order, the ratio is 1.6.

    This still seems like a fundamental misoptimization by GCC in the case of max_element to me. However, your function times are so low, they will be dominated by CPU noise like the above cache effects, instead of any meaningful comparison.

    Reversed, Original

    0 讨论(0)
提交回复
热议问题