why is c++ std::max_element so slow?

前端 未结 3 2115
失恋的感觉
失恋的感觉 2021-01-31 02:30

I need to find the max element in the vector so I\'m using std::max_element, but I\'ve found that it\'s a very slow function, so I wrote my own version and manage t

3条回答
  •  生来不讨喜
    2021-01-31 03:26

    Before voting on this answer, please test (and verify) this on your machine and comment/add the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes but only one posted results, and these results did not show the effect described below (though obtained with a different test code, see comments).


    There seems to be an optimizer bug/artifact. Compare the times of:

    template
    _ForwardIterator
    my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
    _Compare __comp)
    {
      if (__first == __last) return __first;
      _ForwardIterator __result = __first;
    
      while(++__first != __last)
        if (__comp(__result, __first))
          __result = __first;
    
      return __result;
    }
    
    template
    _ForwardIterator
    my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
    _Compare __comp)
    {
      if (__first == __last) return __first;
      _ForwardIterator __result = __first;
      ++__first;
    
      for(; __first != __last; ++__first)
        if (__comp(__result, __first))
          __result = __first;
    
      return __result;
    }
    

    The first is the original libstdc++ implementation, the second one should be a transformation without any changes in behaviour or requirements. Clang++ produces very similar run times for those two functions, whereas g++4.8.2 is four times faster with the second version.


    Following Maxim's proposal, changing the vector from int to int64_t, the changed version is not 4, but only 1.7 times faster than the original version (g++4.8.2).


    The difference is in predictive commoning of *result, that is, storing the value of the current max element so that it does not have to be reloaded from memory each time. This gives a far cleaner cache access pattern:

    w/o commoning     with commoning
    *                 *
    **                 *
     **                 *
      **                 *
      * *                 *
      *  *                 *
      *   *                 *
    

    Here's the asm for comparison (rdi/rsi contain the first/last iterators respectively):

    With the while loop (2.88743 ms; gist):

        movq    %rdi, %rax
        jmp .L49
    .L51:
        movl    (%rdi), %edx
        cmpl    %edx, (%rax)
        cmovl   %rdi, %rax
    .L49:
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        jne .L51
    

    With the for loop (1235.55 μs):

        leaq    4(%rdi), %rdx
        movq    %rdi, %rax
        cmpq    %rsi, %rdx
        je  .L53
        movl    (%rdi), %ecx
    .L54:
        movl    (%rdx), %r8d
        cmpl    %r8d, %ecx
        cmovl   %rdx, %rax
        cmovl   %r8d, %ecx
        addq    $4, %rdx
        cmpq    %rdx, %rsi
        jne .L54
    .L53:
    

    If I force commoning by explicitly storing *result into a variable prev at the start and whenever result is updated, and using prev instead of *result in the comparison, I get an even faster loop (377.601 μs):

        movl    (%rdi), %ecx
        movq    %rdi, %rax
    .L57:
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        je  .L60
    .L59:
        movl    (%rdi), %edx
        cmpl    %edx, %ecx
        jge .L57
        movq    %rdi, %rax
        addq    $4, %rdi
        movl    %edx, %ecx
        cmpq    %rsi, %rdi
        jne .L59
    .L60:
    

    The reason this is faster than the for loop is that the conditional moves (cmovl) in the above are a pessimisation as they are executed so rarely (Linus says that cmov is only a good idea if the branch is unpredictable). Note that for randomly distributed data the branch is expected to be taken Hn times, which is a negligible proportion (Hn grows logarithmically, so Hn/n rapidly approaches 0). The conditional-move code will only be better on pathological data e.g. [1, 0, 3, 2, 5, 4, ...].

提交回复
热议问题