why is c++ std::max_element so slow?

前端未结

关注

 3  2115

失恋的感觉 2021-01-31 02:30

I need to find the max element in the vector so I\'m using std::max_element, but I\'ve found that it\'s a very slow function, so I wrote my own version and manage t

3条回答

生来不讨喜 (楼主)

2021-01-31 03:26
Before voting on this answer, please test (and verify) this on your machine and comment/add the results. Note that I used a vector size of 1000*1000*1000 for my tests. Currently, this answer has 19 upvotes but only one posted results, and these results did not show the effect described below (though obtained with a different test code, see comments).

There seems to be an optimizer bug/artifact. Compare the times of:
```
template
_ForwardIterator
my_max_element_orig(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;

  while(++__first != __last)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}

template
_ForwardIterator
my_max_element_changed(_ForwardIterator __first, _ForwardIterator __last,
_Compare __comp)
{
  if (__first == __last) return __first;
  _ForwardIterator __result = __first;
  ++__first;

  for(; __first != __last; ++__first)
    if (__comp(__result, __first))
      __result = __first;

  return __result;
}
```
The first is the original libstdc++ implementation, the second one should be a transformation without any changes in behaviour or requirements. Clang++ produces very similar run times for those two functions, whereas g++4.8.2 is four times faster with the second version.

Following Maxim's proposal, changing the vector from int to int64_t, the changed version is not 4, but only 1.7 times faster than the original version (g++4.8.2).

The difference is in predictive commoning of *result, that is, storing the value of the current max element so that it does not have to be reloaded from memory each time. This gives a far cleaner cache access pattern:
```
w/o commoning     with commoning
*                 *
**                 *
 **                 *
  **                 *
  * *                 *
  *  *                 *
  *   *                 *
```
Here's the asm for comparison (rdi/rsi contain the first/last iterators respectively):

With the while loop (2.88743 ms; gist):
```
    movq    %rdi, %rax
    jmp .L49
.L51:
    movl    (%rdi), %edx
    cmpl    %edx, (%rax)
    cmovl   %rdi, %rax
.L49:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    jne .L51
```
With the for loop (1235.55 μs):
```
    leaq    4(%rdi), %rdx
    movq    %rdi, %rax
    cmpq    %rsi, %rdx
    je  .L53
    movl    (%rdi), %ecx
.L54:
    movl    (%rdx), %r8d
    cmpl    %r8d, %ecx
    cmovl   %rdx, %rax
    cmovl   %r8d, %ecx
    addq    $4, %rdx
    cmpq    %rdx, %rsi
    jne .L54
.L53:
```
If I force commoning by explicitly storing *result into a variable prev at the start and whenever result is updated, and using prev instead of *result in the comparison, I get an even faster loop (377.601 μs):
```
    movl    (%rdi), %ecx
    movq    %rdi, %rax
.L57:
    addq    $4, %rdi
    cmpq    %rsi, %rdi
    je  .L60
.L59:
    movl    (%rdi), %edx
    cmpl    %edx, %ecx
    jge .L57
    movq    %rdi, %rax
    addq    $4, %rdi
    movl    %edx, %ecx
    cmpq    %rsi, %rdi
    jne .L59
.L60:
```
The reason this is faster than the for loop is that the conditional moves (cmovl) in the above are a pessimisation as they are executed so rarely (Linus says that cmov is only a good idea if the branch is unpredictable). Note that for randomly distributed data the branch is expected to be taken Hn times, which is a negligible proportion (H_n grows logarithmically, so H_n/n rapidly approaches 0). The conditional-move code will only be better on pathological data e.g. [1, 0, 3, 2, 5, 4, ...].
0 讨论(0)

查看其它3个回答
发布评论:

提交评论
- 加载中...