Why is `std::copy` 5x (!) slower than `memcpy` for reading one int from a char buffer, in my test program?


This is a follow-up to this question where I posted this program:

#include 
#include 
#include 
#include


                      
6 answers

不思量自难忘°, 2021-02-05 15:15

Looks to me like the answer is that gcc can optimize these particular calls to memmove and memcpy, but not std::copy. gcc is aware of the semantics of memmove and memcpy, and in this case can take advantage of the fact that the size is known (sizeof(int)) to turn the call into a single mov instruction.

std::copy is implemented in terms of memcpy, but apparently the gcc optimizer doesn't manage to figure out that data + sizeof(int) - data is exactly sizeof(int). So the std::copy benchmark ends up making a real call to memcpy instead of executing a single mov.

I got all that by invoking gcc with -S and flipping quickly through the output; I could easily have gotten it wrong, but what I saw seems consistent with your measurements.
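To make the comparison concrete, here is a minimal sketch of the two ways of reading one int from a char buffer. The function names read_with_memcpy and read_with_std_copy are hypothetical and do not come from the original program; compiling this with g++ -O3 -S and comparing the two function bodies is one way to check whether your compiler turns both into a single load.

```cpp
#include <algorithm>
#include <cstring>

// Hypothetical helper: the size argument is a compile-time constant,
// so the compiler may replace the call with a single load.
int read_with_memcpy(const char* data) {
    int value;
    std::memcpy(&value, data, sizeof(int));
    return value;
}

// Hypothetical helper: copies sizeof(int) chars into the bytes of value.
// The length is only implied by the iterator difference
// (data + sizeof(int)) - data.
int read_with_std_copy(const char* data) {
    int value;
    std::copy(data, data + sizeof(int), reinterpret_cast<char*>(&value));
    return value;
}
```

Both functions return the same value for the same buffer; any difference is purely in the generated code.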

By the way, I think the test is more or less meaningless. A more plausible real-world test might be creating an actual vector<int> src and an int[N] dst, and then comparing memcpy(dst, src.data(), sizeof(int)*src.size()) with std::copy(src.begin(), src.end(), dst).
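A rough sketch of the two calls suggested above, wrapped in hypothetical helper functions (fill_with_memcpy and fill_with_std_copy are illustrative names, and the capacity parameter is an assumption added so the bounds can be checked):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: dst must have room for at least src.size() ints.
void fill_with_memcpy(int* dst, std::size_t capacity,
                      const std::vector<int>& src) {
    assert(src.size() <= capacity);
    if (!src.empty())
        std::memcpy(dst, src.data(), sizeof(int) * src.size());
}

// Hypothetical helper: same copy expressed with iterators.
// The output iterator is the array itself (decayed to int*).
void fill_with_std_copy(int* dst, std::size_t capacity,
                        const std::vector<int>& src) {
    assert(src.size() <= capacity);
    std::copy(src.begin(), src.end(), dst);
}
```

Timing these two over a large vector would compare bulk copies rather than a single 4-byte read.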
                                                                        
                                                        
            
            
              
                
离开以前, 2021-02-05 15:15

According to the assembler output of G++ 4.8.1, test_memcpy compiles to:

movl    (%r15), %r15d


test_std_copy:

movl    $4, %edx
movq    %r15, %rsi
leaq    16(%rsp), %rdi
call    memcpy


As you can see, std::copy successfully recognized that it can copy the data with memcpy, but for some reason further inlining did not happen, which is the reason for the performance difference.

By the way, Clang 3.4 produces identical code for both cases:

movl    (%r14,%rbx), %ebp
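The dispatch described above, where std::copy recognizes that a byte-wise copy is enough, can be modeled roughly as follows. This is a simplified sketch and not libstdc++'s actual code: copy_sketch and copy_impl are hypothetical names, it only handles raw pointers, and it assumes the library falls back to a memmove-style copy for trivially copyable types.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Trivially copyable element type: one bulk byte copy.
template <typename T>
T* copy_impl(const T* first, const T* last, T* out, std::true_type) {
    const std::size_t count = static_cast<std::size_t>(last - first);
    if (count != 0)
        std::memmove(out, first, count * sizeof(T));
    return out + count;
}

// Any other element type: element-by-element assignment.
template <typename T>
T* copy_impl(const T* first, const T* last, T* out, std::false_type) {
    for (; first != last; ++first, ++out)
        *out = *first;
    return out;
}

// Hypothetical entry point: picks an overload at compile time.
template <typename T>
T* copy_sketch(const T* first, const T* last, T* out) {
    return copy_impl(first, last, out,
                     typename std::is_trivially_copyable<T>::type());
}
```

In this model the byte count is computed at run time from last - first, so whether the library call collapses into a single mov depends on the optimizer propagating the constant through the wrapper.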

                                                                        
                                                        
            
            
              
                
旧巷少年郎, 2021-02-05 15:24

I agree with @rici's comment about developing a more meaningful benchmark, so I rewrote your test to benchmark copying one vector into another using memcpy(), memmove(), std::copy() and the std::vector assignment operator:

#include <algorithm>
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <cstring>
#include <cassert>
#include <functional>

typedef std::vector<int> vector_type;

void test_memcpy(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer       const dest = destv.data();
    vector_type::const_pointer const src  = srcv.data();

    std::memcpy(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_memmove(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer       const dest = destv.data();
    vector_type::const_pointer const src  = srcv.data();

    std::memmove(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_std_copy(vector_type & dest, vector_type const & src)
{
    std::copy(src.begin(), src.end(), dest.begin());
}

void test_assignment(vector_type & dest, vector_type const & src)
{
    dest = src;
}

auto
benchmark(std::function<void(vector_type &, vector_type const &)> copy_func)
    ->decltype(std::chrono::milliseconds().count())
{
    std::random_device rd;
    std::mt19937 generator(rd());
    std::uniform_int_distribution<vector_type::value_type> distribution;

    static vector_type::size_type const num_elems = 2000;

    vector_type dest(num_elems);
    vector_type src(num_elems);

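    // Append num_elems random values to each vector. Both vectors already
    // hold num_elems zero-initialized elements from their construction above,
    // so each ends up with 2 * num_elems elements.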
    for (vector_type::size_type i = 0; i < num_elems; ++i) {
        src.push_back(distribution(generator));
        dest.push_back(distribution(generator));
    }

    static int const iterations = 50000;

    std::chrono::time_point<std::chrono::system_clock> start, end;

    start = std::chrono::system_clock::now();

    for (int i = 0; i != iterations; ++i)
        copy_func(dest, src);

    end = std::chrono::system_clock::now();

    assert(src == dest);

    return
        std::chrono::duration_cast<std::chrono::milliseconds>(
            end - start).count();
}

int main()
{
    std::cout
        << "memcpy:     " << benchmark(test_memcpy)     << " ms" << std::endl
        << "memmove:    " << benchmark(test_memmove)    << " ms" << std::endl
        << "std::copy:  " << benchmark(test_std_copy)   << " ms" << std::endl
        << "assignment: " << benchmark(test_assignment) << " ms" << std::endl
        << std::endl;
}


I went a little overboard with C++11 just for fun.

Here are the results I get on my 64-bit Ubuntu box with g++ 4.6.3:

$ g++ -O3 -std=c++0x foo.cpp ; ./a.out 
memcpy:     33 ms
memmove:    33 ms
std::copy:  33 ms
assignment: 34 ms


The results are all quite comparable! I also get comparable times in all test cases when I change the element type of the vector, e.g. to long long.

Unless my benchmark rewrite is broken, it looks like your own benchmark isn't performing a valid comparison.  HTH!
                                                                        
                                                        
            
            
              
                
温柔的废话, 2021-02-05 15:26

EDIT: I am leaving this answer for reference; the odd timings with gcc seem to be an artifact of "code alignment" (see comments).



I was about to say that this was an implementation glitch in gcc 4 at the time, but it might be more complicated than that.
My results (using 20000/20000 for the counters) are:

$ g++ -Ofast a.cpp; ./a.out
cast:      24 ms
memcpy:    47 ms
memmove:   24 ms
std::copy: 24 ms
(counter:  1787289600)

$ g++ -O3 a.cpp; ./a.out
cast:      24 ms
memcpy:    24 ms
memmove:   24 ms
std::copy: 47 ms
(counter:  1787289600)


$ g++ --version
g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008


Notice how the std::copy and memcpy results swap between -O3 and -Ofast. Also, memmove is not slower than either of them.

In clang the results are simpler:

$ clang++ -O3 a.cpp; ./a.out
cast:      26 ms
memcpy:    26 ms
memmove:   26 ms
std::copy: 26 ms
(counter:  1787289600)

$ clang++ -Ofast a.cpp; ./a.out
cast:      26 ms
memcpy:    26 ms
memmove:   26 ms
std::copy: 26 ms
(counter:  1787289600)


$ clang++ --version
clang version 9.0.0-2 (tags/RELEASE_900/final)


perf results: https://pastebin.com/BZCZiAWQ
                                                                        
                                                        
            
            
              
                
暗喜, 2021-02-05 15:27

memcpy and std::copy each have their uses. std::copy should (as pointed out by Cheers below) be as slow as memmove, because there is no guarantee that the memory regions will not overlap. Because std::copy works on iterators, it can also copy non-contiguous sequences very easily (think of sparsely allocated structures such as linked lists, or custom classes/structures that implement iterators). memcpy only works on contiguous regions and as such can be heavily optimized.
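As a small illustration of that difference, the sketch below copies out of a std::list, whose nodes are not contiguous, with std::copy, and copies a contiguous std::vector<int> with memcpy. The helper names copy_from_list and copy_with_memcpy are hypothetical.

```cpp
#include <algorithm>
#include <cstring>
#include <list>
#include <vector>

// Hypothetical helper: std::copy walks the list node by node through
// its iterators, so the source does not need to be contiguous.
std::vector<int> copy_from_list(const std::list<int>& nodes) {
    std::vector<int> out(nodes.size());
    std::copy(nodes.begin(), nodes.end(), out.begin());
    return out;
}

// Hypothetical helper: memcpy needs one contiguous block of bytes,
// which std::vector<int> provides through data().
std::vector<int> copy_with_memcpy(const std::vector<int>& src) {
    std::vector<int> out(src.size());
    if (!src.empty())
        std::memcpy(out.data(), src.data(), src.size() * sizeof(int));
    return out;
}
```

There is no memcpy equivalent of copy_from_list, since the list elements live in separately allocated nodes.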
                                                                        
                                                        
            
            
              
                
春和景丽, 2021-02-05 15:30

Those are not the results I get:

> g++ -O3 XX.cpp 
> ./a.out
cast:      5 ms
memcpy:    4 ms
std::copy: 3 ms
(counter:  1264720400)

Hardware: 2GHz Intel Core i7
Memory:   8G 1333 MHz DDR3
OS:       Mac OS X 10.7.5
Compiler: i686-apple-darwin11-llvm-g++-4.2 (GCC) 4.2.1


On a Linux box I get different results:

> g++ -std=c++0x -O3 XX.cpp 
> ./a.out 
cast:      3 ms
memcpy:    4 ms
std::copy: 21 ms
(counter:  731359744)


Hardware:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Memory:    61363780 kB
OS:        Linux ip-10-58-154-83 3.2.0-29-virtual #46-Ubuntu SMP
Compiler:  g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

                                                                        
                                                        
            
            
              
                