Why is `std::copy` 5x (!) slower than `memcpy` for reading one int from a char buffer, in my test program?

名媛妹妹 2021-02-05 14:31

This is a follow-up to this question where I posted this program:

#include 
#include 
#include 
#include 

        
6 Answers
  • 2021-02-05 15:15

    Looks to me like the answer is that gcc can optimize these particular calls to memmove and memcpy, but not std::copy. gcc is aware of the semantics of memmove and memcpy, and in this case can take advantage of the fact that the size is known (sizeof(int)) to turn the call into a single mov instruction.

    std::copy is implemented in terms of memcpy, but apparently the gcc optimizer doesn't manage to figure out that data + sizeof(int) - data is exactly sizeof(int). So the std::copy benchmark ends up making a real call to memcpy instead of a single mov.
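For reference, the two reads being compared can be sketched as follows. The original program is not reproduced here, so the names `read_int_memcpy`, `read_int_std_copy` and `reads_agree` are assumptions made for illustration, not the asker's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>

// Read one int from a char buffer with memcpy.
// The byte count is the compile-time constant sizeof(int).
int read_int_memcpy(const char* data) {
    assert(data != nullptr);
    int result;
    std::memcpy(&result, data, sizeof(int));
    return result;
}

// The same read with std::copy. The length is expressed as the iterator
// range [data, data + sizeof(int)), so the optimizer has to fold the
// pointer difference back to sizeof(int) to see a constant size.
int read_int_std_copy(const char* data) {
    assert(data != nullptr);
    int result;
    std::copy(data, data + sizeof(int), reinterpret_cast<char*>(&result));
    return result;
}

// Sanity check (hypothetical helper): both reads return the stored value.
bool reads_agree(int value) {
    char buffer[sizeof(int)];
    std::memcpy(buffer, &value, sizeof(int));
    return read_int_memcpy(buffer) == value &&
           read_int_std_copy(buffer) == value;
}
```

Both functions return the same value; the only question the benchmark raises is how well a given compiler inlines each form.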

    I got all that by invoking gcc with -S and flipping quickly through the output; I could easily have gotten it wrong, but what I saw seems consistent with your measurements.

    By the way, I think the test is more or less meaningless. A more plausible real-world test might be creating an actual vector<int> src and an int dst[N], and then comparing memcpy(dst, src.data(), sizeof(int)*src.size()) with std::copy(src.begin(), src.end(), dst).
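A minimal sketch of that more realistic comparison, assuming an illustrative array size `N` of 1000 and hypothetical function names. It only shows the two copies side by side; it does not time them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t N = 1000;  // illustrative size, not from the question

// Copy the vector into the array as raw bytes.
void copy_with_memcpy(const std::vector<int>& src, int (&dst)[N]) {
    assert(src.size() <= N);
    if (!src.empty())  // avoid passing a possibly null data() to memcpy
        std::memcpy(dst, src.data(), sizeof(int) * src.size());
}

// Copy the vector into the array element by element via iterators.
void copy_with_std_copy(const std::vector<int>& src, int (&dst)[N]) {
    assert(src.size() <= N);
    std::copy(src.begin(), src.end(), dst);
}
```

Note that the destination is passed as `dst`, which decays to `int*`; `&dst` would be a pointer to the whole array and would not compile as an output iterator for `int` elements.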

  • 2021-02-05 15:15

    According to assembler output of G++ 4.8.1, test_memcpy:

    movl    (%r15), %r15d
    

    test_std_copy:

    movl    $4, %edx
    movq    %r15, %rsi
    leaq    16(%rsp), %rdi
    call    memcpy
    

    As you can see, std::copy successfully recognized that it can copy the data with memcpy, but for some reason the memcpy call was not inlined any further, and that is the reason for the performance difference.

    By the way, Clang 3.4 produces identical code for both cases:

    movl    (%r14,%rbx), %ebp
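Since the memcpy form with a constant size is the one shown above collapsing to a single load, one way to keep that form in one place is a small typed wrapper. This is a sketch only: `load_from_bytes` is a hypothetical helper, and whether a particular compiler inlines it is an assumption to verify by inspecting the -S output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper: read a T stored at byte `offset` of `buffer`.
// sizeof(T) is a compile-time constant, which is the property that lets
// the optimizer replace the memcpy call with a plain load.
template <typename T>
T load_from_bytes(const char* buffer, std::size_t size, std::size_t offset) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    assert(offset <= size && sizeof(T) <= size - offset);
    T value;
    std::memcpy(&value, buffer + offset, sizeof(T));
    return value;
}
```

Because it copies bytes rather than casting the pointer, the read stays well defined even when `buffer + offset` is not suitably aligned for `T`.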
    
  • 2021-02-05 15:24

    I agree with @rici's comment about developing a more meaningful benchmark so I rewrote your test to benchmark copying of two vectors using memcpy(), memmove(), std::copy() and the std::vector assignment operator:

    #include <algorithm>
    #include <iostream>
    #include <vector>
    #include <chrono>
    #include <random>
    #include <cstring>
    #include <cassert>
    #include <functional>
    
    typedef std::vector<int> vector_type;
    
    void test_memcpy(vector_type & destv, vector_type const & srcv)
    {
        vector_type::pointer       const dest = destv.data();
        vector_type::const_pointer const src  = srcv.data();
    
        std::memcpy(dest, src, srcv.size() * sizeof(vector_type::value_type));
    }
    
    void test_memmove(vector_type & destv, vector_type const & srcv)
    {
        vector_type::pointer       const dest = destv.data();
        vector_type::const_pointer const src  = srcv.data();
    
        std::memmove(dest, src, srcv.size() * sizeof(vector_type::value_type));
    }
    
    void test_std_copy(vector_type & dest, vector_type const & src)
    {
        std::copy(src.begin(), src.end(), dest.begin());
    }
    
    void test_assignment(vector_type & dest, vector_type const & src)
    {
        dest = src;
    }
    
    auto
    benchmark(std::function<void(vector_type &, vector_type const &)> copy_func)
        ->decltype(std::chrono::milliseconds().count())
    {
        std::random_device rd;
        std::mt19937 generator(rd());
        std::uniform_int_distribution<vector_type::value_type> distribution;
    
        static vector_type::size_type const num_elems = 2000;
    
        vector_type dest(num_elems);
        vector_type src(num_elems);
    
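        // Append random data to the source and destination vectors.
        // Note: both were constructed with num_elems zeros, so after these
        // push_back calls each vector holds 2 * num_elems elements.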
        for (vector_type::size_type i = 0; i < num_elems; ++i) {
            src.push_back(distribution(generator));
            dest.push_back(distribution(generator));
        }
    
        static int const iterations = 50000;
    
        std::chrono::time_point<std::chrono::system_clock> start, end;
    
        start = std::chrono::system_clock::now();
    
        for (int i = 0; i != iterations; ++i)
            copy_func(dest, src);
    
        end = std::chrono::system_clock::now();
    
        assert(src == dest);
    
        return
            std::chrono::duration_cast<std::chrono::milliseconds>(
                end - start).count();
    }
    
    int main()
    {
        std::cout
            << "memcpy:     " << benchmark(test_memcpy)     << " ms" << std::endl
            << "memmove:    " << benchmark(test_memmove)    << " ms" << std::endl
            << "std::copy:  " << benchmark(test_std_copy)   << " ms" << std::endl
            << "assignment: " << benchmark(test_assignment) << " ms" << std::endl
            << std::endl;
    }
    

    I went a little overboard with C++11 just for fun.

    Here are the results I get on my 64 bit Ubuntu box with g++ 4.6.3:

    $ g++ -O3 -std=c++0x foo.cpp ; ./a.out 
    memcpy:     33 ms
    memmove:    33 ms
    std::copy:  33 ms
    assignment: 34 ms
    

    The results are all quite comparable! The times stay comparable across all test cases when I change the vector's element type as well, e.g. to long long.

    Unless my benchmark rewrite is broken, it looks like your own benchmark isn't performing a valid comparison. HTH!
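As a variation on the timing loop, here is a sketch that uses std::chrono::steady_clock, which is monotonic, where system_clock can be adjusted while the program runs. The copier is taken as a template parameter instead of going through std::function. The names `time_copies_ms` and `memcpy_copier` are hypothetical, and the sketch does nothing to stop an optimizer from eliminating redundant copies, so treat any numbers from it with care.

```cpp
#include <cassert>
#include <chrono>
#include <cstring>
#include <vector>

// Hypothetical helper: run copy_func(dest, src) `iterations` times and
// return the elapsed wall time in milliseconds, measured on a monotonic clock.
template <typename CopyFunc>
long long time_copies_ms(CopyFunc copy_func,
                         std::vector<int>& dest,
                         const std::vector<int>& src,
                         int iterations) {
    assert(dest.size() == src.size());
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i != iterations; ++i)
        copy_func(dest, src);
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        end - start).count();
}

// Example copier with the same shape as test_memcpy in the answer above.
void memcpy_copier(std::vector<int>& dest, const std::vector<int>& src) {
    if (!src.empty())
        std::memcpy(dest.data(), src.data(), src.size() * sizeof(int));
}
```

A call such as `time_copies_ms(memcpy_copier, dest, src, 50000)` then plays the role of `benchmark(test_memcpy)`.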

  • 2021-02-05 15:26

    EDIT: I am leaving this answer up for reference; the odd timings with gcc seem to be an artifact of "code alignment" (see comments).


    I was about to say that this was an implementation glitch in gcc 4 at the time, but it might be more complicated than that. My results are (used 20000/20000 for the counters):

    $ g++ -Ofast a.cpp; ./a.out
    cast:      24 ms
    memcpy:    47 ms
    memmove:   24 ms
    std::copy: 24 ms
    (counter:  1787289600)
    
    $ g++ -O3 a.cpp; ./a.out
    cast:      24 ms
    memcpy:    24 ms
    memmove:   24 ms
    std::copy: 47 ms
    (counter:  1787289600)
    
    $ g++ --version
    g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008
    

    Notice how copy and memcpy results swap when compiling with -O3 and -Ofast. Also memmove is not slower than either.

    In clang the results are simpler:

    $ clang++ -O3 a.cpp; ./a.out
    cast:      26 ms
    memcpy:    26 ms
    memmove:   26 ms
    std::copy: 26 ms
    (counter:  1787289600)
    
    $ clang++ -Ofast a.cpp; ./a.out
    cast:      26 ms
    memcpy:    26 ms
    memmove:   26 ms
    std::copy: 26 ms
    (counter:  1787289600)
    
    $ clang++ --version
    clang version 9.0.0-2 (tags/RELEASE_900/final)
    

    perf results: https://pastebin.com/BZCZiAWQ

  • 2021-02-05 15:27

    memcpy and std::copy each have their uses. std::copy should (as pointed out by Cheers below) be as slow as memmove, because there is no guarantee that the memory regions will not overlap. On the other hand, since std::copy works through iterators, you can copy non-contiguous ranges very easily (think of sparsely allocated structures such as linked lists, or even custom classes/structures that implement iterators). memcpy only works on contiguous regions and as such can be heavily optimized.
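The two points above can be sketched with a pair of small functions. The names `list_to_vector` and `shift_left` are hypothetical and exist only to illustrate a non-contiguous source and an overlapping copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <list>
#include <vector>

// std::copy only needs iterators, so the source can be a linked list
// whose nodes are scattered across the heap.
std::vector<int> list_to_vector(const std::list<int>& nodes) {
    std::vector<int> out(nodes.size());
    std::copy(nodes.begin(), nodes.end(), out.begin());
    return out;
}

// Move the elements from index k onward to the front of the same buffer.
// Source and destination overlap, so memmove (not memcpy) is the safe call.
// The last k elements are left unchanged.
void shift_left(std::vector<int>& values, std::size_t k) {
    assert(k <= values.size());
    if (k == 0 || k == values.size()) return;
    std::memmove(values.data(), values.data() + k,
                 (values.size() - k) * sizeof(int));
}
```

memcpy could not be used for the first function at all, and using it for the second would be undefined behavior because the regions overlap.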

  • 2021-02-05 15:30

    Those are not the results I get:

    > g++ -O3 XX.cpp 
    > ./a.out
    cast:      5 ms
    memcpy:    4 ms
    std::copy: 3 ms
    (counter:  1264720400)
    
    Hardware: 2GHz Intel Core i7
    Memory:   8G 1333 MHz DDR3
    OS:       Mac OS X 10.7.5
    Compiler: i686-apple-darwin11-llvm-g++-4.2 (GCC) 4.2.1
    

    On a Linux box I get different results:

    > g++ -std=c++0x -O3 XX.cpp 
    > ./a.out 
    cast:      3 ms
    memcpy:    4 ms
    std::copy: 21 ms
    (counter:  731359744)
    
    
    Hardware:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
    Memory:    61363780 kB
    OS:        Linux ip-10-58-154-83 3.2.0-29-virtual #46-Ubuntu SMP
    Compiler:  g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
    