This is a follow-up to this question where I posted this program:
#include
#include
#include
#include
Looks to me like the answer is that gcc can optimize these particular calls to memmove and memcpy, but not std::copy. gcc is aware of the semantics of memmove and memcpy, and in this case can take advantage of the fact that the size is known (sizeof(int)) to turn the call into a single mov instruction.
std::copy is implemented in terms of memcpy, but apparently the gcc optimizer doesn't manage to figure out that data + sizeof(int) - data is exactly sizeof(int). So the benchmark calls memcpy.
I got all that by invoking gcc with -S and flipping quickly through the output; I could easily have gotten it wrong, but what I saw seems consistent with your measurements.
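To make the comparison concrete, here is a minimal sketch of the two kinds of call being discussed. The original program is not reproduced here, so the helper names (load_with_memcpy, load_with_std_copy, check_loads) and the byte-buffer setup are assumptions made purely for illustration. Both helpers copy exactly sizeof(int) bytes; whether a given compiler lowers each one to a single load has to be checked in its assembly output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>

// Hypothetical helper: read one int from a byte buffer with memcpy.
// The size argument is the compile-time constant sizeof(int).
int load_with_memcpy(const unsigned char* data) {
    int value;
    std::memcpy(&value, data, sizeof(int));
    return value;
}

// Hypothetical helper: the same read expressed as std::copy over a byte
// range. The length data + sizeof(int) - data must be folded by the
// optimizer before it is seen as a constant.
int load_with_std_copy(const unsigned char* data) {
    int value;
    std::copy(data, data + sizeof(int),
              reinterpret_cast<unsigned char*>(&value));
    return value;
}

// Round-trips a value through a byte buffer and checks that both helpers
// return it unchanged.
void check_loads(int expected) {
    unsigned char buffer[sizeof(int)];
    std::memcpy(buffer, &expected, sizeof(int));
    assert(load_with_memcpy(buffer) == expected);
    assert(load_with_std_copy(buffer) == expected);
}
```

Compiling a file containing these functions with g++ -O3 -S and comparing the bodies of the two helpers is one way to see whether a call to memcpy survives in either of them.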
By the way, I think the test is more or less meaningless. A more plausible real-world test might be creating an actual vector<int> src and an int[N] dst, and then comparing memcpy(dst, src.data(), sizeof(int)*src.size()) with std::copy(src.begin(), src.end(), dst).
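A rough sketch of that more realistic setup could look like the following. The value of N and the function names are assumptions chosen only for illustration, and the std::copy call passes the array itself as the destination, which decays to an int* pointing at its first element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t N = 1024;  // assumed size, chosen only for illustration

// Copies the vector's elements into dst as raw bytes.
void copy_with_memcpy(const std::vector<int>& src, int (&dst)[N]) {
    assert(src.size() <= N);
    if (src.empty()) return;  // avoid passing a possibly null data() pointer
    std::memcpy(dst, src.data(), sizeof(int) * src.size());
}

// Copies the same elements through the iterator interface.
void copy_with_std_copy(const std::vector<int>& src, int (&dst)[N]) {
    assert(src.size() <= N);
    std::copy(src.begin(), src.end(), dst);
}
```

Here the number of bytes is only known at run time, so neither call can be reduced to a single fixed-size load, which makes the comparison between the two fairer.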
According to the assembler output of G++ 4.8.1, test_memcpy:
movl (%r15), %r15d
test_std_copy:
movl $4, %edx
movq %r15, %rsi
leaq 16(%rsp), %rdi
call memcpy
As you can see, std::copy successfully recognized that it can copy the data with memcpy, but for some reason further inlining did not happen, and that is the reason for the performance difference.
By the way, Clang 3.4 produces identical code for both cases:
movl (%r14,%rbx), %ebp
I agree with @rici's comment about developing a more meaningful benchmark, so I rewrote your test to benchmark copying of two vectors using memcpy(), memmove(), std::copy() and the std::vector assignment operator:
#include <algorithm>
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <cstring>
#include <cassert>
#include <functional>  // std::function

typedef std::vector<int> vector_type;

void test_memcpy(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer const dest = destv.data();
    vector_type::const_pointer const src = srcv.data();
    std::memcpy(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_memmove(vector_type & destv, vector_type const & srcv)
{
    vector_type::pointer const dest = destv.data();
    vector_type::const_pointer const src = srcv.data();
    std::memmove(dest, src, srcv.size() * sizeof(vector_type::value_type));
}

void test_std_copy(vector_type & dest, vector_type const & src)
{
    std::copy(src.begin(), src.end(), dest.begin());
}

void test_assignment(vector_type & dest, vector_type const & src)
{
    dest = src;
}

auto
benchmark(std::function<void(vector_type &, vector_type const &)> copy_func)
    ->decltype(std::chrono::milliseconds().count())
{
    std::random_device rd;
    std::mt19937 generator(rd());
    std::uniform_int_distribution<vector_type::value_type> distribution;
    static vector_type::size_type const num_elems = 2000;
    vector_type dest(num_elems);
    vector_type src(num_elems);
    // Fill the source and destination vectors with random data.
    // Both vectors already hold num_elems elements, so assign in place;
    // push_back would append and double their size.
    for (vector_type::size_type i = 0; i < num_elems; ++i) {
        src[i] = distribution(generator);
        dest[i] = distribution(generator);
    }
    static int const iterations = 50000;
    std::chrono::time_point<std::chrono::system_clock> start, end;
    start = std::chrono::system_clock::now();
    for (int i = 0; i != iterations; ++i)
        copy_func(dest, src);
    end = std::chrono::system_clock::now();
    // Every copy function must leave dest equal to src.
    assert(src == dest);
    return
        std::chrono::duration_cast<std::chrono::milliseconds>(
            end - start).count();
}

int main()
{
    std::cout
        << "memcpy: " << benchmark(test_memcpy) << " ms" << std::endl
        << "memmove: " << benchmark(test_memmove) << " ms" << std::endl
        << "std::copy: " << benchmark(test_std_copy) << " ms" << std::endl
        << "assignment: " << benchmark(test_assignment) << " ms" << std::endl
        << std::endl;
}
I went a little overboard with C++11 just for fun.
Here are the results I get on my 64 bit Ubuntu box with g++ 4.6.3:
$ g++ -O3 -std=c++0x foo.cpp ; ./a.out
memcpy: 33 ms
memmove: 33 ms
std::copy: 33 ms
assignment: 34 ms
The results are all quite comparable! I also get comparable times in all test cases when I change the integer type in the vector, e.g. to long long.
Unless my benchmark rewrite is broken, it looks like your own benchmark isn't performing a valid comparison. HTH!
EDIT: I am leaving this answer for reference; the odd timings with gcc seem to be an artifact of "code alignment" (see comments).
I was about to say that this was an implementation glitch in gcc 4 at the time, but it might be more complicated than that. My results are (used 20000/20000 for the counters):
$ g++ -Ofast a.cpp; ./a.out
cast: 24 ms
memcpy: 47 ms
memmove: 24 ms
std::copy: 24 ms
(counter: 1787289600)
$ g++ -O3 a.cpp; ./a.out
cast: 24 ms
memcpy: 24 ms
memmove: 24 ms
std::copy: 47 ms
(counter: 1787289600)
$ g++ --version
g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008
Notice how the std::copy and memcpy results swap between -O3 and -Ofast. Also, memmove is not slower than either.
In clang the results are simpler:
$ clang++ -O3 a.cpp; ./a.out
cast: 26 ms
memcpy: 26 ms
memmove: 26 ms
std::copy: 26 ms
(counter: 1787289600)
$ clang++ -Ofast a.cpp; ./a.out
cast: 26 ms
memcpy: 26 ms
memmove: 26 ms
std::copy: 26 ms
(counter: 1787289600)
$ clang++ --version
clang version 9.0.0-2 (tags/RELEASE_900/final)
perf results: https://pastebin.com/BZCZiAWQ
memcpy and std::copy each have their uses. std::copy should (as pointed out by Cheers below) be as slow as memmove, because there is no guarantee that the memory regions will not overlap. It also lets you copy non-contiguous regions very easily, since it works on iterators (think of sparsely allocated structures such as linked lists, or even custom classes/structures that implement iterators). memcpy only works on contiguous regions and as such can be heavily optimized.
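As a small illustration of that difference, the sketch below copies from a std::list with std::copy and from a std::vector with memcpy. The function names are hypothetical and exist only for this example; it says nothing about which of the two is faster.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <list>
#include <vector>

// std::copy only needs iterators, so the source can be a linked list whose
// nodes are scattered across the heap.
void copy_from_list(const std::list<int>& src, std::vector<int>& dst) {
    assert(dst.size() >= src.size());
    std::copy(src.begin(), src.end(), dst.begin());
}

// memcpy needs one contiguous block of bytes on each side, which
// std::vector exposes through data().
void copy_from_vector(const std::vector<int>& src, std::vector<int>& dst) {
    assert(dst.size() >= src.size());
    if (src.empty()) return;  // avoid passing a possibly null data() pointer
    std::memcpy(dst.data(), src.data(), src.size() * sizeof(int));
}
```

There is no memcpy equivalent of copy_from_list, because a std::list does not store its elements in one contiguous block.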
Those are not the results I get:
> g++ -O3 XX.cpp
> ./a.out
cast: 5 ms
memcpy: 4 ms
std::copy: 3 ms
(counter: 1264720400)
Hardware: 2GHz Intel Core i7
Memory: 8G 1333 MHz DDR3
OS: Mac OS X 10.7.5
Compiler: i686-apple-darwin11-llvm-g++-4.2 (GCC) 4.2.1
On a Linux box I get different results:
> g++ -std=c++0x -O3 XX.cpp
> ./a.out
cast: 3 ms
memcpy: 4 ms
std::copy: 21 ms
(counter: 731359744)
Hardware: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Memory: 61363780 kB
OS: Linux ip-10-58-154-83 3.2.0-29-virtual #46-Ubuntu SMP
Compiler: g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3