In what cases should I use memcpy over standard operators in C++?

When can I get better performance using memcpy or how do I benefit from using it? For example:

float a[3]; float b[3];

is code:

memcpy(a, b, 3*sizeof(float));

faster than this one?

a[0] = b[0];
a[1] = b[1];
a[2] = b[2];

Efficiency should not be your concern.
Write clean maintainable code.

It bothers me that so many answers indicate that the memcpy() is inefficient. It is designed to be the most efficient way of copy blocks of memory (for C programs).

So I wrote the following as a test:

#include <algorithm>

extern float a[3];
extern float b[3];
extern void base();

int main()
{
    base();

#if defined(M1)
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[2];
#elif defined(M2)
    memcpy(a, b, 3*sizeof(float));    
#elif defined(M3)
    std::copy(&a[0], &a[3], &b[0]);
 #endif

    base();
}

Then to compare the code produces:

g++ -O3 -S xr.cpp -o s0.s
g++ -O3 -S xr.cpp -o s1.s -DM1
g++ -O3 -S xr.cpp -o s2.s -DM2
g++ -O3 -S xr.cpp -o s3.s -DM3

echo "=======" >  D
diff s0.s s1.s >> D
echo "=======" >> D
diff s0.s s2.s >> D
echo "=======" >> D
diff s0.s s3.s >> D

This resulted in: (comments added by hand)

=======   // Copy by hand
10a11,18
>   movq    _a@GOTPCREL(%rip), %rcx
>   movq    _b@GOTPCREL(%rip), %rdx
>   movl    (%rdx), %eax
>   movl    %eax, (%rcx)
>   movl    4(%rdx), %eax
>   movl    %eax, 4(%rcx)
>   movl    8(%rdx), %eax
>   movl    %eax, 8(%rcx)

=======    // memcpy()
10a11,16
>   movq    _a@GOTPCREL(%rip), %rcx
>   movq    _b@GOTPCREL(%rip), %rdx
>   movq    (%rdx), %rax
>   movq    %rax, (%rcx)
>   movl    8(%rdx), %eax
>   movl    %eax, 8(%rcx)

=======    // std::copy()
10a11,14
>   movq    _a@GOTPCREL(%rip), %rsi
>   movl    $12, %edx
>   movq    _b@GOTPCREL(%rip), %rdi
>   call    _memmove

Added Timing results for running the above inside a loop of 1000000000.

   g++ -c -O3 -DM1 X.cpp
   g++ -O3 X.o base.o -o m1
   g++ -c -O3 -DM2 X.cpp
   g++ -O3 X.o base.o -o m2
   g++ -c -O3 -DM3 X.cpp
   g++ -O3 X.o base.o -o m3
   time ./m1

   real 0m2.486s
   user 0m2.478s
   sys  0m0.005s
   time ./m2

   real 0m1.859s
   user 0m1.853s
   sys  0m0.004s
   time ./m3

   real 0m1.858s
   user 0m1.851s
   sys  0m0.006s

crazylammer

You can use memcpy only if the objects you're copying have no explicit constructors, so as their members (so-called POD, "Plain Old Data"). So it is OK to call memcpy for float, but it is wrong for, e.g., std::string.

But part of the work has already been done for you: std::copy from <algorithm> is specialized for built-in types (and possibly for every other POD-type - depends on STL implementation). So writing std::copy(a, a + 3, b) is as fast (after compiler optimization) as memcpy, but is less error-prone.

Compilers specifically optimize memcpy calls, at least clang & gcc does. So you should prefer it wherever you can.

Use std::copy(). As the header file for g++ notes:

This inline function will boil down to a call to @c memmove whenever possible.

Probably, Visual Studio's is not much different. Go with the normal way, and optimize once you're aware of a bottle neck. In the case of a simple copy, the compiler is probably already optimizing for you.

Don't go for premature micro-optimisations such as using memcpy like this. Using assignment is clearer and less error-prone and any decent compiler will generate suitably efficient code. If, and only if, you have profiled the code and found the assignments to be a significant bottleneck then you can consider some kind of micro-optimisation, but in general you should always write clear, robust code in the first instance.

The benefits of memcpy? Probably readability. Otherwise, you would have to either do a number of assignments or have a for loop for copying, neither of which are as simple and clear as just doing memcpy (of course, as long as your types are simple and don't require construction/destruction).

Also, memcpy is generally relatively optimized for specific platforms, to the point that it won't be all that much slower than simple assignment, and may even be faster.

Supposedly, as Nawaz said, the assignment version should be faster on most platform. That's because memcpy() will copy byte by byte while the second version could copy 4 bytes at a time.

As it's always the case, you should always profile applications to be sure that what you expect to be the bottleneck matches the reality.

Edit
Same applies to dynamic array. Since you mention C++ you should use std::copy() algorithm in that case.

Edit
This is code output for Windows XP with GCC 4.5.0, compiled with -O3 flag:

extern "C" void cpy(float* d, float* s, size_t n)
{
    memcpy(d, s, sizeof(float)*n);
}

I have done this function because OP specified dynamic arrays too.

Output assembly is the following:

_cpy:
LFB393:
    pushl   %ebp
LCFI0:
    movl    %esp, %ebp
LCFI1:
    pushl   %edi
LCFI2:
    pushl   %esi
LCFI3:
    movl    8(%ebp), %eax
    movl    12(%ebp), %esi
    movl    16(%ebp), %ecx
    sall    $2, %ecx
    movl    %eax, %edi
    rep movsb
    popl    %esi
LCFI4:
    popl    %edi
LCFI5:
    leave
LCFI6:
    ret

of course, I assume all of the experts here knows what rep movsb means.

This is the assignment version:

extern "C" void cpy2(float* d, float* s, size_t n)
{
    while (n > 0) {
        d[n] = s[n];
        n--;
    }
}

which yields the following code:

_cpy2:
LFB394:
    pushl   %ebp
LCFI7:
    movl    %esp, %ebp
LCFI8:
    pushl   %ebx
LCFI9:
    movl    8(%ebp), %ebx
    movl    12(%ebp), %ecx
    movl    16(%ebp), %eax
    testl   %eax, %eax
    je  L2
    .p2align 2,,3
L5:
    movl    (%ecx,%eax,4), %edx
    movl    %edx, (%ebx,%eax,4)
    decl    %eax
    jne L5
L2:
    popl    %ebx
LCFI10:
    leave
LCFI11:
    ret

Which moves 4 bytes at a time.

来源：https://stackoverflow.com/questions/4544804/in-what-cases-should-i-use-memcpy-over-standard-operators-in-c

标签

c++

performance

memory

allocation

copying