I have a function that is doing memcpy, but it's taking up an enormous amount of cycles. Is there a faster alternative/approach than using memcpy to move a piece of memory?
Memory-to-memory copy is usually supported in the CPU's instruction set, and memcpy will usually use it. This is usually the fastest way to do it.
You should check what exactly your CPU is doing. On Linux, watch for swap-in and swap-out activity and virtual memory effectiveness with sar -B 1 or vmstat 1 or by looking in /proc/meminfo. You may see that your copy has to push out a lot of pages to free space, or read them in, etc.
That would mean your problem isn't in what you use for the copy, but how your system uses memory. You may need to decrease file cache or start writing out earlier, or lock the pages in memory, etc.
Actually, memcpy is NOT the fastest way, especially if you call it many times. I also had some code that I really needed to speed up, and memcpy is slow because it has too many unnecessary checks. For example, some implementations check whether the destination and source memory blocks overlap and whether they should start copying from the back of the block rather than the front. If you do not care about such considerations, you can certainly do significantly better. I have some code, but here is perhaps an even better version:
Very fast memcpy for image processing?.
If you search, you can find other implementations as well. But for true speed you need an assembly version.
You should check the assembly code generated for your code. What you don't want is to have the memcpy call generate a call to the memcpy function in the standard library; what you want is to have the best ASM instruction repeated to copy the largest amount of data, something like rep movsq.
How can you achieve this? Well, the compiler optimizes calls to memcpy by replacing them with simple movs as long as it knows how much data it should copy. You can see this if you call memcpy with a well-determined (constexpr) size. If the compiler doesn't know the value, it will have to fall back to the byte-level implementation of memcpy, the issue being that memcpy has to respect one-byte granularity. It will still move 128 bits at a time, but after each 128 bits it has to check whether it has enough data left to copy another 128 bits, or whether it must fall back to 64 bits, then to 32 and 8 (I think 16 might be suboptimal anyway, but I don't know for sure).
So what you want is to be able to tell memcpy the size of your data with constant expressions that the compiler can optimize. This way no call to memcpy is performed. What you don't want is to pass memcpy a size that will only be known at run time. That translates into a function call and tons of tests to pick the best copy instruction. Sometimes, a simple for loop is better than memcpy for this reason (eliminating one function call). And what you really, really don't want is to pass memcpy an odd number of bytes to copy.
This function could cause a data abort exception if one of the pointers (input arguments) is not aligned to 32 bits.
memcpy is likely to be the fastest way you can copy bytes around in memory. If you need something faster, try figuring out a way of not copying things around, e.g. swap pointers only, not the data itself.
If your platform supports it, look into whether you can use the mmap() system call to leave your data in the file; generally the OS can manage that better. And, as everyone has been saying, avoid copying if at all possible; pointers are your friend in cases like this.