I have a function that is doing memcpy, but it's taking up an enormous amount of cycles. Is there a faster alternative/approach than using memcpy to move a piece of memory?
Check your compiler/platform manual. For some microprocessors and DSP kits, using memcpy is much slower than intrinsic functions or DMA operations.
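For example, MSVC on x86/x64 exposes the rep movsb string-move instruction as the __movsb intrinsic in <intrin.h>; on CPUs with "enhanced rep movsb" (ERMSB) this can be competitive with memcpy for large blocks. A minimal sketch (whether it actually wins on your target is something only a benchmark will tell you):

#include <intrin.h>   // __movsb (MSVC)
#include <cstddef>

// Copy n bytes with the x86 "rep movsb" instruction. How this compares
// to your platform's memcpy depends on the CPU (ERMSB support),
// alignment and block size -- measure before committing to it.
void copy_rep_movsb(void* dst, const void* src, std::size_t n) {
    __movsb(static_cast<unsigned char*>(dst),
            static_cast<const unsigned char*>(src), n);
}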
You may want to have a look at this:
http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
Another idea I would try is to use COW techniques to duplicate the memory block and let the OS handle the copying on demand as soon as a page is written to. There are some hints on this using mmap() in: Can I do a copy-on-write memcpy in Linux?
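A minimal sketch of the idea on Linux (assuming memfd_create(), available since glibc 2.27; error handling omitted). The private mapping shares physical pages with the original until they are written, at which point the kernel copies only the touched pages:

#include <sys/mman.h>   // mmap, munmap, memfd_create (glibc 2.27+)
#include <unistd.h>     // ftruncate, close

int main() {
    const size_t size = 128u << 20;           // 128 MiB
    int fd = memfd_create("cow_buf", 0);      // anonymous in-memory file
    ftruncate(fd, size);

    // The original buffer: a shared mapping of the file.
    char* orig = (char*)mmap(nullptr, size, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

    // The "copy": a private (copy-on-write) mapping of the same pages.
    // No data moves here; pages are duplicated lazily, one page at a
    // time, only when the copy is written to.
    char* copy = (char*)mmap(nullptr, size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);

    orig[0] = 'a';   // goes to the shared pages
    copy[0] = 'b';   // faults in a private copy of just that page

    // Caveat: whether later writes to orig show up in not-yet-touched
    // parts of copy is unspecified for MAP_PRIVATE mappings.
    munmap(copy, size);
    munmap(orig, size);
    close(fd);
}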
I assume you must have huge areas of memory that you want to copy around if the performance of memcpy has become an issue for you?
In this case, I'd agree with nos's suggestion to figure out some way NOT to copy stuff.
Instead of having one huge blob of memory to be copied around whenever you need to change it, you should probably try some alternative data structures instead.
Without really knowing anything about your problem area, I would suggest taking a good look at persistent data structures and either implementing one of your own or reusing an existing implementation.
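As a rough illustration of the structural-sharing idea (a sketch, not any particular library): a persistent singly linked list in C++, where "modifying" the front creates a new head node but shares the entire tail with the old version, so no bulk copy ever happens:

#include <memory>

// A persistent (immutable) list: every version shares structure with
// the versions it was derived from, so "copies" are O(1).
template <typename T>
struct PList {
    struct Node {
        T value;
        std::shared_ptr<const Node> next;
    };
    std::shared_ptr<const Node> head;

    // Returns a new list with v prepended; the old list is untouched
    // and its nodes are shared, not copied.
    PList push_front(T v) const {
        return PList{std::make_shared<const Node>(Node{std::move(v), head})};
    }
};

// Usage: all versions coexist without copying any nodes.
// PList<int> a;                     // []
// PList<int> b = a.push_front(1);  // [1]
// PList<int> c = b.push_front(2);  // [2, 1] -- shares node "1" with b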
Here are some benchmarks with Visual C++ on a Ryzen 1700.
The benchmark copies 16 KiB (non-overlapping) chunks of data from a 128 MiB ring buffer 8*8192 times (in total, 1 GiB of data is copied).
I then normalize the result: below are wall clock time in milliseconds and a throughput value for 60 Hz (i.e. how much data the function can process over one 16.667-millisecond frame).
memcpy 2.761 milliseconds ( 772.555 MiB/frame)
As you can see, the builtin memcpy is fast, but how fast?
64-wide load/store 39.889 milliseconds ( 427.853 MiB/frame)
32-wide load/store 33.765 milliseconds ( 505.450 MiB/frame)
16-wide load/store 24.033 milliseconds ( 710.129 MiB/frame)
8-wide load/store 23.962 milliseconds ( 712.245 MiB/frame)
4-wide load/store 22.965 milliseconds ( 743.176 MiB/frame)
2-wide load/store 22.573 milliseconds ( 756.072 MiB/frame)
1-wide load/store 35.032 milliseconds ( 487.169 MiB/frame)
The above is just the code below with variations of n.
// n is the "wideness" from the benchmark; it is a compile-time
// constant here so that temp[] has a fixed size (no VLA in C++)
constexpr int32_t n = 2;

auto src = (__m128i*)get_src_chunk();
auto dst = (__m128i*)get_dst_chunk();

// 16 KiB per chunk = 1024 16-byte vectors, copied n at a time
for (int32_t i = 0; i < (16 * 1024) / 16; i += n) {
    __m128i temp[n];
    for (int32_t j = 0; j < n; j++) {
        temp[j] = _mm_loadu_si128(src++);    // load n vectors from the source
    }
    for (int32_t j = 0; j < n; j++) {
        _mm_storeu_si128(dst++, temp[j]);    // store them to the destination
    }
}
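For reference, the shape of the harness that produces numbers like the above (a sketch: get_src_chunk/get_dst_chunk stand in for the ring-buffer bookkeeping, which the answer doesn't show, and std::chrono supplies the wall clock):

#include <chrono>
#include <cstdio>

void benchmark(void (*copy_chunk)(), const char* name) {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    for (int pass = 0; pass < 8; pass++)       // 8 passes ...
        for (int i = 0; i < 8192; i++)         // ... of 8192 16-KiB chunks
            copy_chunk();                      // = 1 GiB copied in total
    double ms = std::chrono::duration<double, std::milli>(clock::now() - t0).count();
    // Normalize: how many MiB fit into one 60 Hz frame (16.667 ms)?
    double mib_per_frame = 1024.0 * (1000.0 / 60.0) / ms;
    std::printf("%s %.3f milliseconds (%.3f MiB/frame)\n", name, ms, mib_per_frame);
}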
These are my best guesses to explain the results I'm seeing. Based on what I know about the Zen microarchitecture, it can only fetch 32 bytes per cycle; that's why we max out at 2x 16-byte load/store. Those same 32 bytes per cycle can instead be moved as a single 256-bit (ymm0) transfer rather than two 128-bit (xmm0) transfers, halving the instruction count. And that's why it is about twice as fast, and internally exactly what memcpy does (or what it should be doing if you enable the right optimizations for your platform).
It is also impossible to make this faster, since we are now limited by the cache bandwidth, which doesn't go any faster. I think this is quite an important fact to point out, because if you are memory bound and looking for a faster solution, you will be looking for a very long time.