I have a function that is doing memcpy, but it's taking up an enormous amount of cycles. Is there a faster alternative/approach than using memcpy to move a piece of memory?
Check your compiler/platform manual. For some microprocessors and DSP kits, using memcpy is much slower than intrinsic functions or DMA operations.
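For example, MSVC on x86/x64 exposes the rep movsb string-move instruction as the __movsb intrinsic in <intrin.h>; on CPUs with "enhanced rep movsb" (ERMSB) this can be competitive with memcpy for large blocks. A minimal sketch (whether it actually wins on your target is something only a benchmark will tell you):

#include <intrin.h>   // __movsb (MSVC)
#include <cstddef>

// Copy n bytes with the x86 "rep movsb" instruction. How this compares
// to your platform's memcpy depends on the CPU (ERMSB support),
// alignment and block size -- measure before committing to it.
void copy_rep_movsb(void* dst, const void* src, std::size_t n) {
    __movsb(static_cast<unsigned char*>(dst),
            static_cast<const unsigned char*>(src), n);
}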
You may want to have a look at this:
http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
Another idea I would try is to use COW techniques to duplicate the memory block and let the OS handle the copying on demand as soon as a page is written to. There are some hints on this using mmap() in: Can I do a copy-on-write memcpy in Linux?
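A minimal sketch of the idea on Linux (assuming memfd_create(), available since glibc 2.27; error handling omitted). The private mapping shares physical pages with the original until they are written, at which point the kernel copies only the touched pages:

#include <sys/mman.h>   // mmap, munmap, memfd_create (glibc 2.27+)
#include <unistd.h>     // ftruncate, close

int main() {
    const size_t size = 128u << 20;           // 128 MiB
    int fd = memfd_create("cow_buf", 0);      // anonymous in-memory file
    ftruncate(fd, size);

    // The original buffer: a shared mapping of the file.
    char* orig = (char*)mmap(nullptr, size, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

    // The "copy": a private (copy-on-write) mapping of the same pages.
    // No data moves here; pages are duplicated lazily, one page at a
    // time, only when the copy is written to.
    char* copy = (char*)mmap(nullptr, size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);

    orig[0] = 'a';   // goes to the shared pages
    copy[0] = 'b';   // faults in a private copy of just that page

    // Caveat: whether later writes to orig show up in not-yet-touched
    // parts of copy is unspecified for MAP_PRIVATE mappings.
    munmap(copy, size);
    munmap(orig, size);
    close(fd);
}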
I assume you must have huge areas of memory that you want to copy around if the performance of memcpy has become an issue for you?
In this case, I'd agree with nos's suggestion to figure out some way NOT to copy stuff.
Instead of having one huge blob of memory to be copied around whenever you need to change it, you should probably try some alternative data structures instead.
Without really knowing anything about your problem area, I would suggest taking a good look at persistent data structures and either implementing one of your own or reusing an existing implementation.
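As a rough illustration of the structural-sharing idea (a sketch, not any particular library): a persistent singly linked list in C++, where "modifying" the front creates a new head node but shares the entire tail with the old version, so no bulk copy ever happens:

#include <memory>

// A persistent (immutable) list: every version shares structure with
// the versions it was derived from, so "copies" are O(1).
template <typename T>
struct PList {
    struct Node {
        T value;
        std::shared_ptr<const Node> next;
    };
    std::shared_ptr<const Node> head;

    // Returns a new list with v prepended; the old list is untouched
    // and its nodes are shared, not copied.
    PList push_front(T v) const {
        return PList{std::make_shared<const Node>(Node{std::move(v), head})};
    }
};

// Usage: all versions coexist without copying any nodes.
// PList<int> a;                     // []
// PList<int> b = a.push_front(1);  // [1]
// PList<int> c = b.push_front(2);  // [2, 1] -- shares node "1" with b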
Here are some benchmarks with Visual C++ on a Ryzen 1700.
The benchmark copies 16 KiB (non-overlapping) chunks of data from a 128 MiB ring buffer 8*8192 times (in total, 1 GiB of data is copied).
I then normalize the result: below are wall clock time in milliseconds and a throughput value for 60 Hz (i.e. how much data the function can process over one 16.667-millisecond frame).
memcpy 2.761 milliseconds ( 772.555 MiB/frame)
As you can see, the builtin memcpy is fast, but how fast?
64-wide load/store 39.889 milliseconds ( 427.853 MiB/frame)
32-wide load/store 33.765 milliseconds ( 505.450 MiB/frame)
16-wide load/store 24.033 milliseconds ( 710.129 MiB/frame)
8-wide load/store 23.962 milliseconds ( 712.245 MiB/frame)
4-wide load/store 22.965 milliseconds ( 743.176 MiB/frame)
2-wide load/store 22.573 milliseconds ( 756.072 MiB/frame)
1-wide load/store 35.032 milliseconds ( 487.169 MiB/frame)
The above is just the code below with variations of n.
// n is the "wideness" from the benchmark; it is a compile-time
// constant here so that temp[] has a fixed size (no VLA in C++)
constexpr int32_t n = 2;

auto src = (__m128i*)get_src_chunk();
auto dst = (__m128i*)get_dst_chunk();

// 16 KiB per chunk = 1024 16-byte vectors, copied n at a time
for (int32_t i = 0; i < (16 * 1024) / 16; i += n) {
    __m128i temp[n];
    for (int32_t j = 0; j < n; j++) {
        temp[j] = _mm_loadu_si128(src++);    // load n vectors from the source
    }
    for (int32_t j = 0; j < n; j++) {
        _mm_storeu_si128(dst++, temp[j]);    // store them to the destination
    }
}
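For reference, the shape of the harness that produces numbers like the above (a sketch: get_src_chunk/get_dst_chunk stand in for the ring-buffer bookkeeping, which the answer doesn't show, and std::chrono supplies the wall clock):

#include <chrono>
#include <cstdio>

void benchmark(void (*copy_chunk)(), const char* name) {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    for (int pass = 0; pass < 8; pass++)       // 8 passes ...
        for (int i = 0; i < 8192; i++)         // ... of 8192 16-KiB chunks
            copy_chunk();                      // = 1 GiB copied in total
    double ms = std::chrono::duration<double, std::milli>(clock::now() - t0).count();
    // Normalize: how many MiB fit into one 60 Hz frame (16.667 ms)?
    double mib_per_frame = 1024.0 * (1000.0 / 60.0) / ms;
    std::printf("%s %.3f milliseconds (%.3f MiB/frame)\n", name, ms, mib_per_frame);
}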
These are my best guesses to explain the results I'm seeing. Based on what I know about the Zen microarchitecture, it can only fetch 32 bytes per cycle; that's why we max out at 2x 16-byte load/store. Those same 32 bytes per cycle can instead be moved as a single 256-bit (ymm0) transfer rather than two 128-bit (xmm0) transfers, halving the instruction count. And that's why it is about twice as fast, and internally exactly what memcpy does (or what it should be doing if you enable the right optimizations for your platform).
It is also impossible to make this faster, since we are now limited by the cache bandwidth, which doesn't go any faster. I think this is quite an important fact to point out, because if you are memory bound and looking for a faster solution, you will be looking for a very long time.