faster alternative to memcpy?

一生所求 2020-11-29 21:27

I have a function that does a memcpy, but it's taking up an enormous number of cycles. Is there a faster alternative/approach than using memcpy to move a piece of memory?

16 answers
  • 2020-11-29 21:48

    Check your compiler/platform manual. For some microprocessors and DSP kits, using memcpy is much slower than intrinsic functions or DMA operations.
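
    For example, on x86 one such intrinsic is MSVC's __movsb, which emits a rep movsb instruction; on CPUs with "enhanced rep movsb" support it can compete with the library memcpy for large blocks. A minimal sketch (my own illustration, not from this answer, so benchmark it on your target):

    #include <intrin.h>   // MSVC: __movsb
    #include <cstddef>

    // Copy n bytes with a single "rep movsb"; whether this beats memcpy
    // is entirely platform-dependent.
    void copy_movsb(void* dst, const void* src, std::size_t n) {
        __movsb(static_cast<unsigned char*>(dst),
                static_cast<const unsigned char*>(src), n);
    }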

  • 2020-11-29 21:48

    You may want to have a look at this:

    http://www.danielvik.com/2010/02/fast-memcpy-in-c.html

    Another idea I would try is to use copy-on-write (COW) techniques to duplicate the memory block and let the OS handle the copying on demand, as soon as a page is written to. There are some hints on doing that with mmap() here: Can I do a copy-on-write memcpy in Linux?
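
    A minimal sketch of the mmap() approach (my own illustration, assuming a Linux system and an existing file data.bin backing the buffer):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        const size_t len = 1 << 20;             // assumed 1 MiB buffer
        int fd = open("data.bin", O_RDWR);      // hypothetical backing file
        if (fd < 0) return 1;

        // The "original": a shared mapping backed by the file.
        char* original = (char*)mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, 0);

        // The "copy": a private (COW) mapping of the same pages. Nothing is
        // physically copied here; the kernel duplicates pages lazily on write.
        char* copy = (char*)mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE, fd, 0);

        copy[0] = 'x';    // faults in and copies just this one page

        munmap(original, len);
        munmap(copy, len);
        close(fd);
    }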

  • 2020-11-29 21:50

    I assume you must have huge areas of memory that you want to copy around, if the performance of memcpy has become an issue for you.

    In this case, I'd agree with nos's suggestion to figure out some way NOT to copy stuff.

    Instead of having one huge blob of memory that gets copied around whenever you need to change it, you should probably try some alternative data structures.

    Without really knowing anything about your problem area, I would suggest taking a good look at persistent data structures and either implementing one of your own or reusing an existing implementation; a minimal sketch follows below.
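
    To illustrate the idea of structural sharing (a sketch of my own, not from the answer; real-world options include ropes, persistent trees, or a library such as immer):

    #include <memory>

    // A persistent singly-linked list: an "update" allocates one new node
    // and shares the entire unchanged tail, so nothing gets memcpy'd.
    struct Node {
        int value;
        std::shared_ptr<const Node> next;
    };
    using List = std::shared_ptr<const Node>;

    // O(1) prepend; the old list stays valid and untouched.
    List push_front(int value, List tail) {
        return std::make_shared<const Node>(Node{value, std::move(tail)});
    }

    int main() {
        List a = push_front(1, nullptr);   // [1]
        List b = push_front(2, a);         // [2, 1], shares [1] with a
    }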

  • 2020-11-29 21:50

    Here are some benchmarks, Visual C++ on a Ryzen 1700.

    The benchmark copies 16 KiB (non-overlapping) chunks of data from a 128 MiB ring buffer 8*8192 times (in total, 1 GiB of data is copied).

    I then normalize the result: below I present wall-clock time in milliseconds and a throughput value for 60 Hz (i.e. how much data the function can process within one 16.667 ms frame).
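
    To make the normalization concrete (my own arithmetic, inferred from the table rather than stated explicitly): MiB/frame = 1024 MiB × 16.667 ms ÷ elapsed ms, so for example the 16-wide row below gives 1024 × 16.667 / 24.033 ≈ 710.1 MiB/frame.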

    memcpy                           2.761 milliseconds ( 772.555 MiB/frame)
    

    As you can see, the built-in memcpy is fast. But how does it compare to hand-rolled load/store loops?

    64-wide load/store              39.889 milliseconds (  427.853 MiB/frame)
    32-wide load/store              33.765 milliseconds (  505.450 MiB/frame)
    16-wide load/store              24.033 milliseconds (  710.129 MiB/frame)
     8-wide load/store              23.962 milliseconds (  712.245 MiB/frame)
     4-wide load/store              22.965 milliseconds (  743.176 MiB/frame)
     2-wide load/store              22.573 milliseconds (  756.072 MiB/frame)
     1-wide load/store              35.032 milliseconds (  487.169 MiB/frame)
    

    The numbers above all come from the code below, with varying n.

    // n is the "wideness" from the benchmark; it has to be a compile-time
    // constant (it sizes the temp array), so each variant is an
    // instantiation of this template.
    template <int32_t n>
    void copy_16k_chunk() {
      auto src = (const __m128i*)get_src_chunk();
      auto dst = (__m128i*)get_dst_chunk();
    
      // One 16 KiB chunk = 1024 vectors of 16 bytes, copied n vectors per pass.
      for (int32_t i = 0; i < (16 * 1024) / 16; i += n) {
        __m128i temp[n];
    
        for (int32_t j = 0; j < n; j++) {
          temp[j] = _mm_loadu_si128(src++);   // gather n vectors from the source
        }
    
        for (int32_t j = 0; j < n; j++) {
          _mm_storeu_si128(dst++, temp[j]);   // then write them to the destination
        }
      }
    }
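
    Each row in the table is then just an instantiation of this template, e.g. copy_16k_chunk<2>() for the 2-wide variant (the function name is mine, from the cleaned-up listing above).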
    

    These are my best guesses at explaining the results I got. Based on what I know about the Zen microarchitecture, it can only fetch 32 bytes per cycle. That's why we max out at the 2x 16-byte load/store:

    • The 1-wide variant loads its bytes through xmm registers, 128 bits at a time
    • The 2-wide variant's paired loads go through ymm registers, 256 bits at a time

    And that's why it's about twice as fast; internally this is exactly what memcpy does (or what it should be doing, if you enable the right optimizations for your platform).

    It is also impossible to make this faster, since we are now limited by the cache bandwidth, which doesn't go any faster. I think this is quite an important fact to point out, because if you are memory bound and looking for a faster solution, you will be looking for a very long time.
