Performance of x86 rep instructions on modern (pipelined/superscalar) processors

Submitted by 南楼画角 on 2019-11-28 17:57:10

There is a lot of space given to questions like this in both AMD's and Intel's optimization guides. The validity of advice in this area has a "half-life": different CPU generations behave differently. For example:

The Intel Architecture Optimization Manual gives performance comparison figures for various block copy techniques (including rep stosd) in Table 7-2, "Relative Performance of Memory Copy Routines", pg. 7-37f., for different CPUs; again, what's fastest on one might not be fastest on others.

In many cases, recent x86 CPUs (those with the SSE4.2 "string" operations) can do string operations via the SIMD unit; see this investigation.
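
As a rough sketch of what such a SIMD string routine can look like (the helper name below is mine, not from the linked investigation; it assumes it is safe to read up to 15 bytes past the terminator, which a production version avoids by first aligning the pointer, and it falls back to plain strlen() where SSE4.2 is unavailable):

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE4_2__)
#include <nmmintrin.h>
#endif

/* Hypothetical sketch of an SSE4.2 strlen using PCMPISTRI.
   Caveat: the 16-byte loads may read past the terminator; a real
   implementation aligns the pointer first to stay within the page. */
static size_t strlen_sse42(const char *s)
{
#if defined(__SSE4_2__)
    size_t len = 0;
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + len));
        /* EQUAL_EACH against a zero vector: returns the index of the
           first NUL byte in the chunk, or 16 if none is present. */
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx < 16)
            return len + idx;
        len += 16;
    }
#else
    return strlen(s);  /* portable fallback without SSE4.2 */
#endif
}
```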

To follow up on all of this (and/or to keep yourself updated when things inevitably change again), read Agner Fog's optimization guides/blogs.

In addition to FrankH's excellent answer, I'd like to point out that which method is best also depends on the length of the string, its alignment, and whether the length is fixed or variable.

For small strings (maybe up to about 16 bytes) doing it manually with simple instructions is probably faster, as it avoids the setup costs of more complex techniques (and for fixed size strings can be easily unrolled). For medium sized strings (maybe from 16 bytes to 4 KiB) something like "REP MOVSD" (with some "MOVSB" instructions thrown in if misalignment is possible) is likely to be best.
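
For the small fixed-size case, a minimal sketch (the helper copy16 is my illustration, not code from the answer; with a compile-time-constant length, memcpy compiles down to the two 8-byte loads and stores directly):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch: for a size known at compile time, two 8-byte
   moves avoid the setup cost of any general-purpose routine.  Using
   memcpy with a constant length lets the compiler emit exactly those
   moves without violating aliasing rules. */
static void copy16(void *dst, const void *src)
{
    uint64_t lo, hi;
    memcpy(&lo, src, 8);                    /* first 8 bytes  */
    memcpy(&hi, (const char *)src + 8, 8);  /* second 8 bytes */
    memcpy(dst, &lo, 8);
    memcpy((char *)dst + 8, &hi, 8);
}
```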

For anything larger than that, some people would be tempted to go into SSE/AVX and prefetching, etc. A better idea is to fix the caller(s) so that copying (or strlen() or whatever) isn't needed in the first place. If you try hard enough, you'll almost always find a way. Note: Also be very wary of "supposedly" fast memcpy() routines - typically they've been tested on massive strings and not on the far more likely tiny/small/medium strings.

Also note that (for the purpose of optimisation rather than convenience) due to all these differences (likely length, alignment, fixed or variable size, CPU type, etc) the idea of having one multi-purpose "memcpy()" for all of the very different cases is near-sighted.
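
The size-dependence above can be sketched as a dispatching copy routine; the names and the thresholds SMALL_LIMIT and MEDIUM_LIMIT below are illustrative placeholders, not measured cut-offs, which would have to be tuned per CPU:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off points for illustration only; the right values
   depend on CPU generation, alignment, etc. and must be benchmarked. */
#define SMALL_LIMIT  16
#define MEDIUM_LIMIT 4096

static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n <= SMALL_LIMIT) {
        /* small: a simple byte loop (or unrolled moves for fixed sizes)
           avoids the setup overhead of fancier techniques */
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dst;
    }
    if (n <= MEDIUM_LIMIT) {
        /* medium: something REP MOVS-like; the library memcpy stands in
           for it in this sketch */
        return memcpy(dst, src, n);
    }
    /* large: ideally restructure the caller so the copy is not needed;
       otherwise fall back to the library memcpy */
    return memcpy(dst, src, n);
}
```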

Since no one has given you any numbers, I'll give you some that I found by benchmarking my garbage collector, which is very memcpy-heavy. About 60% of the objects it copies are 16 bytes long; most of the remainder (about 30%) are roughly 500 - 8000 bytes.

  • Precondition: dst, src, and n are all multiples of 8.
  • Processor: AMD Phenom(tm) II X6 1090T, 64-bit Linux

Here are my three memcpy variants:

Hand-coded while-loop:

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    size_t n_ptrs = n / sizeof(ptr);
    ptr *end = dst + n_ptrs;
    while (dst < end) {
        *dst++ = *src++;
    }
}

(ptr is an alias for uintptr_t.) Time: 101.16%

rep movsb

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    asm volatile("cld\n\t"
                 "rep ; movsb"
                 : "=D" (dst), "=S" (src)
                 : "c" (n), "D" (dst), "S" (src)
                 : "memory");
}

Time: 103.22%

rep movsq

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    size_t n_ptrs = n / sizeof(ptr);
    asm volatile("cld\n\t"
                 "rep ; movsq"
                 : "=D" (dst), "=S" (src)
                 : "c" (n_ptrs), "D" (dst), "S" (src)
                 : "memory");
}

Time: 100.00%

rep movsq wins by a tiny margin.
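
For completeness, here is a minimal self-checking wrapper around the rep movsq variant (the function name copy_movsq and the read-write "+" constraint style are my additions; an x86-64 GCC/Clang target is assumed, with a portable fallback elsewhere):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef uintptr_t ptr;

/* Sketch of the rep movsq copy as a standalone function; the "+D"/"+S"/"+c"
   constraints tell the compiler that RDI, RSI and RCX are both read and
   modified by the string instruction. */
static void copy_movsq(ptr *dst, const ptr *src, size_t n_bytes)
{
#if defined(__x86_64__)
    size_t n_ptrs = n_bytes / sizeof(ptr);
    __asm__ volatile("cld\n\t"
                     "rep movsq"
                     : "+D" (dst), "+S" (src), "+c" (n_ptrs)
                     :
                     : "memory");
#else
    memcpy(dst, src, n_bytes);  /* portable fallback for non-x86-64 builds */
#endif
}
```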
