Performance of x86 rep instructions on modern (pipelined/superscalar) processors

耶瑟儿~ 2020-12-13 04:46

I've been writing in x86 assembly lately (for fun) and was wondering whether or not rep-prefixed string instructions actually have a performance edge on modern processors.

3 Answers
  • 2020-12-13 05:25

    There is a lot of space given to questions like this in both AMD's and Intel's optimization guides. The validity of advice in this area has a "half-life": different CPU generations behave differently. For example:

    • AMD Software Optimization Guide (Sep/2005), section 8.3, pg. 167:
      Avoid using the REP prefix when performing string operations, especially when copying blocks of memory.
    • AMD Software Optimization Guide (Apr/2011), section 9.3, pg. 148:
      Use the REP prefix judiciously when performing string operations.

    The Intel Architecture Optimization Manual gives performance comparison figures for various block copy techniques (including rep stosd) in Table 7-2, "Relative Performance of Memory Copy Routines" (pg. 7-37 f.), for different CPUs; again, what's fastest on one might not be fastest on others.
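
    As a concrete illustration of the kind of routine those tables compare, here is a minimal rep stosq fill in GNU inline asm (x86-64 only; the function name and signature are my own, not taken from the manual):

    #include <stddef.h>

    /* Store 'value' into 'qwords' consecutive 8-byte words at dst using
       REP STOSQ.  RDI and RCX are advanced/decremented by the instruction,
       so they are read-write operands; RAX holds the fill pattern.  The
       direction flag is assumed clear, as the SysV ABI guarantees. */
    static void rep_stosq_fill(void *dst, unsigned long long value, size_t qwords)
    {
        asm volatile("rep stosq"
                     : "+D" (dst), "+c" (qwords)
                     : "a" (value)
                     : "memory");
    }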

    In many cases, recent x86 CPUs (which have the "string" SSE4.2 instructions) can do string operations via the SIMD unit; see this investigation.
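
    For a flavour of what such a SIMD string routine looks like, here is a minimal SSE4.2 strlen sketch built on PCMPISTRI (the function name sse42_strlen and the alignment handling are my own assumptions, not taken from the linked investigation; compile with -msse4.2):

    #include <nmmintrin.h>   /* SSE4.2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    static size_t sse42_strlen(const char *s)
    {
        const char *p = s;

        /* Scan byte-by-byte up to a 16-byte boundary so that the aligned
           loads below never cross into an unmapped page. */
        while (((uintptr_t)p & 15) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }

        const __m128i zero = _mm_setzero_si128();
        for (;;) {
            __m128i chunk = _mm_load_si128((const __m128i *)p);
            /* Index of the first zero byte in the chunk, or 16 if the
               chunk contains no zero byte. */
            int idx = _mm_cmpistri(zero, chunk,
                                   _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                                   _SIDD_LEAST_SIGNIFICANT);
            if (idx < 16)
                return (size_t)(p - s) + (size_t)idx;
            p += 16;
        }
    }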

    To follow up on all this (and/or keep yourself updated when things change again, inevitably), read Agner Fog's Optimization guides/blogs.

  • 2020-12-13 05:29

    Since no one has given you any numbers, I'll give you some that I found by benchmarking my garbage collector, which is very memcpy-heavy. 60% of the objects to be copied are 16 bytes in length, and the remaining ~30% are 500-8000 bytes or so.

    • Precondition: dst, src, and n are all multiples of 8.
    • Processor: AMD Phenom(tm) II X6 1090T Processor 64bit/linux

    Here are my three memcpy variants:

    Hand-coded while-loop:

    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        size_t n_ptrs = n / sizeof(ptr);
        ptr *end = dst + n_ptrs;
        while (dst < end) {
            *dst++ = *src++;
        }
    }
    

    (ptr is an alias to uintptr_t). Time: 101.16%

    rep movsb

    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        asm volatile("cld\n\t"
                     "rep ; movsb"
                     : "=D" (dst), "=S" (src)
                     : "c" (n), "D" (dst), "S" (src)
                     : "memory");
    }
    

    Time: 103.22%

    rep movsq

    if (n == 16) {
        *dst++ = *src++;
        *dst++ = *src++;
    } else {
        size_t n_ptrs = n / sizeof(ptr);
        asm volatile("cld\n\t"
                     "rep ; movsq"
                     : "=D" (dst), "=S" (src)
                     : "c" (n_ptrs), "D" (dst), "S" (src)
                     : "memory");
    }
    

    Time: 100.00%

    rep movsq wins by a tiny margin.

  • 2020-12-13 05:36

    In addition to FrankH's excellent answer, I'd like to point out that which method is best also depends on the length of the string, its alignment, and whether the length is fixed or variable.

    For small strings (maybe up to about 16 bytes), doing it manually with simple instructions is probably faster, as it avoids the setup costs of more complex techniques (and for fixed-size strings the copy can easily be unrolled). For medium-sized strings (maybe from 16 bytes to 4 KiB), something like "REP MOVSD" (with some "MOVSB" instructions thrown in if misalignment is possible) is likely to be best.
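
    As a rough sketch of that medium-size strategy (GNU inline asm, x86-64; the function name and the exact dword/byte split are my own assumptions, not a tuned fast path):

    #include <stddef.h>

    static void rep_movsd_copy(void *dst, const void *src, size_t n)
    {
        size_t dwords = n / 4;   /* bulk of the copy as 4-byte moves    */
        size_t tail   = n % 4;   /* misaligned remainder as byte moves  */

        /* movsl is the AT&T mnemonic for the MOVSD string instruction.
           RDI, RSI and RCX are all updated, hence the read-write
           constraints; the tail count is loaded into RCX by hand. */
        asm volatile("cld\n\t"
                     "rep movsl\n\t"
                     "mov %[tail], %%rcx\n\t"
                     "rep movsb"
                     : "+D" (dst), "+S" (src), "+c" (dwords)
                     : [tail] "r" (tail)
                     : "memory");
    }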

    For anything larger than that, some people would be tempted to go for SSE/AVX and prefetching, etc. A better idea is to fix the caller(s) so that the copying (or strlen() or whatever) isn't needed in the first place. If you try hard enough, you'll almost always find a way. Note: also be very wary of "supposedly fast" memcpy() routines - typically they've been tested on massive strings and not on the far more likely tiny/small/medium strings.

    Also note that (for the purpose of optimisation rather than convenience) due to all these differences (likely length, alignment, fixed or variable size, CPU type, etc) the idea of having one multi-purpose "memcpy()" for all of the very different cases is near-sighted.
