Enhanced REP MOVSB for memcpy

别跟我提以往 2020-11-22 02:04

I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.

ERMSB was introduced with the Ivy Bridge microarchitecture.

6 Answers
  •  难免孤独
    2020-11-22 02:33

    This is not an answer to the stated question(s), only my results (and personal conclusions) when trying to find out.

    In summary: GCC already optimizes memset()/memmove()/memcpy() (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). So, there is no reason to expect massive gains by using your own variant with GCC (unless you've forgotten important stuff like alignment attributes for your aligned data, or do not enable sufficiently specific optimizations like -O2 -march= -mtune=). If you agree, then the answers to the stated question are more or less irrelevant in practice.

    (I only wish there was a memrepeat(), a sibling to memcpy() and memmove(), that would repeat the initial part of a buffer to fill the entire buffer.)
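
    A minimal sketch of what such a hypothetical memrepeat() could look like, doubling the filled prefix so that no memcpy() call ever overlaps its destination:

    #include <string.h>

    /* Hypothetical sketch: fill the rest of buf (size bytes total) by
       repeating its first filled bytes; filled must be nonzero. Since
       each chunk is at most as long as the already-complete prefix,
       no memcpy() source and destination ever overlap. */
    static void memrepeat(void *buf, size_t size, size_t filled)
    {
        char *p = buf;
        while (filled && filled < size) {
            size_t chunk = (filled <= size - filled) ? filled : size - filled;
            memcpy(p + filled, p, chunk);
            filled += chunk;
        }
    }

    (For example, after writing a 4-byte pattern at the start of buf, memrepeat(buf, size, 4) would tile it across the whole buffer.)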


    I currently have a Skylake machine in use (Core i5-6200U laptop, Linux 4.4.0 x86-64 kernel, with erms in /proc/cpuinfo flags). Because I wanted to find out whether I could find a case where a custom memcpy() variant based on rep movsb would outperform a straightforward memcpy(), I wrote an overly complicated benchmark.

    The core idea is that the main program allocates three large memory areas: original, current, and correct, each exactly the same size, and at least page-aligned. The copy operations are grouped into sets, with each set having distinct properties, like all sources and targets being aligned (to some number of bytes), or all lengths being within the same range. Each set is described using an array of (src, dst, n) triplets, where each range src to src+n-1 and dst to dst+n-1 lies completely within the current area.
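
    In code, one such set might be described like this (a sketch; the type and field names are mine, not taken from the actual benchmark):

    #include <stddef.h>

    /* Sketch of one copy operation; offsets are relative to the start
       of the current area. */
    struct copy_op {
        size_t  src;   /* source offset */
        size_t  dst;   /* destination offset */
        size_t  n;     /* bytes to copy */
    };

    struct copy_set {
        size_t           count;
        struct copy_op  *ops;   /* all ranges lie within the area */
    };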

    A Xorshift* PRNG is used to initialize original to random data. (Like I warned above, this is overly complicated, but I wanted to ensure I'm not leaving any easy shortcuts for the compiler.) The correct area is obtained by copying the original data to current, applying all the triplets in the current set using the C library memcpy(), and then copying the current area to correct. This allows each benchmarked function to be verified to behave correctly.
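
    For reference, an xorshift64* generator is only a few lines; this sketch uses the standard shift and multiplier constants, not necessarily the exact variant used here:

    #include <stdint.h>

    /* xorshift64* sketch (standard constants). State must be nonzero. */
    static uint64_t prng_state = 0x9E3779B97F4A7C15ull;  /* arbitrary seed */

    static uint64_t xorshift64s(void)
    {
        uint64_t x = prng_state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        prng_state = x;
        return x * 0x2545F4914F6CDD1Dull;
    }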

    Each set of copy operations is timed a large number of times using the same function, and the median of those timings is used for comparison. (In my opinion, the median makes the most sense in benchmarking, and provides sensible semantics -- the function is at least that fast at least half the time.)
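
    A sketch of that measurement loop, using clock_gettime(CLOCK_MONOTONIC) and the copy_set descriptor sketched above (the real benchmark is more involved):

    #include <stdlib.h>
    #include <time.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x < y) ? -1 : (x > y);
    }

    /* Sketch: run one candidate over a whole set 'reps' times and
       return the median wall-clock duration in seconds. */
    static double median_seconds(void (*fn)(void *, const void *, size_t),
                                 const struct copy_set *set, char *area,
                                 double *samples, size_t reps)
    {
        for (size_t r = 0; r < reps; r++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t i = 0; i < set->count; i++)
                fn(area + set->ops[i].dst, area + set->ops[i].src,
                   set->ops[i].n);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            samples[r] = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        }
        qsort(samples, reps, sizeof samples[0], cmp_double);
        return samples[reps / 2];
    }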

    To avoid compiler optimizations, I have the program load the functions and benchmarks dynamically, at run time. The functions all have the same form, void function(void *, const void *, size_t) -- note that unlike memcpy() and memmove(), they return nothing. The benchmarks (named sets of copy operations) are generated dynamically by a function call (that takes the pointer to the current area and its size as parameters, among others).
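
    The loading itself is a standard dlopen()/dlsym() affair; a minimal sketch (the file and symbol names here are made up):

    #include <dlfcn.h>
    #include <stdio.h>

    typedef void (*copy_fn)(void *, const void *, size_t);

    /* Sketch: load one candidate function from a shared object that
       was compiled separately, so the main program cannot inline it. */
    static copy_fn load_copy_fn(const char *sofile, const char *symbol)
    {
        void *handle = dlopen(sofile, RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
            fprintf(stderr, "%s\n", dlerror());
            return NULL;
        }
        return (copy_fn)dlsym(handle, symbol);
    }

    The candidate functions go into that separate shared object; remember to link the benchmark itself with -ldl.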

    Unfortunately, I have not yet found any set where

    static void rep_movsb(void *dst, const void *src, size_t n)
    {
        /* REP MOVSB copies RCX bytes from [RSI] to [RDI]; the x86-64
           SysV ABI guarantees the direction flag is clear on entry. */
        __asm__ __volatile__ ( "rep movsb\n\t"
                             : "+D" (dst), "+S" (src), "+c" (n)
                             :
                             : "memory" );
    }
    

    would beat

    static void normal_memcpy(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);
    }
    

    when compiled with gcc -Wall -O2 -march=ivybridge -mtune=ivybridge (GCC 5.4.0) on the aforementioned Core i5-6200U laptop running a 64-bit Linux 4.4.0 kernel. Copying 4096-byte aligned and sized chunks comes close, however.

    This means that at least thus far, I have not found a case where using a rep movsb memcpy variant would make sense. It does not mean there is no such case; I just haven't found one.

    (At this point the code is a spaghetti mess I'm more ashamed than proud of, so I shall omit publishing the sources unless someone asks. The above description should be enough to write a better one, though.)


    This does not surprise me much, though. The C compiler can infer a lot of information about the alignment of the operand pointers, and about whether the number of bytes to copy is a compile-time constant or a multiple of a suitable power of two. This information can, and will/should, be used by the compiler to replace the C library memcpy()/memmove() functions with its own.

    GCC does exactly this (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). Indeed, memcpy()/memset()/memmove() have already been separately optimized for quite a few x86 processor variants; it would surprise me quite a bit if the GCC developers had not already included erms support.

    GCC provides several function attributes that developers can use to help it generate good code. For example, assume_aligned (n) tells GCC that the function returns memory aligned to at least n bytes, while alloc_align (n) says the alignment is given by the function's n-th argument. An application or a library can choose which implementation of a function to use at run time by creating a "resolver function" (that returns a function pointer) and defining the function using the ifunc (resolver) attribute.
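
    A sketch of that mechanism, reusing rep_movsb() and normal_memcpy() from above (my_copy and resolve_my_copy are made-up names; ifunc needs a GNU toolchain and glibc):

    #include <cpuid.h>
    #include <stddef.h>

    /* Resolver: runs once, during relocation, and picks the
       implementation my_copy() stays bound to. ERMS is reported in
       CPUID.(EAX=7,ECX=0):EBX bit 9, which is where the erms flag in
       /proc/cpuinfo comes from. */
    static void (*resolve_my_copy(void))(void *, const void *, size_t)
    {
        unsigned int eax, ebx, ecx, edx;
        if (__get_cpuid(0, &eax, &ebx, &ecx, &edx) && eax >= 7) {
            __cpuid_count(7, 0, eax, ebx, ecx, edx);
            if (ebx & (1u << 9))            /* ERMS supported */
                return rep_movsb;
        }
        return normal_memcpy;
    }

    void my_copy(void *dst, const void *src, size_t n)
         __attribute__((ifunc ("resolve_my_copy")));

    After that, callers just call my_copy(); the dynamic linker patches in whichever implementation the resolver chose.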

    One of the most common patterns I use in my code for this is

    some_type *pointer = __builtin_assume_aligned(ptr, alignment);
    

    where ptr is some pointer and alignment is the number of bytes it is aligned to; GCC then knows/assumes that pointer is aligned to alignment bytes.
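
    For example, a loop like the following lets GCC emit aligned vector loads and stores when it vectorizes (a sketch; scale() is just an illustration):

    #include <stddef.h>

    /* Sketch: promising 32-byte alignment of both pointers lets GCC
       use aligned vector accesses for this loop. */
    void scale(float *dst_in, const float *src_in, size_t n)
    {
        float *dst = __builtin_assume_aligned(dst_in, 32);
        const float *src = __builtin_assume_aligned(src_in, 32);
        for (size_t i = 0; i < n; i++)
            dst[i] = 2.0f * src[i];
    }

    Without the two __builtin_assume_aligned() lines, GCC has to emit unaligned accesses or a runtime alignment peel loop.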

    Another useful built-in, albeit much harder to use correctly, is __builtin_prefetch(). To maximize overall bandwidth/efficiency, I have found that minimizing the latencies in each sub-operation yields the best results. (For copying scattered elements to consecutive temporary storage, this is difficult, as prefetching typically involves a full cache line; if too many elements are prefetched, most of the cache is wasted by storing unused items.)
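
    A sketch of that scattered-element case (gather() and the prefetch distance are illustrative, not from my code):

    #include <stddef.h>

    /* Sketch: gather scattered elements into consecutive storage,
       prefetching a few indices ahead so the random-access loads
       overlap. The distance (here 8) is a tuning knob; each prefetch
       pulls in a full cache line, which is the waste described above. */
    void gather(double *dst, const double *src,
                const size_t *idx, size_t n)
    {
        const size_t ahead = 8;
        for (size_t i = 0; i < n; i++) {
            if (i + ahead < n)
                __builtin_prefetch(&src[idx[i + ahead]], 0, 0);
            dst[i] = src[idx[i]];
        }
    }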
