How can I use SIMD to accelerate XOR two blocks of memory?

Backend · open · 3 answers · 1215 views
心在旅途 · asked 2021-01-12 14:09

I want to XOR two blocks of memory as quickly as possible. How can I use SIMD to accelerate it?

My original code is below:

    void region_xor_w64(   uns
3 Answers
  • 2021-01-12 14:30

    Okay, if Intel CPUs prefer going forward and you prefer pointer operations over array indexing, then:

    void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int i)
    {
        while (i--)
            *r2++ ^= *r1++;
    }
    

    Mike
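
    A quick sanity check of this routine (the buffers and sizes here are made up for illustration): XORing the same key into a buffer twice restores the original contents, which is the classic round-trip property of XOR.

    ```c
    #include <assert.h>
    #include <string.h>

    /* Byte-wise XOR from the answer above: XOR i bytes of r1 into r2. */
    static void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int i)
    {
        while (i--)
            *r2++ ^= *r1++;
    }

    int main(void)
    {
        unsigned char key[8]  = {1, 2, 3, 4, 5, 6, 7, 8};  /* hypothetical key */
        unsigned char data[8] = "ABCDEFG";                 /* 7 chars + NUL */
        unsigned char orig[8];

        memcpy(orig, data, sizeof data);
        region_xor_w64(key, data, sizeof data);  /* first pass: scramble */
        assert(memcmp(orig, data, sizeof data) != 0);
        region_xor_w64(key, data, sizeof data);  /* second pass: restore */
        assert(memcmp(orig, data, sizeof data) == 0);
        return 0;
    }
    ```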

  • 2021-01-12 14:33

    As the size of the region is passed by value, why wouldn't the code be:

    void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int i)
    {
        while (i--)
            r2[i] = r1[i] ^ r2[i];
    }
    

    or even:

    void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int i)
    {
        while (i--)
            r2[i] ^= r1[i];
    }
    

    If there's a preference towards going forwards ('up memory') and for using pointers, then:

    void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int i)
    {
        while (i--)
            *r2++ ^= *r1++;
    }
    
  • 2021-01-12 14:52

    The more important question is why you would want to do it manually. Do you have an ancient compiler that you think you can outsmart? The good old days when you had to write SIMD instructions by hand are over. Today, in 99% of cases the compiler will do the job for you, and chances are it will do it a lot better. Also, don't forget that new architectures come out every once in a while with ever larger instruction sets. So ask yourself: do you want to maintain N copies of your implementation, one per platform? Do you want to constantly test your implementation to make sure it is worth maintaining? Most likely the answer is no.

    The only thing you need to do is write the simplest possible code. The compiler will do the rest. For instance, here is how I would write your function:

    void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int len)
    {
        unsigned int i;
        for (i = 0; i < len; ++i)
            r2[i] = r1[i] ^ r2[i];
    }
    

    A bit simpler, isn't it? And guess what: the compiler generates code that performs a 128-bit XOR using MOVDQU and PXOR. The critical path looks like this:

    4008a0:       f3 0f 6f 04 06          movdqu xmm0,XMMWORD PTR [rsi+rax*1]
    4008a5:       41 83 c0 01             add    r8d,0x1
    4008a9:       f3 0f 6f 0c 07          movdqu xmm1,XMMWORD PTR [rdi+rax*1]
    4008ae:       66 0f ef c1             pxor   xmm0,xmm1
    4008b2:       f3 0f 7f 04 06          movdqu XMMWORD PTR [rsi+rax*1],xmm0
    4008b7:       48 83 c0 10             add    rax,0x10
    4008bb:       45 39 c1                cmp    r9d,r8d
    4008be:       77 e0                   ja     4008a0 <region_xor_w64+0x40>
    

    As @Mysticial has pointed out, the code above uses instructions that support unaligned access, and those are slower. If, however, the programmer can correctly assume aligned access, it is possible to let the compiler know about it. For example:

    void region_xor_w64(unsigned char * restrict r1,
                        unsigned char * restrict r2,
                        unsigned int len)
    {
        unsigned char * restrict p1 = __builtin_assume_aligned(r1, 16);
        unsigned char * restrict p2 = __builtin_assume_aligned(r2, 16);
    
        unsigned int i;
        for (i = 0; i < len; ++i)
            p2[i] = p1[i] ^ p2[i];
    }
    

    The compiler generates the following for the above C code (notice movdqa):

    400880:       66 0f 6f 04 06          movdqa xmm0,XMMWORD PTR [rsi+rax*1]
    400885:       41 83 c0 01             add    r8d,0x1
    400889:       66 0f ef 04 07          pxor   xmm0,XMMWORD PTR [rdi+rax*1]
    40088e:       66 0f 7f 04 06          movdqa XMMWORD PTR [rsi+rax*1],xmm0
    400893:       48 83 c0 10             add    rax,0x10
    400897:       45 39 c1                cmp    r9d,r8d
    40089a:       77 e4                   ja     400880 <region_xor_w64+0x20>
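
    One caveat worth making explicit: the aligned variant invokes undefined behavior if a caller passes pointers that are not actually 16-byte aligned. A minimal sketch (buffer size chosen arbitrarily) of obtaining such buffers with C11 `aligned_alloc`:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 64;  /* hypothetical region size; aligned_alloc requires
                             the size to be a multiple of the alignment */
        unsigned char *r1 = aligned_alloc(16, len);
        unsigned char *r2 = aligned_alloc(16, len);
        assert(r1 && r2);
        assert(((uintptr_t)r1 % 16) == 0 && ((uintptr_t)r2 % 16) == 0);

        memset(r1, 0xAA, len);
        memset(r2, 0xFF, len);
        for (size_t i = 0; i < len; ++i)  /* same XOR; these buffers are safe
                                             to pass to the aligned version */
            r2[i] ^= r1[i];
        assert(r2[0] == (0xAA ^ 0xFF));   /* 0x55 */

        free(r1);
        free(r2);
        return 0;
    }
    ```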
    

    Tomorrow, when I buy myself a laptop with a Haswell CPU, the compiler will generate code that uses 256-bit instructions instead of 128-bit ones from the same source, giving me twice the vector performance. It would do so even if I didn't know Haswell was capable of it. You, on the other hand, would have to know about that feature, write yet another version of your code, and spend time testing it.

    By the way, it seems you also have a bug in your implementation: the code can skip up to 3 remaining bytes in the data vector.
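
    That kind of bug comes from processing the region a word at a time and ignoring the leftover `len % wordsize` bytes. A hedged sketch of the usual fix (this is my own illustration, not the asker's code): a word loop followed by a scalar tail, with `memcpy` used for the word accesses to avoid alignment and aliasing issues.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* Word-at-a-time XOR with a scalar tail so no trailing bytes are skipped. */
    static void region_xor(unsigned char *r1, unsigned char *r2, unsigned int len)
    {
        unsigned int i = 0;

        for (; i + 8 <= len; i += 8) {   /* 8 bytes per iteration */
            uint64_t a, b;
            memcpy(&a, r1 + i, 8);
            memcpy(&b, r2 + i, 8);
            b ^= a;
            memcpy(r2 + i, &b, 8);
        }
        for (; i < len; ++i)             /* handle the remaining 1..7 bytes */
            r2[i] ^= r1[i];
    }

    int main(void)
    {
        unsigned char a[11], b[11];      /* 8-byte word + 3 tail bytes */
        memset(a, 0x0F, sizeof a);
        memset(b, 0xF0, sizeof b);
        region_xor(a, b, sizeof b);
        for (unsigned int i = 0; i < sizeof b; ++i)
            assert(b[i] == 0xFF);        /* every byte, including the tail */
        return 0;
    }
    ```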

    At any rate, I recommend you trust your compiler and learn how to verify what it generates (i.e. get familiar with objdump). The next option would be to change the compiler. Only then start thinking about writing vector instructions manually. Otherwise you're gonna have a bad time!
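
    For completeness, if you do end up at that last step, a minimal sketch of what hand-written intrinsics look like for this problem, using SSE2 (baseline on x86-64; `_mm_loadu_si128`, `_mm_xor_si128` and `_mm_storeu_si128` from `<emmintrin.h>`), with a scalar fallback so the code still works elsewhere:

    ```c
    #include <assert.h>
    #include <string.h>
    #if defined(__SSE2__)
    #include <emmintrin.h>
    #endif

    /* XOR r1 into r2: 16 bytes per iteration with SSE2, scalar tail for the rest. */
    static void region_xor_sse2(unsigned char *r1, unsigned char *r2, unsigned int len)
    {
        unsigned int i = 0;
    #if defined(__SSE2__)
        for (; i + 16 <= len; i += 16) {
            __m128i a = _mm_loadu_si128((const __m128i *)(r1 + i));
            __m128i b = _mm_loadu_si128((const __m128i *)(r2 + i));
            _mm_storeu_si128((__m128i *)(r2 + i), _mm_xor_si128(a, b));
        }
    #endif
        for (; i < len; ++i)   /* tail bytes (or the whole buffer without SSE2) */
            r2[i] ^= r1[i];
    }

    int main(void)
    {
        unsigned char a[37], b[37];  /* deliberately not a multiple of 16 */
        memset(a, 0x55, sizeof a);
        memset(b, 0xAA, sizeof b);
        region_xor_sse2(a, b, sizeof b);
        for (unsigned int i = 0; i < sizeof b; ++i)
            assert(b[i] == 0xFF);
        return 0;
    }
    ```

    Note how even this "manual" version still needs the scalar tail loop, which is exactly the kind of detail the compiler's auto-vectorizer handles for you.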

    Hope it helps. Good Luck!
