How to swap two __m128i variables in C++03 given its an opaque type and an array?

99封情书 提交于 2019-12-01 08:35:34

This doesn't sound like a best-practices issue; it sounds like you need a workaround for a seriously broken implementation of intrinsics. If __m128i tmp = a; doesn't compile, that's pretty bad.


If you're going to write a custom swap function, keep it simple. __m128i is a POD type that fits in a single vector register. Don't do anything that will encourage the compiler to spill it to memory. Some compilers will generate really horrible code even for a trivial test-case, and even gcc/clang might trip over a memcpy as part of optimizing a big complicated function.

Since the compiler is choking on the constructor, just declare a tmp variable with a normal initializer, and use = assignment to do the copying. That always works efficiently in any compiler that supports __m128i, and is a common pattern.

Plain assignment to/from values in memory works like _mm_store_si128 / _mm_load_si128: i.e. movdqa aligned stores/loads that will fault if used on unaligned addresses. (Of course, optimization can result in loads getting folded into memory operands to another vector instruction, or stores not happening at all.)

// alternate names: assignment_swap
// or swap128, but then the name doesn't fit for __m256i...

// __m128i t(a) errors, so just use simple initializers / assignment
template<class T>
void vecswap(T& a, T& b) {
    // T t = a;     // Apparently SunCC even choked on this
    T t;
    t = a;
    a = b;
    b = t;
}

Test cases: optimal code even with a crusty compiler like ICC13 which does a terrible job with the memcpy version. asm output from the Godbolt compiler explorer, with icc13 -O3

__m128i test_return2nd(__m128i x, __m128i y) {
    vecswap(x, y);
    return x;
}

    movdqa    xmm0, xmm1
    ret                    # returning the 2nd arg, which was in xmm1


__m128i test_return1st(__m128i x, __m128i y) {
    vecswap(x, y);
    return y;
}

    ret                   # returning the first arg, already in xmm0

With memswap, you get something like

return1st_memcpy(__m128i, __m128i):        ## ICC13 -O3
    movdqa    XMMWORD PTR [-56+rsp], xmm0
    movdqa    XMMWORD PTR [-40+rsp], xmm1    # spill both
    movaps    xmm2, XMMWORD PTR [-56+rsp]    # reload x
    movaps    XMMWORD PTR [-24+rsp], xmm2    # copy x to tmp
    movaps    xmm0, XMMWORD PTR [-40+rsp]    # reload y
    movaps    XMMWORD PTR [-56+rsp], xmm0    # copy y to x
    movaps    xmm0, XMMWORD PTR [-24+rsp]    # reload tmp
    movaps    XMMWORD PTR [-40+rsp], xmm0    # copy tmp to y
    movdqa    xmm0, XMMWORD PTR [-40+rsp]    # reload y
    ret                                      # return y

This is pretty much the absolute maximum amount of spilling/reloading you could imagine to swap two registers, because icc13 doesn't optimize between the inlined memcpys at all, or even remember what is left in a register.


Swapping values already in memory

Even gcc makes worse code with the memcpy version. It does the copy with 64bit integer loads/stores instead of a 128bit vector load/store. This is terrible if you're about to load the vector (store-forwarding stall), and otherwise is just bad (more uops to do the same work).

// the memcpy version of this compiles badly
void test_mem(__m128i *x, __m128i *y) {
    vecswap(*x, *y);
}
    # gcc 5.3 and ICC13 make the same code here, since it's easy to optimize
    movdqa  xmm0, XMMWORD PTR [rdi]
    movdqa  xmm1, XMMWORD PTR [rsi]
    movaps  XMMWORD PTR [rdi], xmm1
    movaps  XMMWORD PTR [rsi], xmm0
    ret

// gcc 5.3 with memswap instead of vecswap.  ICC13 is similar
test_mem_memcpy(long long __vector(2)*, long long __vector(2)*):
    mov     rax, QWORD PTR [rdi]
    mov     rdx, QWORD PTR [rdi+8]
    mov     r9, QWORD PTR [rsi]
    mov     r10, QWORD PTR [rsi+8]
    mov     QWORD PTR [rdi], r9
    mov     QWORD PTR [rdi+8], r10
    mov     QWORD PTR [rsi], rax
    mov     QWORD PTR [rsi+8], rdx
    ret

swap via memcpy?

#include <emmintrin.h>
#include <cstring>

template<class T>
void memswap(T& a, T& b)
{
    T t;
    std::memcpy(&t, &a, sizeof(t));
    std::memcpy(&a, &b, sizeof(t));
    std::memcpy(&b, &t, sizeof(t));
}

int main() {
    __m128i x;
    __m128i y;
    memswap(x, y);
    return 0;
}
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!