How to get data out of AVX registers?

Using MSVC 2013 and AVX 1, I've got 8 floats in a register:

__m256 foo = _mm256_fmadd_ps(a,b,c);

Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX intrinsics would make this rather complicated:

print(_castu32_f32(_mm256_extract_epi32(foo, 0)));
print(_castu32_f32(_mm256_extract_epi32(foo, 1)));
print(_castu32_f32(_mm256_extract_epi32(foo, 2)));
// ...

but MSVC has neither of these two intrinsics. Sure, I could write the values back to memory and load them from there, but I suspect that at the assembly level there's no need to spill a register.

Bonus Q: I'd of course like to write

for (int i = 0; i != 8; ++i)
    print(_castu32_f32(_mm256_extract_epi32(foo, i)));

but MSVC doesn't understand that many intrinsics require loop unrolling. How do I write a loop over the eight 32-bit floats in __m256 foo?

Careful: _mm256_fmadd_ps isn't part of AVX1. FMA3 has its own feature bit, and was only introduced on Intel with Haswell. AMD introduced FMA3 with Piledriver (AVX1+FMA4+FMA3, no AVX2).

At the asm level, if you want to get eight 32-bit elements into integer registers, it is actually faster to store to the stack and then do scalar loads. pextrd is a 2-uop instruction on SnB-family and Bulldozer-family CPUs (and on Nehalem and Silvermont, which don't support AVX).

The only CPU where vextractf128 + 2x movd + 6x pextrd isn't terrible is AMD Jaguar (cheap pextrd, and only one load port). See Agner Fog's instruction tables.

A wide aligned store can forward to overlapping narrow loads. (Of course, you can use movd to get the low element, so you have a mix of load port and ALU port uops).

Of course, you seem to be extracting floats by using an integer extract and then converting it back to a float. That seems horrible.

What you actually need is each float in the low element of its own xmm register. vextractf128 is obviously the way to start, bringing element 4 to the bottom of a new xmm reg. Then 6x AVX shufps can easily get the other three elements of each half. (Or movshdup and movhlps have shorter encodings: no immediate byte).

7 shuffle uops are worth considering vs. 1 store and 7 load uops, but not if you were going to spill the vector for a function call anyway.
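As a rough sketch of the shuffle route described above, the following uses one vextractf128 plus three shuffles per 128-bit half (movshdup and movhlps), seven shuffles in total. The names SKETCH_AVX, unpack4, unpack8 and unpack8_from_memory are hypothetical and not part of any library. The macro is only an assumption so that GCC and Clang can build the sketch without -mavx. The results are written to an array here purely so they can be inspected; in real code each scalar would be passed on directly.

```cpp
#include <immintrin.h>

// Assumption: lets GCC/Clang compile AVX code without -mavx; MSVC needs nothing.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_AVX __attribute__((target("avx")))
#else
#define SKETCH_AVX
#endif

// Hypothetical helper: bring each float of one 128-bit half to the low
// element of its own register, using three shuffles without immediates.
SKETCH_AVX inline void unpack4(__m128 x, float* out)
{
    __m128 odd = _mm_movehdup_ps(x);       // (x3, x3, x1, x1), low element = x1
    __m128 hi  = _mm_movehl_ps(x, x);      // (x3, x2, x3, x2), low element = x2
    __m128 top = _mm_movehl_ps(odd, odd);  // (x3, x3, x3, x3), low element = x3
    out[0] = _mm_cvtss_f32(x);             // low element is already in place
    out[1] = _mm_cvtss_f32(odd);
    out[2] = _mm_cvtss_f32(hi);
    out[3] = _mm_cvtss_f32(top);
}

// Hypothetical helper: split the 256-bit vector, then unpack both halves.
SKETCH_AVX void unpack8(__m256 v, float* out)
{
    unpack4(_mm256_castps256_ps128(v), out);        // low half: a cast, no shuffle
    unpack4(_mm256_extractf128_ps(v, 1), out + 4);  // high half: vextractf128
}

// Hypothetical driver with a pointer interface, so callers built without
// AVX never pass a __m256 by value.
SKETCH_AVX void unpack8_from_memory(const float* src, float* out)
{
    unpack8(_mm256_loadu_ps(src), out);
}
```

Whether this beats a store followed by scalar loads depends on the surrounding code, as discussed above.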

ABI considerations:

You're on Windows, where xmm6-15 are call-preserved (only the low128; the upper halves of ymm6-15 are call-clobbered). This is yet another reason to start with vextractf128.

In the SysV ABI, all the xmm / ymm / zmm registers are call-clobbered, so every call to print() requires a spill/reload. The only sane thing to do there is store to memory and call print with the original vector (i.e. print the low element, because it will ignore the rest of the register). Then movss xmm0, [rsp+4] and call print on the 2nd element, etc.

It does you no good to get all 8 floats nicely unpacked into 8 vector regs, because they'd all have to be spilled separately anyway before the first function call!

Assuming you only have AVX (i.e. no AVX2), you could do something like this:

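// Note: _MM_EXTRACT_FLOAT needs its index as a compile-time constant (it maps
// to an instruction with an immediate operand), so this wrapper depends on
// being inlined with a constant i.
float extract_float(const __m128 v, const int i)
{
    float x;
    _MM_EXTRACT_FLOAT(x, v, i);
    return x;
}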

void print(const __m128 v)
{
    print(extract_float(v, 0));
    print(extract_float(v, 1));
    print(extract_float(v, 2));
    print(extract_float(v, 3));
}

void print(const __m256 v)
{
    print(_mm256_extractf128_ps(v, 0));
    print(_mm256_extractf128_ps(v, 1));
}

However I think I would probably just use a union:

union U256f {
    __m256 v;
    float a[8];
};

void print(const __m256 v)
{
    const U256f u = { v };

    for (int i = 0; i < 8; ++i)
        print(u.a[i]);
}
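ISO C++ does not define the result of reading a union member other than the one last written, although major compilers generally accept it. A variant that avoids the question is to store the vector to a plain array and read that back. The sketch below assumes a stand-in print(float) that merely records its argument; print_all, print_from_memory, printed and SKETCH_AVX are hypothetical names, and the macro is only there so GCC and Clang can build it without -mavx.

```cpp
#include <immintrin.h>
#include <vector>

// Assumption: lets GCC/Clang compile AVX code without -mavx; MSVC needs nothing.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_AVX __attribute__((target("avx")))
#else
#define SKETCH_AVX
#endif

// Stand-in for the question's print(float): it only records what it receives.
static std::vector<float> printed;
inline void print(float x) { printed.push_back(x); }

// Store the vector once, then hand the eight floats to print() one by one.
SKETCH_AVX void print_all(__m256 v)
{
    float tmp[8];
    // Unaligned store keeps the sketch portable; an aligned buffer with
    // _mm256_store_ps is an option where the compiler supports alignas(32).
    _mm256_storeu_ps(tmp, v);
    for (int i = 0; i < 8; ++i)
        print(tmp[i]);
}

// Hypothetical driver with a pointer interface, so callers built without
// AVX never pass a __m256 by value.
SKETCH_AVX void print_from_memory(const float* p)
{
    print_all(_mm256_loadu_ps(p));
}
```

With optimization enabled, compilers typically turn this into the same single store plus scalar loads that the union version produces.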

Peter Cordes

(Unfinished answer. Posting anyway in case it helps anyone, or in case I come back to it. Generally if you need to interface with scalar that you can't vectorize, it's not bad to just store a vector to a local array, and then reload it one element at a time.)

See my other answer for asm details. This answer is about the C++ side of things.

Using Agner Fog's Vector Class Library, his wrapper classes overload operator[] to work exactly the way you'd expect, even for non-constant args. This often compiles to a store/reload, but it makes it easy to write the code in C++. With optimization enabled, you'll probably get decent results. (except the low element might get stored/reloaded, instead of just getting used in place. So you might need to special-case vec[0] into _mm_cvtss_f32(vec) or something.)
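To illustrate the idea of an indexable wrapper, here is a minimal sketch of a class whose operator[] does a store/reload and which special-cases the low element. This is not VCL's code: Vec8, low() and SKETCH_AVX are hypothetical names invented for this example, and the macro is only an assumption so GCC and Clang can build it without -mavx.

```cpp
#include <immintrin.h>
#include <cassert>

// Assumption: lets GCC/Clang compile AVX code without -mavx; MSVC needs nothing.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_AVX __attribute__((target("avx")))
#else
#define SKETCH_AVX
#endif

// Hypothetical wrapper around __m256 (not Agner Fog's Vec8f).
class Vec8 {
public:
    // Load eight floats from (possibly unaligned) memory.
    SKETCH_AVX explicit Vec8(const float* p) : v(_mm256_loadu_ps(p)) {}

    // Runtime-indexed element access via store/reload.
    SKETCH_AVX float operator[](int i) const
    {
        assert(i >= 0 && i < 8);
        float tmp[8];
        _mm256_storeu_ps(tmp, v);
        return tmp[i];
    }

    // Low element without going through memory.
    SKETCH_AVX float low() const
    {
        return _mm_cvtss_f32(_mm256_castps256_ps128(v));
    }

private:
    __m256 v;
};
```

A caller would write something like Vec8 x(ptr); float f = x[i]; and use x.low() when only element 0 is needed.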

See also my github repo with mostly-untested changes to Agner's VCL, to generate better code for some functions.

There's a _MM_EXTRACT_FLOAT wrapper macro, but it's weird and only defined with SSE4.1. I think it's intended to go with SSE4.1 extractps (which can extract the binary representation of a float into an integer register, or store to memory). gcc does compile it into an FP shuffle when the destination is a float, though. Be careful that other compilers don't compile it to an actual extractps instruction if you want the result as a float, because that's not what extractps does. (That is what insertps does, but a simpler FP shuffle would take fewer instruction bytes, e.g. shufps with AVX is great.)

It's weird because it takes 3 args: _MM_EXTRACT_FLOAT(dest, src_m128, idx), so you can't even use it as an initializer for a float local.
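One way around the three-argument form is a small function template that returns the float, so it can initialize a local directly. The sketch below is an assumption of how such a helper might look, not an existing intrinsic: the name extract is hypothetical, and it uses only SSE1 (shufps plus _mm_cvtss_f32) with the index as a template parameter.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: return element I of v as a float.
// I is a template parameter, so the shuffle immediate is always a constant.
template <int I>
inline float extract(__m128 v)
{
    static_assert(I >= 0 && I < 4, "element index must be 0..3");
    // Broadcast element I to every position, then read the low element.
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(I, I, I, I)));
}
```

Usage then reads naturally, e.g. const float third = extract<2>(v);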

To loop over a vector

gcc will unroll a loop like that for you, but only with -O1 or higher. At -O0, it will give you an error message.

float bad_hsum(__m128 & fv) {
    float sum = 0;
    for (int i=0 ; i<4 ; i++) {
        float f;
        _MM_EXTRACT_FLOAT(f, fv, i);  // works only with -O1 or higher
        sum += f;
    }
    return sum;
}
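If the code also has to build without optimization, the unrolling can be done by the template machinery instead of the optimizer, so every index is a constant by construction. This is only a sketch of that mechanism: extract, SumLanes and hsum_unrolled are hypothetical names, and like bad_hsum above it is not meant as an efficient horizontal sum.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: return element I of v (I is a compile-time constant).
template <int I>
inline float extract(__m128 v)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(I, I, I, I)));
}

// Compile-time "loop": SumLanes<I> adds element I to the sum of the
// remaining elements, and the specialization for 4 ends the recursion.
template <int I>
struct SumLanes {
    static float run(__m128 v) { return extract<I>(v) + SumLanes<I + 1>::run(v); }
};

template <>
struct SumLanes<4> {
    static float run(__m128) { return 0.0f; }
};

// Sum of the four floats in v, unrolled at compile time.
inline float hsum_unrolled(__m128 v)
{
    return SumLanes<0>::run(v);
}
```

The same pattern extends to eight elements by splitting a __m256 into its two halves first.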

    float valueAVX(__m256 a, int i){

        float ret = 0;
        switch (i){

            case 0:
//                 a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 0)      ( a3, a2, a1, a0 )
// cvtss_f32             a0 

                ret = _mm_cvtss_f32(_mm256_extractf128_ps(a, 0));
                break;
            case 1: {
//                     a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 0)     lo = ( a3, a2, a1, a0 )
// shuffle(lo, lo, 1)      ( - , - , - , a1 )
// cvtss_f32                 a1 
                __m128 lo = _mm256_extractf128_ps(a, 0);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(lo, lo, 1));
            }
                break;
            case 2: {
//                   a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 0)   lo = ( a3, a2, a1, a0 )
// movehl(lo, lo)        ( - , - , a3, a2 )
// cvtss_f32               a2 
                __m128 lo = _mm256_extractf128_ps(a, 0);
                ret = _mm_cvtss_f32(_mm_movehl_ps(lo, lo));
            }
                break;
            case 3: {
//                   a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 0)   lo = ( a3, a2, a1, a0 )
// shuffle(lo, lo, 3)    ( - , - , - , a3 )
// cvtss_f32               a3 
                __m128 lo = _mm256_extractf128_ps(a, 0);                    
                ret = _mm_cvtss_f32(_mm_shuffle_ps(lo, lo, 3));
            }
                break;

            case 4:
//                 a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 1)      ( a7, a6, a5, a4 )
// cvtss_f32             a4 
                ret = _mm_cvtss_f32(_mm256_extractf128_ps(a, 1));
                break;
            case 5: {
//                     a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 1)     hi = ( a7, a6, a5, a4 )
// shuffle(hi, hi, 1)      ( - , - , - , a5 )
// cvtss_f32                 a5 
                __m128 hi = _mm256_extractf128_ps(a, 1);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(hi, hi, 1));
            }
                break;
            case 6: {
//                   a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 1)   hi = ( a7, a6, a5, a4 )
// movehl(hi, hi)        ( - , - , a7, a6 )
// cvtss_f32               a6 
                __m128 hi = _mm256_extractf128_ps(a, 1);
                ret = _mm_cvtss_f32(_mm_movehl_ps(hi, hi));
            }
                break;
            case 7: {
//                   a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 1)   hi = ( a7, a6, a5, a4 )
// shuffle(hi, hi, 3)    ( - , - , - , a7 )
// cvtss_f32               a7 
                __m128 hi = _mm256_extractf128_ps(a, 1);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(hi, hi, 3));
            }
                break;
        }

        return ret;
    }

Source: https://stackoverflow.com/questions/37612455/how-to-get-data-out-of-avx-registers

Tags: c++, visual-c++, avx, fma