How to calculate a single-vector dot product using SSE intrinsic functions in C

攒了一身酷 2020-12-08 08:12

I am trying to multiply two vectors together where each element of one vector is multiplied by the element in the same index at the other vector. I then want to sum all the

4 answers
  • 2020-12-08 08:50

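    I'd say the fastest SSE method would be the following (note that _mm_movehdup_ps requires SSE3):

    #include <pmmintrin.h> // SSE3, for _mm_movehdup_ps

    static inline float CalcDotProductSse(__m128 x, __m128 y) {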
        __m128 mulRes, shufReg, sumsReg;
        mulRes = _mm_mul_ps(x, y);
    
        // Calculates the sum of SSE Register - https://stackoverflow.com/a/35270026/195787
        shufReg = _mm_movehdup_ps(mulRes);        // Broadcast elements 3,1 to 2,0
        sumsReg = _mm_add_ps(mulRes, shufReg);
        shufReg = _mm_movehl_ps(shufReg, sumsReg); // High Half -> Low Half
        sumsReg = _mm_add_ss(sumsReg, shufReg);
        return  _mm_cvtss_f32(sumsReg); // Result in the lower part of the SSE Register
    }
    

    I followed the answer to "Fastest Way to Do Horizontal Float Vector Sum On x86".

  • 2020-12-08 08:50

    I wrote this and compiled it with gcc -O3 -S -ftree-vectorize -ftree-vectorizer-verbose=2 sse.c

    void f(int * __restrict__ a, int * __restrict__ b, int * __restrict__ c, int * __restrict__ d,
           int * __restrict__ e, int * __restrict__ f, int * __restrict__ g, int * __restrict__ h,
           int * __restrict__ o)
    {
        int i;
    
        for (i = 0; i < 8; ++i)
            o[i] = a[i]*e[i] + b[i]*f[i] + c[i]*g[i] + d[i]*h[i];
    }
    

    And GCC 4.3.0 auto-vectorized it:

    sse.c:5: note: LOOP VECTORIZED.
    sse.c:2: note: vectorized 1 loops in function.
    

    However, it would only do that if I used a loop with enough iterations; otherwise the verbose output would report that vectorization was unprofitable or that the loop was too small. Without the __restrict__ keywords, it has to generate separate, non-vectorized versions to handle cases where the output o may point into one of the inputs.

    I would paste the instructions as an example, but since part of the vectorization unrolled the loop it's not very readable.
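
    For the floating-point dot product in the question, the same auto-vectorization approach could be sketched as a plain C loop. This is only a minimal sketch: the name dot_scalar is hypothetical, and as far as I know GCC vectorizes a floating-point reduction like this only when it is allowed to reorder the additions (for example with -ffast-math), because float addition is not associative.

```c
#include <stddef.h>

/* Hypothetical helper (not from the answer above): scalar dot product of two
   float arrays of length n. __restrict__ tells GCC the arrays do not alias. */
float dot_scalar(const float * __restrict__ a, const float * __restrict__ b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];   /* reduction the compiler may turn into SIMD multiply/add */
    return sum;
}
```

    Whether the loop actually gets vectorized depends on the compiler version and flags, so it is worth checking the vectorizer report or the generated assembly.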

  • 2020-12-08 08:50

    There is an article by Intel here which touches on dot-product implementations.

  • 2020-12-08 09:01

    If you're doing a dot-product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.
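
    A minimal sketch of that pattern, assuming plain float arrays with no alignment guarantee. The name dot_sse is hypothetical, and the sketch sticks to SSE1 intrinsics (separate multiply and add, no FMA):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, ... */

/* Hypothetical helper: dot product of two float arrays of arbitrary length. */
float dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();          /* four running partial sums */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);    /* unaligned loads: no alignment assumed */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));  /* vertical multiply, then add */
    }

    /* Single horizontal sum after the loop, using SSE1-only shuffles. */
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    float result = _mm_cvtss_f32(sums);

    for (; i < n; ++i)                      /* scalar tail for leftover elements */
        result += a[i] * b[i];
    return result;
}
```

    The sketch uses one accumulator for simplicity; for very long vectors, several independent accumulators would probably hide more of the add latency.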


    But if you are doing a dot product of just a single pair of SIMD vectors:

    GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1 level intrinsics, including the single and double-precision dot products:

    __m128  _mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
    __m128d _mm_dp_pd (__m128d __X, __m128d __Y, const int __M);
    

    On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.

    But on AMD (including Ryzen), dpps is significantly slower. (See Agner Fog's instruction tables)
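
    A minimal sketch of using the single-precision intrinsic for one pair of vectors. The wrapper name dot4_dpps is hypothetical, and the target attribute is a GCC/Clang extension assumed here so the function builds without passing -msse4.1 for the whole file:

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical wrapper: dot product of one pair of 4-float vectors.
   The target attribute (GCC/Clang extension) enables SSE4.1 for this function
   only; alternatively, drop it and compile the file with -msse4.1. */
__attribute__((target("sse4.1")))
float dot4_dpps(__m128 x, __m128 y)
{
    /* Mask 0xF1: the high nibble 0xF multiplies all four element pairs,
       the low nibble 0x1 writes the sum to element 0 only (the rest are zeroed). */
    return _mm_cvtss_f32(_mm_dp_ps(x, y, 0xF1));
}
```

    The mask must be a compile-time constant. The code only runs on CPUs that support SSE4.1, so a runtime check or the fallback is still needed for older processors.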


    As a fallback for older processors, you can use this algorithm to create the dot product of the vectors a and b:

    __m128 r1 = _mm_mul_ps(a, b);
    

    and then horizontally sum r1 as described in "Fastest way to do horizontal float vector sum on x86" (see that answer for a commented version of this code and an explanation of why it's faster):

    __m128 shuf   = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums   = _mm_add_ps(r1, shuf);
    shuf          = _mm_movehl_ps(shuf, sums);
    sums          = _mm_add_ss(sums, shuf);
    float result =  _mm_cvtss_f32(sums);
    

    A slower alternative uses the SSE3 _mm_hadd_ps. Each hadd costs 2 shuffles, which will easily bottleneck on shuffle throughput, especially on Intel CPUs.

    __m128 r2 = _mm_hadd_ps(r1, r1);
    __m128 r3 = _mm_hadd_ps(r2, r2);
    _mm_store_ss(&result, r3);
    