问题
I have the following AVX and Native codes:
__forceinline double dotProduct_2(const double* u, const double* v)
{
_mm256_zeroupper();
__m256d xy = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m256d temp = _mm256_hadd_pd(xy, xy);
__m128d dotproduct = _mm_add_pd(_mm256_extractf128_pd(temp, 0), _mm256_extractf128_pd(temp, 1));
return dotproduct.m128d_f64[0];
}
__forceinline double dotProduct_1(const D3& a, const D3& b)
{
return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
And respective test scripts:
std::cout << res_1 << " " << res_2 << " " << res_3 << '\n';
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_1 += dotProduct_1(aVx[i % 10000], aVx[(i + 1) % 10000]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "NAIVE : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_2 += dotProduct_2(&aVx[i % 10000][0], &aVx[(i + 1) % 10000][0]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "AVX : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
std::cout << math::min2(zx_1, zx_2) << " " << zx_1 << " " << zx_2;
Well, all of the data are aligned by 32. (D3 with __declspec... and aVx arr with _mm_malloc()..) And, as i can see, native variant is equal/or faster than AVX variant. I can't understand it's nrmally behaviour ? Because i'm think that AVX is 'super FAST' ... If not, how i can optimize it ? I compile it on MSVC 2015(x64), with arch AVX. Also, my hardwre is intel i7 4750HQ(haswell)
回答1:
You add too much overhead with vzeroupper and hadd instructions. Good way to write it, is to do all multiplies in a loop and aggregate the result just once at the end. Imagine you unroll original loop 4 times and use 4 accumulators:
for(i=0; i < (1<<30); i+=4) {
s0 += a[i+0] * b[i+0];
s1 += a[i+1] * b[i+1];
s2 += a[i+2] * b[i+2];
s3 += a[i+3] * b[i+3];
}
return s0+s1+s2+s3;
And now just replace unrolled loop with SIMD mul and add (or even FMA intrinsic if available)
回答2:
Simple profiling with basic loops isn't a great idea - it usually just means you are memory bandwidth limited, so the tests end up coming out at about the same speed (memory is typically slower than the CPU, and that's basically all you are testing here).
As others have said, your code example isn't great, because you are constantly going across the lanes (which I assume is just to find the fastest dot product, and not specifically because a sum of all the dot products is the desired result?). To be honest, if you really need a fast dot product (for AOS data as presented here), I think I would prefer to replace the VHADDPD with a VADDPD + VPERMILPD (trading an additional instruction for twice the throughput, and a lower latency)
double dotProduct_3(const double* u, const double* v)
{
__m256d dp = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m128d a = _mm256_extractf128_pd(dp, 0);
__m128d b = _mm256_extractf128_pd(dp, 1);
__m128d c = _mm_add_pd(a, b);
__m128d yy = _mm_unpackhi_pd(c, c);
__m128d dotproduct = _mm_add_pd(c, yy);
return _mm_cvtsd_f64(dotproduct);
}
asm:
dotProduct_3(double const*, double const*):
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vextractf128 xmm1,ymm0,0x1
vaddpd xmm0,xmm1,xmm0
vpermilpd xmm1,xmm0,0x3
vaddpd xmm0,xmm1,xmm0
vzeroupper
ret
Generally speaking, if you are using horizontal adds, you're doing it wrong! Whilst a 256bit register may seem ideal for a Vector4d, it's not actually a particularly great representation (especially if you consider that AVX512 is now available!). A very similar question to this came up recently: For C++ Vector3 utility class implementations, is array faster than struct and class?
If you want performance, then structure-of-arrays is the best way to go.
struct HybridVec4SOA
{
__m256d x;
__m256d y;
__m256d z;
__m256d w;
};
__m256d dot(const HybridVec4SOA& a, const HybridVec4SOA& b)
{
return _mm256_fmadd_pd(a.w, b.w,
_mm256_fmadd_pd(a.z, b.z,
_mm256_fmadd_pd(a.y, b.y,
_mm256_mul_pd(a.x, b.x))));
}
asm:
dot(HybridVec4SOA const&, HybridVec4SOA const&):
vmovapd ymm1,YMMWORD PTR [rdi+0x20]
vmovapd ymm2,YMMWORD PTR [rdi+0x40]
vmovapd ymm3,YMMWORD PTR [rdi+0x60]
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vfmadd231pd ymm0,ymm1,YMMWORD PTR [rsi+0x20]
vfmadd231pd ymm0,ymm2,YMMWORD PTR [rsi+0x40]
vfmadd231pd ymm0,ymm3,YMMWORD PTR [rsi+0x60]
ret
If you compare the latencies (and more importantly throughput) of load/mul/fmadd compared to hadd and extract, and then consider that the SOA version is computing 4 dot products at a time (instead of 1), you'll start to understand why it's the way to go...
来源:https://stackoverflow.com/questions/46502748/why-avx-dot-product-slower-than-native-c-code