AVX2: Computing the dot product of two 512-element float arrays
Question: I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU that supports AVX2 (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product of two std::vector<float> of size 512. I have done some digging online and found this and this, and this Stack Overflow question suggests using the following function: __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);. However, these