avx

Conditional SSE/AVX add or zero elements based on compare

Submitted by 此生再无相见时 on 2021-02-04 21:40:18
Question: I have the following __m128 vectors: v_weight and v_entropy. I need to add v_entropy to v_weight only where the elements of v_weight are not 0.0f. Obviously _mm_add_ps() adds all elements regardless. I can compile up to AVX, but not AVX2. EDIT: I do know beforehand how many elements of v_weight will be 0 (it will always be either none, or the last 1, 2, or 3 elements). If it's easier, how do I zero out the corresponding elements of v_entropy? Answer 1: The cmpeq/cmpgt instructions create a mask, all ones or
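A minimal sketch of the masking approach the answer describes, assuming SSE-only intrinsics (the function name is illustrative): compare against zero to build an all-ones/all-zeros mask per lane, AND it with v_entropy, then add unconditionally. The same pattern works 8-wide with _mm256_cmp_ps and _CMP_NEQ_UQ on AVX.

    #include <immintrin.h>

    // Sketch (SSE): add v_entropy to v_weight only where v_weight != 0.0f.
    // _mm_cmpneq_ps yields all-ones lanes where the comparison is true,
    // so ANDing it with v_entropy zeroes the lanes that must not be added.
    static inline __m128 add_where_nonzero(__m128 v_weight, __m128 v_entropy)
    {
        __m128 mask   = _mm_cmpneq_ps(v_weight, _mm_setzero_ps()); // all ones where weight != 0
        __m128 masked = _mm_and_ps(v_entropy, mask);               // entropy or 0 per lane
        return _mm_add_ps(v_weight, masked);
    }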

Reverse byte order in XMM or YMM register?

Submitted by 你说的曾经没有我的故事 on 2021-02-04 06:30:06
Question: Let's say I want to reverse the byte order of a very large byte array. I can do this the slow way using the main registers, but I would like to speed it up using the XMM or YMM registers. Is there a way to reverse the byte order in an XMM or YMM register? Answer 1: Yes, use SSSE3 _mm_shuffle_epi8 or AVX2 _mm256_shuffle_epi8 to shuffle bytes within 16-byte AVX2 "lanes". Depending on the shuffle control vector, you can swap pairs of bytes, reverse 4-byte units, or reverse 8-byte units. Or reverse all
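A minimal sketch of the 16-byte case, assuming SSSE3 (the control vector and function name are illustrative). For a full 32-byte YMM reverse, the in-lane _mm256_shuffle_epi8 result would additionally need its 128-bit halves swapped, e.g. with _mm256_permute2x128_si256.

    #include <immintrin.h>

    // Sketch (SSSE3): reverse all 16 bytes of an XMM register with _mm_shuffle_epi8.
    // The control vector lists the source index for each destination byte,
    // so 15..0 reverses the whole register.
    static inline __m128i reverse_bytes_128(__m128i v)
    {
        const __m128i rev = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                          7, 6, 5, 4, 3, 2, 1, 0);
        return _mm_shuffle_epi8(v, rev);
    }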

SIMD: Accumulate Adjacent Pairs

Submitted by 青春壹個敷衍的年華 on 2021-02-02 09:29:36
Question: I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but it is still tricky for a newbie like me. I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel. template <typename T, typename U> inline void accumulate_2x2_x_pass( T* channel, U* accum, const size_t sx, const size_t sy, const size_t
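The question's template is generic over T and U; as a hedged sketch for one concrete instantiation (uint8_t pixels accumulated into uint16_t sums, names hypothetical), the usual SSSE3 trick is _mm_maddubs_epi16 with a vector of ones, which sums adjacent unsigned byte pairs into 16-bit lanes:

    #include <immintrin.h>
    #include <stdint.h>

    // Sketch: accumulate sums of adjacent pixel pairs for one row.
    // _mm_maddubs_epi16 multiplies unsigned bytes by signed bytes and adds adjacent
    // products, so multiplying by 1 yields the sum of each adjacent byte pair.
    static inline void accumulate_pairs_u8_to_u16(const uint8_t* row, uint16_t* accum, size_t n)
    {
        const __m128i ones = _mm_set1_epi8(1);
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m128i px    = _mm_loadu_si128((const __m128i*)(row + i));
            __m128i pairs = _mm_maddubs_epi16(px, ones);             // 8 pair sums, 16-bit each
            __m128i acc   = _mm_loadu_si128((__m128i*)(accum + i / 2));
            _mm_storeu_si128((__m128i*)(accum + i / 2), _mm_add_epi16(acc, pairs));
        }
        // The remaining (n % 16) pixels would be handled with scalar code.
    }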

AVX intrinsics for tiled matrix multiplication [closed]

Submitted by 我怕爱的太早我们不能终老 on 2021-01-29 13:18:23
Question: Closed. This question needs debugging details and is not currently accepting answers. Closed 1 year ago. I was trying to use AVX512 intrinsics to vectorize my (tiled) matrix multiplication loop. I used __m256d variables to store intermediate results and then store them into my result matrix. However, somehow this triggers memory corruption. I've got no hint why this is the case, as the non
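Since the poster's code is not shown, the following is only a generic illustration, not their kernel: a 4-wide double-precision FMA inner loop of the kind such a tile update typically uses (the function name and the packed B layout are assumptions). Out-of-bounds stores at tile edges are a common cause of the corruption described.

    #include <immintrin.h>

    // Illustrative inner kernel for a tiled double-precision matmul.
    // Accumulates 4 adjacent elements of one row of C over kc values of k.
    static inline void kernel_4(const double* a_row, const double* b_col4,
                                double* c4, int kc)
    {
        __m256d acc = _mm256_loadu_pd(c4);               // running C tile (4 doubles)
        for (int k = 0; k < kc; ++k) {
            __m256d a = _mm256_broadcast_sd(&a_row[k]);  // A(i,k) splatted to all lanes
            __m256d b = _mm256_loadu_pd(&b_col4[4 * k]); // 4 packed B values for this k
            acc = _mm256_fmadd_pd(a, b, acc);            // requires FMA3; otherwise mul + add
        }
        _mm256_storeu_pd(c4, acc);
    }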

Why AVX dot product slower than native C++ code

Submitted by Deadly on 2021-01-28 14:04:54
Question: I have the following AVX and native code: __forceinline double dotProduct_2(const double* u, const double* v) { _mm256_zeroupper(); __m256d xy = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v)); __m256d temp = _mm256_hadd_pd(xy, xy); __m128d dotproduct = _mm_add_pd(_mm256_extractf128_pd(temp, 0), _mm256_extractf128_pd(temp, 1)); return dotproduct.m128d_f64[0]; } __forceinline double dotProduct_1(const D3& a, const D3& b) { return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]; }
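For a single 4-element dot product the scalar version often wins, because the cost is dominated by the horizontal reduction rather than the multiply. As a hedged sketch, a reduction that avoids vhaddpd and the MSVC-only .m128d_f64 member access could look like this (function name is illustrative):

    #include <immintrin.h>

    // Sketch: 4-element double dot product with an extract/add/unpack reduction.
    static inline double dot4(const double* u, const double* v)
    {
        __m256d xy   = _mm256_mul_pd(_mm256_loadu_pd(u), _mm256_loadu_pd(v));
        __m128d lo   = _mm256_castpd256_pd128(xy);        // elements 0,1
        __m128d hi   = _mm256_extractf128_pd(xy, 1);      // elements 2,3
        __m128d sum2 = _mm_add_pd(lo, hi);                // {x0+x2, x1+x3}
        __m128d swap = _mm_unpackhi_pd(sum2, sum2);       // {x1+x3, x1+x3}
        return _mm_cvtsd_f64(_mm_add_pd(sum2, swap));     // scalar total
    }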

Does MSVC 2017 support automatic CPU dispatch?

Submitted by 一曲冷凌霜 on 2021-01-27 07:30:39
Question: I read on a few sites that MSVC can actually emit, say, AVX instructions when the SSE2 architecture is targeted, and detect AVX support at runtime. Is that true? I tested various loops that would definitely benefit from AVX/AVX2 support, but when run in the debugger I couldn't really find any AVX instructions. When /arch:AVX is used, the compiler emits AVX instructions, but then of course the program crashes on CPUs that don't support them (tested), so no runtime detection there either. I could use AVX intrinsics, though, and it
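If the compiler does not dispatch automatically, the usual fallback is manual dispatch: detect AVX support at runtime and call a function compiled in a separate /arch:AVX translation unit. A minimal detection sketch using the MSVC intrinsics __cpuid and _xgetbv (function name is illustrative); note that the OS must also have enabled YMM state saving:

    #include <intrin.h>
    #include <immintrin.h>

    // Sketch: returns nonzero if the CPU supports AVX and the OS saves YMM state.
    static int cpu_has_avx(void)
    {
        int regs[4];
        __cpuid(regs, 1);
        int osxsave = (regs[2] >> 27) & 1;   // ECX bit 27: OSXSAVE
        int avx     = (regs[2] >> 28) & 1;   // ECX bit 28: AVX
        if (!osxsave || !avx)
            return 0;
        unsigned long long xcr0 = _xgetbv(0);
        return (xcr0 & 0x6) == 0x6;          // XMM and YMM state enabled by the OS
    }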

Multiply-add vectorization slower with AVX than with SSE

Submitted by 雨燕双飞 on 2021-01-27 06:01:40
Question: I have a piece of code that runs under a heavily contended lock, so it needs to be as fast as possible. The code is very simple: a basic multiply-add over a bunch of data, which looks like this: for( int i = 0; i < size; i++ ) { c[i] += (double)a[i] * (double)b[i]; } Under -O3 with SSE support enabled, the code is vectorized as I would expect. However, with AVX code generation turned on I get a 10-15% slowdown instead of a speedup, and I can't figure out why. Here's
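For comparison against the compiler's output, a hand-written AVX version of the loop might look like the sketch below, assuming a and b are float arrays widened to double as the casts suggest (names are illustrative):

    #include <immintrin.h>

    // Sketch: 4 iterations per step, float inputs widened to double, AVX only (no FMA).
    static void muladd_avx(const float* a, const float* b, double* c, int size)
    {
        int i = 0;
        for (; i + 4 <= size; i += 4) {
            __m256d va = _mm256_cvtps_pd(_mm_loadu_ps(a + i));   // 4 floats -> 4 doubles
            __m256d vb = _mm256_cvtps_pd(_mm_loadu_ps(b + i));
            __m256d vc = _mm256_loadu_pd(c + i);
            _mm256_storeu_pd(c + i, _mm256_add_pd(vc, _mm256_mul_pd(va, vb)));
        }
        for (; i < size; ++i)                                    // scalar tail
            c[i] += (double)a[i] * (double)b[i];
    }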

Comparing 2 vectors in AVX/AVX2 (c)

Submitted by 和自甴很熟 on 2021-01-20 07:12:20
Question: I have two __m256i vectors (each containing chars), and I want to find out whether they are completely identical. All I need is true if all bits are equal, and 0 otherwise. What's the most efficient way of doing that? Here's the code that loads the arrays: char * a1 = "abcdefhgabcdefhgabcdefhgabcdefhg"; __m256i r1 = _mm256_load_si256((__m256i *) a1); char * a2 = "abcdefhgabcdefhgabcdefhgabcdefhg"; __m256i r2 = _mm256_load_si256((__m256i *) a2); Answer 1: The most efficient way on current Intel and
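A minimal AVX2 sketch of the byte-compare-and-movemask idiom (function name is illustrative). Note also that the string literals in the question are not guaranteed to be 32-byte aligned, so _mm256_loadu_si256 would be the safer load.

    #include <immintrin.h>

    // Sketch (AVX2): _mm256_cmpeq_epi8 sets each byte to 0xFF where equal;
    // movemask packs the high bit of every byte, so the vectors are identical
    // iff all 32 mask bits are set (the int compares equal to -1).
    static inline int vectors_equal(__m256i a, __m256i b)
    {
        __m256i eq = _mm256_cmpeq_epi8(a, b);
        return _mm256_movemask_epi8(eq) == -1;
    }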

Writing a portable SSE/AVX version of std::copysign

Submitted by 蹲街弑〆低调 on 2021-01-18 12:07:07
Question: I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires selecting the sign of a value opposite/equal to that of another value. In the serial version I used std::copysign for this. Now I want to create a similar function for SSE/AVX registers. Unfortunately, the STL uses a built-in function for that, so I can't just copy the code and turn it into SSE/AVX instructions. I have not tried it yet (so I have no
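A hedged sketch of the usual bit-masking approach for the SSE float case (the AVX and double variants are analogous); -0.0f has only the sign bit set, so it serves as the splice mask. The function name is illustrative:

    #include <immintrin.h>

    // Sketch: branch-free copysign for 4 floats, same semantics as std::copysign(mag, from):
    // take the magnitude bits of 'mag' and the sign bits of 'from'.
    static inline __m128 copysign_ps(__m128 mag, __m128 from)
    {
        const __m128 signmask = _mm_set1_ps(-0.0f);
        return _mm_or_ps(_mm_and_ps(signmask, from),     // sign bits of 'from'
                         _mm_andnot_ps(signmask, mag));  // magnitude bits of 'mag'
    }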