AVX

Do all CPUs which support AVX2 also support SSE4.2 and AVX?

Submitted by 自作多情 on 2020-07-29 12:06:11
Question: I am planning to implement runtime detection of SIMD extensions. If I detect that the processor supports AVX2, is it also guaranteed to support SSE4.2 and AVX?

Answer 1: Support for a more-recent Intel SIMD ISA extension implies support for the previous ones. AVX2 definitely implies AVX1. I think AVX1 implies that all of the SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2 feature bits must also be set in CPUID. Even if not formally guaranteed, many things make this assumption, and a CPU that

Largest data type which can be fetch-ANDed atomically?

Submitted by 拟墨画扇 on 2020-07-03 06:27:32
Question: I wanted to try to atomically reset 256 bits using something like this:

    #include <x86intrin.h>
    #include <iostream>
    #include <array>
    #include <atomic>

    int main() {
        std::array<std::atomic<__m256i>, 10> updateArray;
        __m256i allZeros = _mm256_setzero_si256();
        updateArray[0].fetch_and(allZeros);
    }

but I get compiler errors about the element not having fetch_and(). Is this not possible because a 256-bit type is too large to guarantee atomicity? Is there any other way I can implement this? I am using

What is the reason for AVX floating-point bitwise logical operations?

Submitted by 吃可爱长大的小学妹 on 2020-05-27 04:25:47
Question: AVX allows bitwise logical operations such as AND/OR on the floating-point data types __m256 and __m256d. However, C++ doesn't allow bitwise operations on floats and doubles, reasonably: there's no guarantee about the internal representation of floats, i.e. whether the compiler uses IEEE 754 or not, so a programmer can't be sure how the bits of a float will look. Consider this example:

    #include <immintrin.h>
    #include <iostream>
    #include <limits>
    #include <cassert>

    int

Generate code for multiple SIMD architectures

Submitted by 梦想的初衷 on 2020-05-09 19:44:25
Question: I have written a library in which I use CMake to verify the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition, I check for the presence of the instructions and, if present, add the necessary compiler flags: -msse2 -mavx -mfma etc. This all works well, but I would like to deploy a single binary that works across a range of processor generations. Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using

How to load an AVX-512 zmm register from an ioremap() address?

Submitted by 跟風遠走 on 2020-04-16 02:58:10
Question: My goal is to create a PCIe transaction with a payload larger than 64 bits. For that I need to read from an ioremap() address. For 128-bit and 256-bit payloads I can use xmm and ymm registers respectively, and that works as expected. Now I'd like to do the same with the 512-bit zmm registers (memory-like storage?!). Code under a license I'm not allowed to show here uses assembly for the 256-bit case:

    void __iomem *addr;
    uint8_t datareg[32];
    [...]
    // Read memory address to ymm (to have 256b at once):
    asm volatile("vmovdqa %0,%%ymm1" :

AVX 256-bit vectors slightly slower than scalar (~10%) for STREAM-like double add loop on huge arrays, on Xeon Gold

Submitted by 回眸只為那壹抹淺笑 on 2020-04-11 04:56:06
Question: I am new to the AVX-512 instruction set, and I wrote the following code as a demo.

    #include <iostream>
    #include <array>
    #include <chrono>
    #include <vector>
    #include <cstring>
    #include <omp.h>
    #include <immintrin.h>
    #include <cstdlib>

    int main() {
        unsigned long m, n, k;
        m = n = k = 1 << 30;
        auto *a = static_cast<double*>(aligned_alloc(512, m*sizeof(double)));
        auto *b = static_cast<double*>(aligned_alloc(512, n*sizeof(double)));
        auto *c = static_cast<double*>(aligned_alloc(512, k*sizeof(double)));

Why both? vperm2f128 (AVX) vs vperm2i128 (AVX2)

Submitted by 北城以北 on 2020-04-09 17:57:16
Question: AVX introduced the instruction vperm2f128 (exposed via _mm256_permute2f128_si256), while AVX2 introduced vperm2i128 (exposed via _mm256_permute2x128_si256). They both seem to do exactly the same thing, and their respective latencies and throughputs also seem to be identical. So why do both instructions exist? There has to be some reasoning behind that. Is there maybe something I have overlooked? Given that AVX2 operates on data structures introduced with AVX, I cannot imagine that a

Matrix transpose and population count

Submitted by 点点圈 on 2020-03-16 07:27:31
问题 I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column. For instance for n=4: 1101 0101 0001 1001 M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} }; result = { 2, 2, 0, 4}; I can obviously transpose the matrix M into a matrix M' popcount each row of M'. Good algorithms exist for matrix transposition and popcounting through bit manipulation. My question is: would it be possible to "merge" such algorithms into a