sse4

How to simulate pcmpgtq on sse2?

寵の児 submitted on 2021-02-05 05:13:17
Question: PCMPGTQ was introduced in SSE4.2, and it provides a greater-than signed comparison for 64-bit numbers that yields a mask. How does one support this functionality on instruction sets predating SSE4.2? Update: The same question applies to ARMv7 with NEON, which also lacks a 64-bit comparator. The sister question is found here: What is the most efficient way to support CMGT with 64-bit signed comparisons on ARMv7a with NEON? Answer 1: __m128i pcmpgtq_sse2 (__m128i a, __m128i b) { __m128i r =

Generate code for multiple SIMD architectures

梦想的初衷 submitted on 2020-05-09 19:44:25
Question: I have written a library where I use CMake to verify the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition, I check for the presence of the instructions and, if present, add the necessary compiler flags (-msse2 -mavx -mfma, etc.). This is all very well, but I would like to deploy a single binary that works across a range of processor generations. Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using

SSE42 & STTNI - PcmpEstrM is twice slower than PcmpIstrM, is it true?

安稳与你 submitted on 2020-01-10 04:54:06
Question: I'm experimenting with SSE4.2 and STTNI instructions and have gotten a strange result: PcmpEstrM (which works with explicit-length strings) runs twice as slow as PcmpIstrM (implicit-length strings). On my i7-3610QM the difference is 2366.2 ms vs. 1202.3 ms (97%). On an i5-3470 the difference is not as large, but still significant: 3206.2 ms vs. 2623.2 ms (22%). Both are "Ivy Bridge" — it is strange that they show such different gaps (at least I can't see any technical differences in their specs -

How to compare more than two numbers in parallel?

帅比萌擦擦* submitted on 2019-12-22 05:39:13
Question: Is it possible to compare more than one pair of numbers in a single instruction using SSE4? The Intel reference says the following about PCMPGTQ: "PCMPGTQ — Compare Packed Data for Greater Than. Performs an SIMD compare for the packed quadwords in the destination operand (first operand) and the source operand (second operand). If the data element in the first (destination) operand is greater than the corresponding element in the second (source) operand, the corresponding data element in the destination is

MOVDQU instruction + page boundary

∥☆過路亽.° submitted on 2019-12-19 13:49:21
Question: I have a simple test program that loads an XMM register with the movdqu instruction, accessing data across a page boundary (OS = Linux). If the following page is mapped, this works just fine. If it's not mapped, I get a SIGSEGV, which is probably expected. However, this diminishes the usefulness of unaligned loads quite a bit. Additionally, SSE4.2 instructions (like pcmpistri) which allow unaligned memory references appear to exhibit this behavior as well. That's all fine -- except

fast compact register using sse

…衆ロ難τιáo~ submitted on 2019-12-13 18:22:07
Question: I am trying to figure out how to use the _mm_shuffle_epi8 intrinsic (SSSE3) to compact a 128-bit register. Say I have an input variable __m128i target, which is basically eight 16-bit values, indicated as: a[0], a[1] ... a[7] (each slot is 16 bits). My output is called __m128i output. Now I have a bit vector of size 8: char bit_mask (8 bits; the i-th bit indicates whether the corresponding a[i] should be included). OK, how can I get the final result based on the bit_mask and the input target? Assume my bit vector is