sse4

_mm_testc_ps and _mm_testc_pd vs _mm_testc_si128

一个人想着一个人, posted on 2019-12-02 04:55:45
As you know, the first two are AVX-specific intrinsics and the third is an SSE4.1 intrinsic. Both sets of intrinsics can be used to check for equality of two floating-point vectors. My specific use case is _mm_cmpeq_ps or _mm_cmpeq_pd, followed by _mm_testc_ps or _mm_testc_pd on the result, with an appropriate mask. But AVX provides equivalents for "legacy" intrinsics, so I might be able to use _mm_testc_si128 after casting the result to __m128i. My questions are: which of the two use cases results in better performance, and where can I find out what legacy SSE instructions are provided by
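As a rough sketch of the approaches the question compares (not a benchmark, and not the asker's code), the helpers below check whether all four lanes of two float vectors compare equal. The function names are made up for illustration, and the target attributes assume GCC or Clang. Note that _mm_testc_ps looks only at the sign bit of each lane, while _mm_testc_si128 looks at all 128 bits; on a _mm_cmpeq_ps result, whose lanes are all-ones or all-zeros, both give the same answer.

```c
#include <immintrin.h>

/* Baseline: compare, then gather the four lane sign bits with MOVMSKPS. */
__attribute__((target("sse4.1")))
int all_equal_movemask(const float *a, const float *b)
{
    __m128 eq = _mm_cmpeq_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return _mm_movemask_ps(eq) == 0xF;
}

/* SSE4.1 PTEST: CF is set when (~eq & ones) == 0, i.e. eq is all-ones. */
__attribute__((target("sse4.1")))
int all_equal_ptest(const float *a, const float *b)
{
    __m128 eq = _mm_cmpeq_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128i ones = _mm_set1_epi32(-1);
    return _mm_testc_si128(_mm_castps_si128(eq), ones);
}

/* AVX VTESTPS: same idea, but only the sign bit of each lane is tested. */
__attribute__((target("avx")))
int all_equal_vtestps(const float *a, const float *b)
{
    __m128 eq = _mm_cmpeq_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 ones = _mm_castsi128_ps(_mm_set1_epi32(-1));
    return _mm_testc_ps(eq, ones);
}
```

All three return 1 only when every lane compared equal; a NaN in either input makes its lane compare unequal. Which variant is faster depends on the microarchitecture and has to be measured.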

SSE42 & STTNI - PcmpEstrM is twice slower than PcmpIstrM, is it true?

送分小仙女□, posted on 2019-12-01 16:36:42
I'm experimenting with SSE4.2 and STTNI instructions and got a strange result: PcmpEstrM (which works with explicit-length strings) runs about twice as slow as PcmpIstrM (implicit-length strings). On my i7 3610QM the difference is 2366.2 ms vs. 1202.3 ms (97%). On an i5 3470 the difference is not as large but is still significant: 3206.2 ms vs. 2623.2 ms (22%). Both are "Ivy Bridge", so it is strange that the gap differs so much between them (at least I can't see any technical differences in their specs: http://www.cpu-world.com/Compare_CPUs/Intel_AW8063801013511,Intel_CM8063701093302/ ). Intel 64 and IA-32
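To make the two forms concrete, here is a minimal sketch (function names and the character-set example are illustrative, not the asker's benchmark) that computes the same bit mask with both intrinsics. The implicit form stops at the first zero byte of each operand; the explicit form takes the two lengths as extra arguments, which the instruction reads from EAX and EDX. The target attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Unsigned bytes, "equal any" (character-set match), result as a bit mask. */
#define STTNI_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK)

/* PCMPISTRM: both operands are treated as NUL-terminated within 16 bytes. */
__attribute__((target("sse4.2")))
int match_mask_implicit(const char set[16], const char text[16])
{
    __m128i s = _mm_loadu_si128((const __m128i *)set);
    __m128i t = _mm_loadu_si128((const __m128i *)text);
    return _mm_cvtsi128_si32(_mm_cmpistrm(s, t, STTNI_MODE));
}

/* PCMPESTRM: the caller supplies both lengths explicitly. */
__attribute__((target("sse4.2")))
int match_mask_explicit(const char set[16], int set_len,
                        const char text[16], int text_len)
{
    __m128i s = _mm_loadu_si128((const __m128i *)set);
    __m128i t = _mm_loadu_si128((const __m128i *)text);
    return _mm_cvtsi128_si32(_mm_cmpestrm(s, set_len, t, text_len, STTNI_MODE));
}
```

Bit i of the result is set when text[i] is one of the bytes in set. With the set "aeiou" and the text "hello world", both functions set bits 1, 4 and 7.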

MOVDQU instruction + page boundary

荒凉一梦, posted on 2019-12-01 15:26:49
I have a simple test program that loads an xmm register with the movdqu instruction, accessing data across a page boundary (OS = Linux). If the following page is mapped, this works just fine. If it's not mapped, I get a SIGSEGV, which is probably expected. However, this diminishes the usefulness of unaligned loads quite a bit. Additionally, SSE4.2 instructions (like pcmpistri) that allow unaligned memory references appear to exhibit this behavior as well. That's all fine, except that many implementations of strcmp using pcmpistri that I've found don't seem to address this
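One common way to guard against this is to check, before each 16-byte load, whether the load would run past the end of the current page. The sketch below assumes 4 KiB pages (the SKETCH_PAGE_SIZE constant and the function names are hypothetical; real code would query the page size or rely on a platform guarantee).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as is typical on x86 Linux. */
#define SKETCH_PAGE_SIZE 4096u

/* Returns 1 if a 16-byte load starting at p touches the next page. */
int load16_crosses_page(const void *p)
{
    uintptr_t offset = (uintptr_t)p & (SKETCH_PAGE_SIZE - 1);
    return offset > SKETCH_PAGE_SIZE - 16;
}

/* Number of bytes from p up to (not including) the next page start. */
size_t bytes_to_page_end(const void *p)
{
    uintptr_t offset = (uintptr_t)p & (SKETCH_PAGE_SIZE - 1);
    return (size_t)(SKETCH_PAGE_SIZE - offset);
}
```

A string routine could fall back to a byte-wise loop for the few bytes before the boundary whenever load16_crosses_page returns 1, and use the 16-byte load otherwise.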

How much faster are SSE4.2 string instructions than SSE2 for memcmp?

点点圈, posted on 2019-12-01 10:46:44
Here is my assembly code. Can you embed it in C++ and benchmark it against SSE4? I would very much like to see how much speed the move to SSE4 has gained, or whether it is not worth worrying about at all. Let's check (I do not have support for anything above SSSE3).
{ sse2 strcmp WideChar 32 bit }
function CmpSee2(const P1, P2: Pointer; len: Integer): Boolean;
asm
  push ebx        // save ebx
  cmp EAX, EDX    // Str = Str2
  je @@true       // to exit true
  test eax, eax   // not Str
  je @@false      // to exit false
  test edx, edx   // not Str2
  je @@false      // to exit false
  sub edx, eax    // Str2 := Str2 - Str;
  mov ebx, [eax]  // get Str 4 byte
  xor
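For comparison purposes, a C counterpart of an SSE2 equality check might look like the sketch below. It is not a translation of the Delphi routine above (which is truncated here); it only shows the usual PCMPEQB + PMOVMSKB pattern, 16 bytes at a time, with memcmp handling the tail. The function name is illustrative and the target attribute assumes GCC or Clang.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the two buffers hold the same len bytes, 0 otherwise. */
__attribute__((target("sse2")))
int mem_equal_sse2(const void *p1, const void *p2, size_t len)
{
    const unsigned char *a = (const unsigned char *)p1;
    const unsigned char *b = (const unsigned char *)p2;
    size_t i = 0;

    /* Compare full 16-byte blocks; unaligned loads keep the sketch simple. */
    for (; i + 16 <= len; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* Every byte equal means all 16 mask bits are set. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 0;
    }

    /* Remaining 0..15 bytes. */
    return memcmp(a + i, b + i, len - i) == 0;
}
```

Timing this against an SSE4.2 PCMPESTRI-based loop on the same buffers is one way to answer the question empirically; the outcome depends on the CPU.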

How much faster are SSE4.2 string instructions than SSE2 for memcmp?

心不动则不痛, posted on 2019-12-01 08:20:58
Question: Here is my assembly code. Can you embed it in C++ and benchmark it against SSE4? I would very much like to see how much speed the move to SSE4 has gained, or whether it is not worth worrying about at all. Let's check (I do not have support for anything above SSSE3).
{ sse2 strcmp WideChar 32 bit }
function CmpSee2(const P1, P2: Pointer; len: Integer): Boolean;
asm
  push ebx        // save ebx
  cmp EAX, EDX    // Str = Str2
  je @@true       // to exit true
  test eax, eax   // not Str
  je @@false      // to exit false
  test edx, edx   // not Str2

_mm_crc32_u64 poorly defined

核能气质少年, posted on 2019-11-30 20:40:50
Why in the world was _mm_crc32_u64(...) defined like this? unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v ); The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (it is, after all, CRC32, not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored and filled with 0's on completion, so there is NO use to EVER have a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source
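In practice the 64-bit return type is usually hidden behind a thin wrapper that keeps a 32-bit CRC at the interface. The sketch below is one such wrapper (the name crc32c_update is made up); it assumes an x86-64 build with GCC or Clang, since _mm_crc32_u64 is only available in 64-bit mode.

```c
#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Feeds len bytes into a running CRC-32C value and returns the new value. */
__attribute__((target("sse4.2")))
uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t acc = crc;              /* upper 32 bits stay zero throughout */

    /* Eight bytes per instruction; memcpy avoids unaligned-access issues. */
    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, p, sizeof chunk);
        acc = _mm_crc32_u64(acc, chunk);
        p += 8;
        len -= 8;
    }

    /* Narrow back to 32 bits and finish the tail one byte at a time. */
    uint32_t result = (uint32_t)acc;
    while (len--)
        result = _mm_crc32_u8(result, *p++);
    return result;
}
```

With the conventional initial value 0xFFFFFFFF and a final inversion, the string "123456789" yields the standard CRC-32C check value 0xE3069283.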

_mm_crc32_u64 poorly defined

僤鯓⒐⒋嵵緔, posted on 2019-11-30 04:51:01
Question: Why in the world was _mm_crc32_u64(...) defined like this? unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v ); The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (it is, after all, CRC32, not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored and filled with 0's on completion, so there is NO use to EVER have a 64-bit destination. I understand why Intel allowed a 64-bit destination

What's the difference between __popcnt() and _mm_popcnt_u32()?

↘锁芯ラ, posted on 2019-11-27 16:29:14
Question: MS Visual C++ supports two flavors of the popcnt instruction on CPUs with SSE4.2: __popcnt() and _mm_popcnt_u32(). The only difference I found was that the docs for __popcnt() are marked "Microsoft Specific", while _mm_popcnt_u32() seems to be an intrinsic name that is not MS-specific. Is this the only difference, with the MS __popcnt() just calling the HW _mm_popcnt_u32()?
Answer 1: These are two different intrinsic names for the same machine instruction, thanks to Intel and AMD. The instruction is
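A small sketch of the non-MS-specific spelling, next to a plain software count for checking results. The function names are illustrative, and the target attribute assumes GCC or Clang; the MSVC-only __popcnt() is deliberately not used here.

```c
#include <nmmintrin.h>

/* Hardware population count via the POPCNT instruction. */
__attribute__((target("popcnt")))
unsigned popcount_hw(unsigned x)
{
    return (unsigned)_mm_popcnt_u32(x);
}

/* Portable reference: clear the lowest set bit until none remain. */
unsigned popcount_sw(unsigned x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}
```

Code that must run on CPUs without POPCNT would need a runtime feature check before calling popcount_hw.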

Can PTEST be used to test if two registers are both zero or some other condition?

感情迁移, posted on 2019-11-27 06:05:24
Question: What can you do with SSE4.1 ptest other than testing if a single register is all-zero? Can you use a combination of ZF and CF to test anything useful about two unknown input registers? What is PTEST good for? You'd think it would be good for checking the result of a packed-compare (like PCMPEQD or CMPPS), but at least on Intel CPUs, it costs more uops to compare-and-branch using PTEST + JCC than with PMOVMSK(B/PS/PD) + macro-fused CMP+JCC. See also Checking if TWO SSE registers are not both
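For reference, PTEST sets ZF when (a AND b) is all-zero and CF when ((NOT a) AND b) is all-zero. The sketch below shows the single-register all-zero test, a PMOVMSKB-based alternative, and a two-register use of ZF. The function names are illustrative, and the target attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* PTEST v, v: ZF is set exactly when v is all-zero. */
__attribute__((target("sse4.1")))
int is_all_zero_ptest(const void *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm_testz_si128(v, v);
}

/* Alternative without PTEST: compare bytes to zero, then PMOVMSKB. */
__attribute__((target("sse4.1")))
int is_all_zero_movemask(const void *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

/* Two registers: ZF tells whether a and b share any set bit. */
__attribute__((target("sse4.1")))
int no_common_bits(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return _mm_testz_si128(va, vb);
}
```

Note that no_common_bits returning 1 does not mean both inputs are zero; it only means their bitwise AND is zero, which is why a single PTEST cannot answer the "both zero" question on its own.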