sse2 | 易学教程

Optimizing RGB565 to RGB888 conversions with SSE2

Question: I'm trying to optimize pixel depth conversion from 565 to 888 using SSE2 with the basic formula: col8 = col5 << 3 | col5 >> 2; col8 = col6 << 2 | col6 >> 4. I take two 2x565 128-bit vectors and I'm outputting 3x888 128-bit vectors. After some masking, shifting and OR'ing, I got to the point where I have two vectors with ((blue << 8) | red) 8-bit colors stored in 16-bit words, and similar vectors with zero-green values. Now I need to combine them into 888 output. BR: BR7-BR6-...-BR1-BR0 0G: 0G7
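As a scalar reference for the formula quoted in the question, the sketch below expands a single pixel by replicating the high bits of each channel into the low bits. It assumes the common layout with red in the top five bits (RRRRRGGGGGGBBBBB), which the excerpt does not state; the function name `rgb565_to_rgb888` and the packed 0x00RRGGBB return value are chosen here for illustration only. It is not the SSE2 version the question is after, just a baseline to check a vectorized result against.

```c
#include <stdint.h>

/* Expand one RGB565 pixel to 8 bits per channel.
   Assumed layout (not stated in the excerpt): R in bits 15..11,
   G in bits 10..5, B in bits 4..0. Returns 0x00RRGGBB. */
uint32_t rgb565_to_rgb888(uint16_t px)
{
    uint32_t r5 = (px >> 11) & 0x1F;
    uint32_t g6 = (px >> 5) & 0x3F;
    uint32_t b5 = px & 0x1F;

    /* col8 = col5 << 3 | col5 >> 2 for the 5-bit channels,
       col8 = col6 << 2 | col6 >> 4 for the 6-bit channel. */
    uint32_t r8 = (r5 << 3) | (r5 >> 2);
    uint32_t g8 = (g6 << 2) | (g6 >> 4);
    uint32_t b8 = (b5 << 3) | (b5 >> 2);

    return (r8 << 16) | (g8 << 8) | b8;
}
```

Replicating the top bits maps the maximum 5-bit and 6-bit values to exactly 255, which a plain left shift would not.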

Why do x86 FP compares set CF like unsigned integers, instead of using signed conditions?

Question: The following documentation is provided in the Intel Instruction Reference for the COMISD instruction: Compares the double-precision floating-point values in the low quadwords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal). The purpose of the CF flag is not really clear here, since it is related to arithmetic operations on unsigned integers. By contrast, the

Determine processor support for SSE2?

Question: I need to determine processor support for SSE2 prior to installing software. From what I understand, I came up with this: bool TestSSE2(char * szErrorMsg) { __try { __asm { xorpd xmm0, xmm0 // executing SSE2 instruction } } #pragma warning (suppress: 6320) __except (EXCEPTION_EXECUTE_HANDLER) { if (_exception_code() == STATUS_ILLEGAL_INSTRUCTION) { _tcscpy_s(szErrorMsg,MSGSIZE, _T("Streaming SIMD Extensions 2(SSE2) is not supported by the CPU.\r\n Unable to launch APP")); return false; }
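An alternative to trapping an illegal-instruction exception is to ask the CPU directly through CPUID, which reports SSE2 in bit 26 of EDX for leaf 1. The sketch below is not the approach from the question; the function name `cpu_has_sse2` is hypothetical, and the code assumes an x86 target built with either MSVC (`__cpuid` from `<intrin.h>`) or GCC/Clang (`__get_cpuid` from `<cpuid.h>`).

```c
#if defined(_MSC_VER)
#include <intrin.h>   /* __cpuid */
#else
#include <cpuid.h>    /* __get_cpuid (GCC/Clang) */
#endif

/* Returns 1 if CPUID leaf 1 reports SSE2 (EDX bit 26), 0 otherwise. */
int cpu_has_sse2(void)
{
#if defined(_MSC_VER)
    int regs[4];                 /* { EAX, EBX, ECX, EDX } */
    __cpuid(regs, 1);
    return (regs[3] >> 26) & 1;
#else
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                /* leaf 1 is not available */
    return (int)((edx >> 26) & 1);
#endif
}
```

On Windows, `IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE)` is another way to make the same check without inline assembly or intrinsics. Every x86-64 processor supports SSE2, so the check only matters for 32-bit targets.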

What is __m128d?

Question: I really don't understand what kind of "keyword" __m128d is in C++. I'm using MSVC, and its documentation says: The __m128d data type, for use with the Streaming SIMD Extensions 2 instructions intrinsics, is defined in <emmintrin.h>. So, is it a data type? A typedef? If I do: #include <emmintrin.h> int main() { __m128d x; } I can't see the definition in <emmintrin.h>. Is it a compiler keyword? Does the compiler automatically convert that keyword to something like "move register xmm0", etc.? Or which kind of operation

How to convert two _pd into one _ps?

Question: I'm looping over some data, calculating doubles, and after every 2 __m128d operations I want to store the data in a __m128 float vector. So 64+64 + 64+64 (2 __m128d) stored into 1 32+32+32+32 __m128. I do something like this: __m128d v_result; __m128 v_result_float; ... // some operations on v_result // store the first two "slots" as float v_result_float = _mm_cvtpd_ps(v_result); // some operations on v_result // I need to store the last two "slots" as float v_result_float = _mm_cvtpd_ps(v_result); ?!?
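One way to combine the two conversions, sketched here under the assumption that both double vectors are available at the same time, is to convert each with `_mm_cvtpd_ps` (which fills the low two float lanes and zeroes the high two) and then merge the two low halves with `_mm_movelh_ps`. The helper name `pack_pd_to_ps` is hypothetical and is not taken from the page.

```c
#include <xmmintrin.h>  /* SSE:  _mm_movelh_ps */
#include <emmintrin.h>  /* SSE2: _mm_cvtpd_ps  */

/* Convert two vectors of 2 doubles into one vector of 4 floats.
   Result lanes: { lo[0], lo[1], hi[0], hi[1] }. */
__m128 pack_pd_to_ps(__m128d lo, __m128d hi)
{
    __m128 a = _mm_cvtpd_ps(lo);   /* { lo0, lo1, 0, 0 } */
    __m128 b = _mm_cvtpd_ps(hi);   /* { hi0, hi1, 0, 0 } */
    return _mm_movelh_ps(a, b);    /* low half of a, then low half of b */
}
```

In a loop like the one in the question, this means keeping the first converted result in a temporary until the second `__m128d` has been computed.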

Complex data reorganization with vector instructions

Question: I need to load and rearrange 12 bytes into 16 (or 24 into 32) following the pattern below: ABC DEF GHI JKL becomes ABBC DEEF GHHI JKKL. Can you suggest efficient ways to achieve this using the SSE(2) and/or AVX(2) instructions? This needs to be performed repeatedly, so pre-stored masks or constants are allowed. Answer 1: By far your best bet is to use a byte shuffle (pshufb). Shifting within elements isn't enough by itself, since JKL has to move farther to the right than DEF, etc. So you
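To make the byte-shuffle idea concrete, the sketch below writes out the 16 source indices that produce ABBC DEEF GHHI JKKL from ABC DEF GHI JKL and applies them with a plain loop. The same index table is what one would load as the control operand of `_mm_shuffle_epi8` (the pshufb intrinsic, SSSE3, `<tmmintrin.h>`); the scalar loop is used here only so the example runs without SSSE3. The names `kExpandIdx` and `expand_12_to_16` are hypothetical, and this is not the code from the truncated answer.

```c
#include <stdint.h>

/* Source index for each of the 16 output bytes:
   ABC DEF GHI JKL -> ABBC DEEF GHHI JKKL. */
static const uint8_t kExpandIdx[16] = {
    0, 1, 1, 2,   3, 4, 4, 5,   6, 7, 7, 8,   9, 10, 10, 11
};

/* Scalar model of a 16-byte shuffle driven by kExpandIdx. */
void expand_12_to_16(const uint8_t src[12], uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = src[kExpandIdx[i]];
}
```

With a real pshufb, the 12 input bytes would first be loaded into a 16-byte register, so the load must not read past the end of the source buffer.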

Why does V8 in Node.js 0.12.0 release require SSE2 CPU instructions?

Question: I'm trying to upgrade Node.js from 0.10.x to 0.12.0. The first thing I noticed is an error saying that SSE2 instructions are not supported by my CPU (indeed they are not). I tried to compile Node.js from source, but it failed for the same reason. In deps/v8/src/ia32/assembler-ia32.cc there is a line stating CHECK(cpu.has_sse2()); // SSE2 support is mandatory. I wonder if there is a way to get rid of this SSE2 dependency, which was not required in Node.js 0.10.x. Just commenting out this

How can I set __m128i without using of any SSE instruction?

Question: I have many functions that use the same constant __m128i values. For example: const __m128i K8 = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16); const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8); const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4); So I want to store all these constants in one place. But there is a problem: I check for the available CPU extensions at run time. If the CPU doesn't support, for example, SSE (or AVX), then the program will crash

Fast counting the number of equal bytes between two arrays [duplicate]

Question: This question already has answers here: Can counting byte matches between two strings be optimized using SIMD? (3 answers) Closed 8 months ago. I wrote the function int compare_16bytes(__m128i lhs, __m128i rhs) in order to compare two 16-byte numbers using SSE instructions: this function returns how many bytes are equal after performing the comparison. Now I would like to use the above function to compare two byte arrays of arbitrary length: the length may not be a multiple of 16
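A minimal sketch of one way to handle arbitrary lengths: process full 16-byte blocks with SSE2 and finish the remaining bytes with a scalar loop. The body of the asker's `compare_16bytes` is not shown on the page, so the block comparison here is written from scratch with `_mm_cmpeq_epi8` and `_mm_movemask_epi8`; the function name `count_equal_bytes` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Count positions i < n where a[i] == b[i]. */
size_t count_equal_bytes(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t count = 0;
    size_t i = 0;

    /* Full 16-byte blocks; unaligned loads, so no alignment is required. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* One mask bit per byte, set where the bytes are equal. */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
        for (; mask != 0; mask &= mask - 1)   /* clear the lowest set bit */
            count++;
    }

    /* Tail: fewer than 16 bytes left. */
    for (; i < n; i++)
        count += (a[i] == b[i]);

    return count;
}
```

The scalar tail avoids reading past the end of either array, which a final 16-byte load on a shorter remainder would do.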

Fast counting the number of set bits in __m128i register

Question: I need to count the number of set bits of a __m128i register. In particular, I need to write two functions that count the bits of the register in the following ways: the total number of set bits of the register, and the number of set bits for each byte of the register. Are there intrinsic functions that can perform, wholly or partially, the above operations? Answer 1: Here is some code I used in an old project (there is a research paper about it). The function popcnt8 below
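Since the answer's `popcnt8` is cut off on this page, the sketch below is not that code. It is an SSE2-only illustration of both requested counts using the well-known bit-parallel reduction (pairs, then nibbles, then bytes), followed by `_mm_sad_epu8` to sum the per-byte counts. The function names `popcount_per_byte` and `popcount_total` are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Each output byte holds the number of set bits in the matching input byte. */
__m128i popcount_per_byte(__m128i v)
{
    const __m128i m1 = _mm_set1_epi8(0x55);
    const __m128i m2 = _mm_set1_epi8(0x33);
    const __m128i m4 = _mm_set1_epi8(0x0F);

    /* SSE2 has no 8-bit shift; the 16-bit shift leaks bits across bytes,
       and the masks discard exactly those leaked bits. */
    v = _mm_sub_epi8(v, _mm_and_si128(_mm_srli_epi16(v, 1), m1));
    v = _mm_add_epi8(_mm_and_si128(v, m2),
                     _mm_and_si128(_mm_srli_epi16(v, 2), m2));
    v = _mm_and_si128(_mm_add_epi8(v, _mm_srli_epi16(v, 4)), m4);
    return v;
}

/* Total number of set bits in the 128-bit register. */
int popcount_total(__m128i v)
{
    /* Sum of absolute differences against zero adds up each group of
       8 bytes into the low 16 bits of each 64-bit half. */
    __m128i sums = _mm_sad_epu8(popcount_per_byte(v), _mm_setzero_si128());
    return _mm_cvtsi128_si32(sums) +
           _mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
}
```

On CPUs with the POPCNT instruction, extracting the two 64-bit halves and counting each with a hardware popcount is usually simpler for the total; the version above only needs SSE2.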