sse2 | 易学教程

What's the difference between logical SSE intrinsics?

Question: Is there any difference between the logical SSE intrinsics for different types? For example, for the OR operation there are three intrinsics, _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions: Is there any difference between using one intrinsic or another (with appropriate type casting)? Are there any hidden costs, such as longer execution in some specific situation? These intrinsics map to three different x86
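As a minimal sketch of the premise in the question, the following C code ORs the same two 128-bit values through all three intrinsics and compares the results bit for bit. The helper name or_variants_agree is hypothetical, and the example only checks that the results are identical; it says nothing about execution cost.

```c
#include <emmintrin.h>  /* SSE2: the three OR intrinsics and the cast intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: ORs a and b with _mm_or_si128, _mm_or_ps and _mm_or_pd.
   Writes the integer-domain result to out and returns 1 if all three results
   are bit-for-bit identical, 0 otherwise. */
static int or_variants_agree(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* The casts only reinterpret the register; they do not convert values. */
    __m128i r_int = _mm_or_si128(va, vb);
    __m128i r_ps  = _mm_castps_si128(
        _mm_or_ps(_mm_castsi128_ps(va), _mm_castsi128_ps(vb)));
    __m128i r_pd  = _mm_castpd_si128(
        _mm_or_pd(_mm_castsi128_pd(va), _mm_castsi128_pd(vb)));

    uint32_t o_ps[4], o_pd[4];
    _mm_storeu_si128((__m128i *)out, r_int);
    _mm_storeu_si128((__m128i *)o_ps, r_ps);
    _mm_storeu_si128((__m128i *)o_pd, r_pd);

    return memcmp(out, o_ps, sizeof o_ps) == 0 &&
           memcmp(out, o_pd, sizeof o_pd) == 0;
}
```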

SSE multiplication of 4 32-bit integers

Question: How do I multiply four 32-bit integers by another four integers? I didn't find any instruction that can do it.

Answer 1: If you need signed 32x32-bit integer multiplication, then the following example at software.intel.com looks like it should do what you want:

    static inline __m128i muly(const __m128i &a, const __m128i &b)
    {
        __m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/
        __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */
        return _mm_unpacklo_epi32(_mm_shuffle_epi32
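The excerpt above is cut off, so here is a self-contained sketch of one SSE2-only way to multiply four 32-bit lanes, written in plain C. It is a reconstruction under assumptions rather than the quoted original: the names mul4_epi32_sse2 and mul4_i32 are hypothetical, and it relies on the low 32 bits of a product being the same for signed and unsigned operands.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: lane-wise 32-bit multiply keeping the low 32 bits
   of each product, using only SSE2 instructions. */
static __m128i mul4_epi32_sse2(__m128i a, __m128i b)
{
    /* 64-bit products of lanes 0 and 2. */
    __m128i even = _mm_mul_epu32(a, b);
    /* Shift lanes 1 and 3 down into positions 0 and 2, then multiply them. */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));

    /* Move the low 32 bits of each 64-bit product into lanes 0 and 1. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));

    /* Interleave back into the original lane order: p0, p1, p2, p3. */
    return _mm_unpacklo_epi32(even, odd);
}

/* Hypothetical array wrapper, convenient for checking the result. */
static void mul4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mul4_epi32_sse2(va, vb));
}
```

Products that do not fit in 32 bits are truncated to their low 32 bits, which is the same result a wrapping 32-bit multiply would give.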

C: x86 Intel Intrinsics usage of _mm_log2_ps() -> error: incompatible type 'int'?

Question: I'm trying to apply log2 to a __m128 variable, like this:

    #include <immintrin.h>
    int main (void)
    {
        __m128 two_v = {2.0, 2.0, 2.0, 2.0};
        __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2)
        return 0;
    }

Trying to compile this returns this error:

    error: initializing '__m128' with an expression of incompatible type 'int'
    __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2)
           ^        ~~~~~~~~~~~~~~~~~~

How can I fix it?

Answer 1: The immintrin.h you are looking at and the immintrin.h used for compilation are

Fast copy every second byte to new memory area

Question: I need a fast way to copy every second byte to a new malloc'd memory area. I have a raw image with RGB data and 16 bits per channel (48-bit) and want to create an RGB image with 8 bits per channel (24-bit). Is there a faster method than copying bytewise? I don't know much about SSE2, but I suppose it's possible with SSE/SSE2.

Answer 1: Your RGB data is packed, so we don't actually have to care about pixel boundaries. The problem is just packing every other byte of an array. (At least within each
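The "pack every other byte" idea can be sketched with SSE2 as below. This is an illustration under assumptions, not the answer's own code: the function name copy_odd_bytes is hypothetical, and it assumes the wanted byte is the second (odd-indexed) byte of each 16-bit sample, which is the most significant byte for little-endian data.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies src[1], src[3], src[5], ... into dst.
   n_out is the number of output bytes; src must hold 2 * n_out bytes. */
static void copy_odd_bytes(const uint8_t *src, uint8_t *dst, size_t n_out)
{
    size_t i = 0;

    /* 32 input bytes -> 16 output bytes per iteration. */
    for (; i + 16 <= n_out; i += 16) {
        __m128i lo = _mm_loadu_si128((const __m128i *)(src + 2 * i));
        __m128i hi = _mm_loadu_si128((const __m128i *)(src + 2 * i + 16));

        /* Move the odd byte of each 16-bit lane into the low byte. */
        lo = _mm_srli_epi16(lo, 8);
        hi = _mm_srli_epi16(hi, 8);

        /* Every lane is now in 0..255, so the saturating pack is lossless. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }

    /* Scalar tail for the remaining bytes. */
    for (; i < n_out; ++i)
        dst[i] = src[2 * i + 1];
}
```

To keep the even-indexed byte instead, one option is to replace the shift with an AND against _mm_set1_epi16(0x00FF).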

Is SSE2 signed integer overflow undefined?

Question: Signed integer overflow is undefined in C and C++. But what about signed integer overflow within the individual fields of an __m128i? In other words, is this behavior defined in the Intel standards?

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <emmintrin.h>

    union SSE2 {
        __m128i m_vector;
        uint32_t m_dwords[sizeof(__m128i) / sizeof(uint32_t)];
    };

    int main()
    {
        union SSE2 reg = {_mm_set_epi32(INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX)};
        reg.m_vector = _mm_add_epi32(reg
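A small experiment along the lines of the question can be sketched as follows. The helper name add_lane0 is hypothetical. It relies on _mm_add_epi32 mapping to the PADDD instruction, which performs wraparound (non-saturating) addition on each 32-bit lane, so no C-level signed arithmetic overflows here.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: adds x and y inside an SSE2 register and returns
   lane 0 of the result. */
static int32_t add_lane0(int32_t x, int32_t y)
{
    __m128i sum = _mm_add_epi32(_mm_set1_epi32(x), _mm_set1_epi32(y));
    return _mm_cvtsi128_si32(sum);  /* extract the lowest 32-bit lane */
}
```

With this sketch, add_lane0(INT32_MAX, 1) yields INT32_MIN, which is the two's-complement wraparound result.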

How to vectorize a distance calculation using SSE2

Question: A and B are vectors of length N, where N could be in the range 20 to 200, say. I want to calculate the square of the distance between these vectors, i.e. d^2 = ||A-B||^2. So far I have:

    float* a = ...;
    float* b = ...;
    float d2 = 0;
    for(int k = 0; k < N; ++k)
    {
        float d = a[k] - b[k];
        d2 += d * d;
    }

That seems to work fine, except that I have profiled my code and this is the bottleneck (more than 50% of the time is spent just doing this). I am using Visual Studio 2012, on Win 7, with these
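One way the loop in the question could be vectorized by hand is sketched below. The function name squared_distance is hypothetical, and the sketch uses unaligned loads so that it makes no assumption about how a and b were allocated.

```c
#include <xmmintrin.h>  /* SSE */
#include <stddef.h>

/* Hypothetical helper: returns the sum over k of (a[k] - b[k])^2. */
static float squared_distance(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t k = 0;

    /* Four differences are squared and accumulated per iteration. */
    for (; k + 4 <= n; k += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(a + k), _mm_loadu_ps(b + k));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }

    /* Horizontal sum of the four partial sums. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float d2 = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar tail when n is not a multiple of 4. */
    for (; k < n; ++k) {
        float d = a[k] - b[k];
        d2 += d * d;
    }
    return d2;
}
```

Because the additions happen in a different order than in the scalar loop, the result can differ from it in the last few bits.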

how to deinterleave image channel in SSE

Question: Is there any way to de-interleave 32bpp image channels, similar to the NEON code below?

    //Read all r,g,b,a pixels into 4 registers
    uint8x8x4_t SrcPixels8x8x4= vld4_u8(inPixel32);
    ChannelR1_32x4 = vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[0]))),
    channelR2_32x4 = vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[0]))), vGaussElement_32x4_high);

Basically, I want all color channels in separate vectors, with every vector having 4 elements of 32 bits, to do some calculation, but I am not very
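A possible SSE2 counterpart for four pixels at a time is sketched below. The function name deinterleave_rgba4 is hypothetical, and the sketch assumes the bytes are stored in R, G, B, A order in memory. Since each 32-bit lane already holds one whole pixel, shifts and masks are enough to widen each channel to 32 bits.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: splits 4 RGBA pixels (16 bytes at px) into four
   vectors, each holding one channel as 4 zero-extended 32-bit values. */
static void deinterleave_rgba4(const uint8_t *px,
                               __m128i *r, __m128i *g, __m128i *b, __m128i *a)
{
    const __m128i mask = _mm_set1_epi32(0xFF);
    __m128i v = _mm_loadu_si128((const __m128i *)px);

    /* On little-endian x86, byte 0 of each pixel is the low byte of its lane. */
    *r = _mm_and_si128(v, mask);
    *g = _mm_and_si128(_mm_srli_epi32(v, 8), mask);
    *b = _mm_and_si128(_mm_srli_epi32(v, 16), mask);
    *a = _mm_srli_epi32(v, 24);  /* the logical shift already clears the upper bits */
}
```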

Auto-vectorization in visual studio 2012 on vectors of Eigen type is not performing well

Question: I have a std::vector of Eigen::vector3d types, and I am compiling this code with Microsoft Visual Studio 2012 with the /Qvec-report:2 flag on to report vectorization details. It shows "Loop not vectorized due to reason 1304 (Loop contains assignments that are of different types)", as specified on the MSDN page https://msdn.microsoft.com/en-us/library/jj658585.aspx

My code is as below:

    #include <iostream>
    #include <vector>
    #include <time.h>
    #include<Eigen/StdVector>
    int main(char

can't find materials about SSE2, Altivec, VMX on apple developer

Question: As Paul R suggested, there are plenty of resources about SSE2 and AVX on Apple Developer, but I couldn't find them. Could anyone help me? BTW, I am also looking for the archive of the AltiVec mailing list. Thanks! Intel SSE and AVX Examples and Tutorials

Source: https://stackoverflow.com/questions/22978362/cant-find-materials-about-sse2-altivec-vmx-on-apple-developer

Why can't I use _mm_sin_pd? [duplicate]

Question: This question already has answers here: C++ error: ‘_mm_sin_ps’ was not declared in this scope (3 answers); how can I use SVML instructions [duplicate] (1 answer); Where is Clang's '_mm256_pow_ps' intrinsic? (1 answer). Closed 11 months ago.

The specification says:

    __m128d _mm_sin_pd (__m128d a)
    #include <immintrin.h>
    CPUID Flags: SSE
    Description: Compute the sine of packed double-precision (64-bit) floating-point elements in a expressed in radians, and store the results in dst.

But it seems it is not