mmx | 易学教程

Are different mmx, sse and avx versions complementary or supersets of each other?

I'm thinking I should familiarize myself with x86 SIMD extensions, but before I even began I ran into trouble: I can't find a good overview of which of them are still relevant. The x86 architecture has accumulated a lot of math/multimedia extensions over the decades: MMX, 3DNow!, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512. Did I forget something? Are the newer ones supersets of the older ones and vice versa? Or are they complementary? Are some of them deprecated? Which of these are still relevant? I've heard references to "legacy SSE". Are some of them mutually exclusive? I.e. do they share the same

SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers

Using SSE intrinsics, I've gotten a vector of four 32-bit floats clamped to the range 0-255 and rounded to the nearest integer. I'd now like to write those four out as bytes. There is an intrinsic, _mm_cvtps_pi8, that converts 32-bit floats to 8-bit signed ints, but the problem there is that any value over 127 gets clamped to 127. I can't find any instruction that clamps to unsigned 8-bit values. I have an intuition that what I may want to do is some combination of _mm_cvtps_pi16 and _mm_shuffle_pi8 followed by a move instruction to get the four bytes I care about into memory. Is that the best way to
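One possible approach, sketched here with SSE2 integer packs rather than the MMX-register intrinsics named in the question: convert to 32-bit integers, narrow to 16 bits with signed saturation, then narrow to 8 bits with unsigned saturation. The helper name floats_to_u8x4 and the HAVE_SSE2 macro are made up for illustration, and the scalar branch is only an assumed fallback for builds without SSE2. This is a sketch, not necessarily the fastest sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2 1    /* hypothetical macro used only in this sketch */
#endif

/* Hypothetical helper: converts 4 floats to 4 unsigned bytes,
 * saturating anything outside [0, 255]. */
void floats_to_u8x4(const float *src, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
#ifdef HAVE_SSE2
    __m128  f   = _mm_loadu_ps(src);
    /* Rounds using the current MXCSR mode (round-to-nearest-even by default). */
    __m128i i32 = _mm_cvtps_epi32(f);
    /* 32-bit -> 16-bit with signed saturation. */
    __m128i i16 = _mm_packs_epi32(i32, i32);
    /* 16-bit -> 8-bit with UNSIGNED saturation: negatives become 0, >255 becomes 255. */
    __m128i u8  = _mm_packus_epi16(i16, i16);
    /* The low 32 bits now hold the four result bytes. */
    int32_t packed = _mm_cvtsi128_si32(u8);
    memcpy(dst, &packed, 4);
#else
    /* Assumed scalar fallback: clamps, then rounds half up, so it can differ
     * from the SSE2 path on exact .5 ties. */
    for (int k = 0; k < 4; ++k) {
        float v = src[k];
        if (v < 0.0f) v = 0.0f;
        if (v > 255.0f) v = 255.0f;
        dst[k] = (uint8_t)(int)(v + 0.5f);
    }
#endif
}
```

Because the final pack saturates to the unsigned range, the input does not strictly need to be clamped beforehand, although clamping first does no harm.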

Converting a C++ project to x64 with __m64 references

When I started the conversion and set the target to 'x64', I got 7 unresolved externals. Two examples:

    error LNK2001: unresolved external symbol _m_empty ...CONVOLUTION_2D_USHORT.obj CONVOLUTION_2D_USHORT
    error LNK2001: unresolved external symbol _mm_setzero_si64 ...CONVOLUTION_2D_USHORT.obj CONVOLUTION_2D_USHORT

So I tried investigating these a bit deeper, and I found that it doesn't like the __m64 inside the header files, specifically mmintrin.h (there might be others). In my amateur hour with C++, because I haven't messed with the language in years (I'm usually in the C# department), I
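For context, MSVC does not provide the MMX intrinsics (the __m64 family) when targeting x64, so code like this is usually ported to SSE2 and its 128-bit __m128i type. The sketch below is not the asker's convolution code; add_u16_saturating and HAVE_SSE2 are hypothetical names, and the saturating add merely stands in for whatever the MMX loop computed. It shows the general shape of such a port: _mm_setzero_si64 has the SSE2 counterpart _mm_setzero_si128, and _m_empty / _mm_empty is simply dropped because XMM registers do not alias the x87 state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics; replaces mmintrin.h */
#define HAVE_SSE2 1    /* hypothetical macro used only in this sketch */
#endif

/* Hypothetical stand-in for an MMX loop: dst[i] = min(a[i] + b[i], 65535). */
void add_u16_saturating(const uint16_t *a, const uint16_t *b,
                        uint16_t *dst, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && dst != NULL));
    size_t i = 0;
#ifdef HAVE_SSE2
    /* 8 unsigned shorts per iteration, versus 4 per __m64 with MMX. */
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu16(va, vb));
    }
    /* No _mm_empty() call is needed after SSE2 code. */
#endif
    /* Scalar tail; also the whole loop when SSE2 is unavailable. */
    for (; i < n; ++i) {
        uint32_t s = (uint32_t)a[i] + (uint32_t)b[i];
        dst[i] = (uint16_t)(s > 0xFFFFu ? 0xFFFFu : s);
    }
}
```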

SIMD prefix sum on Intel cpu

I need to implement a prefix sum algorithm and need it to be as fast as possible. For example, [3, 1, 7, 0, 4, 1, 6, 3] should give [3, 4, 11, 11, 15, 16, 22, 25]. Is there a way to do this using SSE/MMX/SIMD CPU instructions? My first idea is to sum each pair in parallel, recursively, until all sums have been computed, like below:

    // in parallel do
    for (int i = 0; i < z.length; i++) {
        z[i] = x[i << 1] + x[(i << 1) + 1];
    }

To make the algorithm a little clearer, z is not the final output but is instead used to compute the output:

    int[] w = computePrefixSum(z);
    for (int i = 1; i < output.length; i++) {
        output[i] =
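As a point of comparison, here is a minimal SSE2 sketch of an inclusive prefix sum. Note that it does not implement the pairwise recursion described in the question; it uses a different, commonly seen technique: a shift-and-add scan inside each 4-element vector, plus a carry that is broadcast from one vector to the next. The names prefix_sum_i32 and HAVE_SSE2 are hypothetical, integer overflow is not handled, and the scalar loop doubles as the fallback when SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2 1    /* hypothetical macro used only in this sketch */
#endif

/* Hypothetical helper: out[i] = in[0] + in[1] + ... + in[i]. */
void prefix_sum_i32(const int32_t *in, int32_t *out, size_t n)
{
    assert(n == 0 || (in != NULL && out != NULL));
    size_t i = 0;
    int32_t running = 0;
#ifdef HAVE_SSE2
    __m128i carry = _mm_setzero_si128(); /* total of all earlier elements */
    for (; i + 4 <= n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)(in + i));
        /* [a, b, c, d] -> [a, a+b, b+c, c+d] */
        x = _mm_add_epi32(x, _mm_slli_si128(x, 4));
        /* -> [a, a+b, a+b+c, a+b+c+d] */
        x = _mm_add_epi32(x, _mm_slli_si128(x, 8));
        /* Add the sum of everything before this vector. */
        x = _mm_add_epi32(x, carry);
        _mm_storeu_si128((__m128i *)(out + i), x);
        /* Broadcast the last lane as the carry for the next vector. */
        carry = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 3, 3));
    }
    running = _mm_cvtsi128_si32(carry);
#endif
    /* Scalar tail; also the whole loop when SSE2 is unavailable. */
    for (; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}
```

The carry creates a serial dependency between vectors, so the speedup over a plain scalar loop may be modest; measuring on the target machine is the only reliable guide.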