NEON, SSE and interleaving loads vs shuffles

Submitted by 独自空忆成欢 on 2019-12-19 04:55:14

Question


I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics:

... why you don't use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available.

The trouble I am having is that the solution there offers code that is non-interleaved and also performs fused multiplies on floating-point values. I'm trying to separate the two and understand just the interleaved loads.

According to the other question's comment and Coding for NEON - Part 1: Load and Stores, the answer is probably going to use VLD3.

Unfortunately, I'm just not seeing it (probably because I'm less familiar with NEON and its intrinsic functions). It seems like VLD3 basically produces 3 outputs for each input, so my mental model is confused.

Given the following SSE intrinsics that operate on data in BGR BGR BGR BGR... format that needs a shuffle for BBBB GGGG RRRR ...:

const byte* data = ...;  // assume 16-byte aligned
// Gather every third byte: B bytes first, then G, then R
const __m128i mask = _mm_setr_epi8(0,3,6,9,12,15,1,4,7,10,13,2,5,8,11,14);
__m128i a = _mm_shuffle_epi8(_mm_load_si128((const __m128i*)(data)),mask);
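
As a reading aid, here is a minimal scalar sketch of what that shuffle does, assuming every mask byte is in 0..15 (the real _mm_shuffle_epi8 also zeroes a lane when the mask byte's top bit is set, which this mask never uses). The names shuffle16_model and kBgrMask are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same byte indices as the mask in the question. */
const uint8_t kBgrMask[16] = {0,3,6,9,12,15, 1,4,7,10,13, 2,5,8,11,14};

/* Scalar model of a byte shuffle: out[i] = in[mask[i]].
   Assumption: all mask entries are in 0..15 (no lane zeroing). */
void shuffle16_model(const uint8_t in[16], const uint8_t mask[16], uint8_t out[16])
{
    for (size_t i = 0; i < 16; ++i) {
        assert(mask[i] < 16);
        out[i] = in[mask[i]];
    }
}
```

With 16 bytes of B G R B G R ... as input, the output is six B bytes, then five G bytes, then five R bytes. A single 16-byte load therefore only groups the channels; it does not by itself yield full BBBB, GGGG and RRRR vectors.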

How do we perform the interleaved loads using NEON intrinsics so that we don't need the SSE shuffles?


Also note... I'm interested in intrinsics and not ASM. I can use ARM's intrinsics on Windows Phone, Windows Store, and Linux powered devices under MSVC, ICC, Clang, etc. I can't do that with ASM, and I'm not trying to specialize the code 3 times (Microsoft 32-bit ASM, Microsoft 64-bit ASM and GCC ASM).


Answer 1:


According to this page:

The VLD3 intrinsic you need is:

int8x8x3_t  vld3_s8(__transfersize(24) int8_t const * ptr);
// VLD3.8 {d0, d1, d2}, [r0]

If the address pointed to by ptr holds this data:

0x00: 33221100
0x04: 77665544
0x08: bbaa9988
0x0c: ffddccbb
0x10: 76543210
0x14: fedcba98

You will end up with this in the registers:

d0: ba54ffbb99663300
d1: dc7610ccaa774411
d2: fe9832ddbb885522
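
The mapping in this example can be checked with a small scalar model: lane i of register k receives byte 3*i + k. This is only a sketch of the load's indexing, not NEON code. vld3_model is a hypothetical name, and each register is packed as a 64-bit value with lane 0 in the least significant byte, which appears to be how the values above are written.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 3-way de-interleaving load of 24 bytes.
   d[k] collects bytes k, k+3, k+6, ... with lane 0 in the low byte. */
void vld3_model(const uint8_t *src, uint64_t d[3])
{
    assert(src != 0 && d != 0);
    for (int k = 0; k < 3; ++k) {
        d[k] = 0;
        for (int i = 0; i < 8; ++i)
            d[k] |= (uint64_t)src[3 * i + k] << (8 * i);
    }
}
```

Feeding it the 24 bytes above in little-endian memory order (00 11 22 33 44 ...) reproduces the three register values listed.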

The int8x8x3_t structure is defined as:

struct int8x8x3_t
{
   int8x8_t val[3];
};
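
Putting this together for the BGR case in the question, here is a hedged sketch. vld3q_u8 is the 16-lane unsigned sibling of the intrinsic above and loads 48 bytes into three q registers. split_bgr16 is a hypothetical helper name, and the __ARM_NEON guard with a scalar fallback is an assumption so that the snippet also builds on non-ARM hosts; MSVC may need a different feature check.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split 16 interleaved BGR pixels (48 bytes)
   into three 16-byte planes. */
void split_bgr16(const uint8_t *bgr, uint8_t *b, uint8_t *g, uint8_t *r)
{
    assert(bgr && b && g && r);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16x3_t px = vld3q_u8(bgr);  /* de-interleaving load */
    vst1q_u8(b, px.val[0]);           /* all B bytes */
    vst1q_u8(g, px.val[1]);           /* all G bytes */
    vst1q_u8(r, px.val[2]);           /* all R bytes */
#else
    /* Scalar fallback with the same result, for non-NEON builds. */
    for (int i = 0; i < 16; ++i) {
        b[i] = bgr[3 * i + 0];
        g[i] = bgr[3 * i + 1];
        r[i] = bgr[3 * i + 2];
    }
#endif
}
```

After the load, val[0] holds 16 blue bytes, val[1] 16 green bytes and val[2] 16 red bytes, so no shuffle is needed before operating on each channel.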


Source: https://stackoverflow.com/questions/37106500/neon-sse-and-interleaving-loads-vs-shuffles
