Question
I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics:
... why don't you use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available.
The trouble I am having is that the solution there offers code that is non-interleaved, and it performs fused multiplies on floating-point values. I'm trying to separate the two and understand just the interleaved loads.
According to the other question's comment and Coding for NEON - Part 1: Load and Stores, the answer is probably going to use VLD3.
Unfortunately, I'm just not seeing it (probably because I'm less familiar with NEON and its intrinsic functions). It seems like VLD3 basically produces 3 outputs for each input, so my mental model is confused.
Consider the following SSE intrinsics, which operate on data in BGR BGR BGR BGR... format and need a shuffle to get BBBB GGGG RRRR ...:
const byte* data = ... // assume 16-byte aligned
const __m128i mask = _mm_setr_epi8(0,3,6,9,12,15,1,4,7,10,13,2,5,8,11,14);
__m128i a = _mm_shuffle_epi8(_mm_load_si128((__m128i*)(data)),mask);
How do we perform the interleaved loads using NEON intrinsics so that we don't need the SSE shuffles?
Also note... I'm interested in intrinsics and not ASM. I can use ARM's intrinsics on Windows Phone, Windows Store, and Linux powered devices under MSVC, ICC, Clang, etc. I can't do that with ASM, and I'm not trying to specialize the code 3 times (Microsoft 32-bit ASM, Microsoft 64-bit ASM and GCC ASM).
Answer 1:
According to this page, the VLD3 intrinsic you need is:
int8x8x3_t vld3_s8(__transfersize(24) int8_t const * ptr);
// VLD3.8 {d0, d1, d2}, [r0]
If the address pointed to by ptr holds this data (shown as little-endian 32-bit words, so the byte at 0x00 is 0x00 and the byte at 0x01 is 0x11):
0x00: 33221100
0x04: 77665544
0x08: bbaa9988
0x0c: ffddccbb
0x10: 76543210
0x14: fedcba98
You will end up with the following in the registers (lane 0 is the rightmost byte):
d0: ba54ffbb99663300
d1: dc7610ccaa774411
d2: fe9832ddbb885522
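To make the "3 outputs for each input" behaviour concrete, here is a minimal scalar sketch of what the load does. It is only a model, not how NEON is implemented: lane i of register k receives byte 3*i + k. The names vld3_example_bytes, vld3_model and pack_lanes are hypothetical helpers invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 24 bytes from the dump above, in address order (0x00 .. 0x17). */
static const uint8_t vld3_example_bytes[24] = {
    0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
    0x88, 0x99, 0xaa, 0xbb, 0xbb, 0xcc, 0xdd, 0xff,
    0x10, 0x32, 0x54, 0x76, 0x98, 0xba, 0xdc, 0xfe
};

/* Hypothetical scalar model of VLD3.8 {d0, d1, d2}, [r0]:
 * lane i of register k receives source byte 3*i + k. */
static void vld3_model(const uint8_t *src, uint8_t d[3][8])
{
    assert(src != NULL);
    for (size_t i = 0; i < 8; ++i)
        for (size_t k = 0; k < 3; ++k)
            d[k][i] = src[3 * i + k];
}

/* Pack eight lanes into one 64-bit value with lane 0 in the lowest byte,
 * which matches how the register contents are printed above. */
static uint64_t pack_lanes(const uint8_t lanes[8])
{
    uint64_t v = 0;
    for (size_t i = 0; i < 8; ++i)
        v |= (uint64_t)lanes[i] << (8 * i);
    return v;
}
```

Calling vld3_model(vld3_example_bytes, d) and then pack_lanes on d[0], d[1] and d[2] should reproduce the d0, d1 and d2 values listed above. For BGR data this means d0 holds the B bytes, d1 the G bytes and d2 the R bytes.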
The int8x8x3_t structure is defined as:
struct int8x8x3_t
{
    int8x8_t val[3];
};
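For the unsigned BGR bytes in the question, the unsigned variants vld3_u8 (8 pixels, 24 bytes) and vld3q_u8 (16 pixels, 48 bytes) are probably the closer fit. The following is a rough sketch under some assumptions: split_bgr16 is a hypothetical helper name, the NEON path is selected only when the compiler defines __ARM_NEON or __ARM_NEON__ (other toolchains may need a different check), and a scalar fallback is included so the sketch also builds on non-ARM machines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Hypothetical helper: split 16 packed BGR pixels (48 bytes) into
 * three 16-byte planes b, g and r. */
static void split_bgr16(const uint8_t *bgr, uint8_t *b, uint8_t *g, uint8_t *r)
{
    assert(bgr != NULL && b != NULL && g != NULL && r != NULL);
#if SKETCH_HAVE_NEON
    /* One deinterleaving load replaces the SSE load + shuffle:
     * val[0] = all B bytes, val[1] = all G bytes, val[2] = all R bytes. */
    uint8x16x3_t px = vld3q_u8(bgr);
    vst1q_u8(b, px.val[0]);
    vst1q_u8(g, px.val[1]);
    vst1q_u8(r, px.val[2]);
#else
    /* Scalar fallback with the same result, for non-NEON builds. */
    for (size_t i = 0; i < 16; ++i) {
        b[i] = bgr[3 * i + 0];
        g[i] = bgr[3 * i + 1];
        r[i] = bgr[3 * i + 2];
    }
#endif
}
```

In real code you would likely keep px.val[0], px.val[1] and px.val[2] in registers and operate on them directly rather than storing them straight back to memory. The stores here only make the result easy to inspect.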
Source: https://stackoverflow.com/questions/37106500/neon-sse-and-interleaving-loads-vs-shuffles