I have a pointer to an array of bytes `mixed` that contains the interleaved bytes of two distinct arrays, `array1` and `array2`. Say mi
For x86 SSE, the `pack` and `punpck` instructions are what you need. The examples use AVX for the convenience of non-destructive 3-operand instructions. I'm not using AVX2 256b-wide instructions, because the 256b pack/unpack instructions do two separate 128b operations in the low and high 128b lanes, so you'd need an extra shuffle to get things in the correct final order.
An intrinsics version of the following would work the same. Asm instructions are shorter to type for just writing a quick answer.
Interleave: `abcd` and `1234` -> `a1b2c3d4`:

    # loop body (AT&T operand order: src2, src1, dst):
    vmovdqu    (%rax), %xmm0          # load the sources: xmm0 = abcd...
    vmovdqu    (%rbx), %xmm1          #                   xmm1 = 1234...
    vpunpcklbw %xmm1, %xmm0, %xmm2    # low halves  -> 128b reg (a1b2...)
    vpunpckhbw %xmm1, %xmm0, %xmm3    # high halves -> 128b reg
    vmovdqu    %xmm2, (%rdi)          # store the results
    vmovdqu    %xmm3, 16(%rdi)
    # blah blah some loop structure.

`punpcklbw` interleaves the bytes in the low 64 bits of the two source `xmm` registers, and `punpckhbw` does the same for the high 64 bits. There are also `..wd` (word->dword) and `..dq` (dword->qword) versions, which would be useful for 16 or 32bit elements.
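As a rough intrinsics equivalent of the interleave loop, here is a minimal sketch using SSE2 only. The function name `interleave_bytes` and its signature are made up for illustration, and it assumes the element count is a multiple of 16 (real code would need a scalar tail loop for the leftovers).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: interleave n bytes of a[] and b[] into out[2*n].
   Assumption: n is a multiple of 16. */
static void interleave_bytes(const uint8_t *a, const uint8_t *b,
                             uint8_t *out, size_t n)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* punpcklbw: a0 b0 a1 b1 ... a7 b7 */
        __m128i lo = _mm_unpacklo_epi8(va, vb);
        /* punpckhbw: a8 b8 a9 b9 ... a15 b15 */
        __m128i hi = _mm_unpackhi_epi8(va, vb);
        _mm_storeu_si128((__m128i *)(out + 2 * i), lo);
        _mm_storeu_si128((__m128i *)(out + 2 * i + 16), hi);
    }
}
```

The first argument of `_mm_unpacklo_epi8` supplies the bytes that land at even offsets of the result, so swapping `va` and `vb` swaps which array comes first in the output.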
De-interleave: `a1b2c3d4` -> `abcd` and `1234`:

    # outside the loop
    vpcmpeqb   %xmm5, %xmm5, %xmm5    # set to all-1s
    vpsrlw     $8, %xmm5, %xmm5       # every 16b word has low 8b = 0xFF, high 8b = 0.

    # loop body (AT&T operand order: src2, src1, dst)
    vmovdqu    (%rsi), %xmm2          # load two src chunks
    vmovdqu    16(%rsi), %xmm3
    vpand      %xmm2, %xmm5, %xmm0    # mask to keep only the low byte of each word (the a[] bytes)
    vpand      %xmm3, %xmm5, %xmm1
    vpackuswb  %xmm1, %xmm0, %xmm4    # low 8 bytes from xmm0 (first chunk), high 8 from xmm1
    vmovdqu    %xmm4, (%rax)          # store 16B of a[]
    vpsrlw     $8, %xmm2, %xmm6       # high byte of each word -> low byte (the b[] bytes)
    vpsrlw     $8, %xmm3, %xmm7
    vpackuswb  %xmm7, %xmm6, %xmm4
    vmovdqu    %xmm4, (%rbx)          # store 16B of b[]
This can of course use a lot fewer registers. I avoided reusing registers for readability, not performance. Hardware register renaming makes reuse a non-issue, as long as you start with something that doesn't depend on the previous value (e.g. `movd`, not `movss` or `pinsrd`).
Deinterleave is so much more work because the `pack` instructions do signed or unsigned saturation, so the upper 8b of each 16b element has to be zeroed first.
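The mask-then-pack approach can be sketched with SSE2 intrinsics roughly as follows. The function name `deinterleave_bytes` and its signature are hypothetical, and the sketch assumes the per-array element count is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split mixed[2*n] into a[n] (even offsets) and
   b[n] (odd offsets). Assumption: n is a multiple of 16. */
static void deinterleave_bytes(const uint8_t *mixed,
                               uint8_t *a, uint8_t *b, size_t n)
{
    assert(n % 16 == 0);
    /* Every 16-bit word = 0x00FF: keeps the low byte, zeroes the high byte. */
    const __m128i low_byte = _mm_set1_epi16(0x00FF);
    for (size_t i = 0; i < n; i += 16) {
        __m128i v0 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i + 16));
        /* Words are now in 0..255, so packuswb's unsigned saturation
           never changes a value. */
        __m128i va = _mm_packus_epi16(_mm_and_si128(v0, low_byte),
                                      _mm_and_si128(v1, low_byte));
        /* The logical right shift also leaves each word in 0..255. */
        __m128i vb = _mm_packus_epi16(_mm_srli_epi16(v0, 8),
                                      _mm_srli_epi16(v1, 8));
        _mm_storeu_si128((__m128i *)(a + i), va);
        _mm_storeu_si128((__m128i *)(b + i), vb);
    }
}
```

Without the `and` or the shift, any word with a non-zero high byte would be out of range for an unsigned byte and would saturate (to 0xFF, or to 0 if its top bit is set) instead of yielding the byte you want.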
An alternative would be to use `pshufb` to pack the odd or even bytes of a single source reg into the low 64 bits of a register. However, outside of the AMD XOP instruction set's `VPPERM`, there isn't a shuffle that can select bytes from 2 registers at once (like Altivec's much-loved `vperm`). So with just SSE/AVX, you'd need 2 shuffles for every 128b of interleaved data. And since store-port usage could be the bottleneck, you'd probably also want a `punpck` to combine two 64bit chunks of `a` into a single register to set up a 128b store.
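A sketch of that `pshufb` variant with intrinsics might look like the following. It assumes GCC or Clang on x86-64 (the `target("ssse3")` function attribute is a compiler extension), the name `deinterleave_bytes_pshufb` is made up, and the per-array element count is assumed to be a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Hypothetical helper: same contract as a mask+pack deinterleave, but
   using two pshufb per 16 bytes of input plus punpcklqdq to build full
   128-bit stores. Assumption: n is a multiple of 16. */
__attribute__((target("ssse3")))
static void deinterleave_bytes_pshufb(const uint8_t *mixed,
                                      uint8_t *a, uint8_t *b, size_t n)
{
    assert(n % 16 == 0);
    /* Control bytes with the high bit set (-1) zero the output byte. */
    const __m128i even = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                       -1, -1, -1, -1, -1, -1, -1, -1);
    const __m128i odd  = _mm_setr_epi8(1, 3, 5, 7, 9, 11, 13, 15,
                                       -1, -1, -1, -1, -1, -1, -1, -1);
    for (size_t i = 0; i < n; i += 16) {
        __m128i v0 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i + 16));
        /* Each shuffle packs 8 wanted bytes into the low 64 bits. */
        __m128i a_lo = _mm_shuffle_epi8(v0, even);
        __m128i a_hi = _mm_shuffle_epi8(v1, even);
        __m128i b_lo = _mm_shuffle_epi8(v0, odd);
        __m128i b_hi = _mm_shuffle_epi8(v1, odd);
        /* punpcklqdq: combine two 64-bit halves into one 128-bit store. */
        _mm_storeu_si128((__m128i *)(a + i), _mm_unpacklo_epi64(a_lo, a_hi));
        _mm_storeu_si128((__m128i *)(b + i), _mm_unpacklo_epi64(b_lo, b_hi));
    }
}
```

This is four shuffles plus two unpacks per 32 bytes of input, so whether it beats the mask+pack version depends on the shuffle throughput of the target CPU.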
With AMD XOP, deinterleave would be 2x128b loads, 2 `VPPERM`, and 2x128b stores.