Fastest de-interleave operation in C?

一个人的身影 2021-01-02 00:30

I have a pointer to an array of bytes mixed that contains the interleaved bytes of two distinct arrays array1 and array2. Say mi

6 answers
  •  醉梦人生
    2021-01-02 00:56

    For x86 SSE, the pack and punpck instructions are what you need. The examples below use AVX for the convenience of non-destructive 3-operand instructions, but only at 128b width: the AVX2 256b-wide pack/punpck instructions do two separate 128b operations in the low and high 128b lanes, so you'd need an extra shuffle to get things into the correct final order.

    An intrinsics version of the following would work the same. Asm instructions are shorter to type for just writing a quick answer.

    Interleave: abcd and 1234 -> a1b2c3d4:

    # loop body:
    vmovdqu    (%rax), %xmm0  # load the sources
    vmovdqu    (%rbx), %xmm1
    vpunpcklbw %xmm0, %xmm1, %xmm2  # low  halves -> 128b reg
    vpunpckhbw %xmm0, %xmm1, %xmm3  # high halves -> 128b reg
    vmovdqu    %xmm2, (%rdi)   # store the results
    vmovdqu    %xmm3, 16(%rdi)
    # blah blah some loop structure.
    
    `punpcklbw` interleaves the bytes in the low 64 bits of the two source `xmm` registers, and `punpckhbw` does the same for the high 64 bits. There are also word->dword (`punpcklwd`) and dword->qword (`punpckldq`) versions, which would be useful for 16-bit or 32-bit elements.
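
For reference, here is a minimal intrinsics sketch of the same interleave in C, using only SSE2. The function name `interleave_bytes` and its signature are hypothetical (they are not from the question), and the sketch assumes `n` is a multiple of 16, so there is no tail handling.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[2*i] = a[i], out[2*i + 1] = b[i].
   Assumes n is a multiple of 16; a real version needs a scalar tail loop. */
static void interleave_bytes(const uint8_t *a, const uint8_t *b,
                             uint8_t *out, size_t n)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
        /* Unaligned loads of 16 bytes from each source. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* punpcklbw: a0 b0 a1 b1 ... a7 b7 */
        __m128i lo = _mm_unpacklo_epi8(va, vb);
        /* punpckhbw: a8 b8 a9 b9 ... a15 b15 */
        __m128i hi = _mm_unpackhi_epi8(va, vb);
        _mm_storeu_si128((__m128i *)(out + 2 * i), lo);
        _mm_storeu_si128((__m128i *)(out + 2 * i + 16), hi);
    }
}
```

With intrinsics the first argument of `_mm_unpacklo_epi8` supplies byte 0 of the result, so the argument order decides which array lands in the even positions.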
    

    De-interleave: a1b2c3d4 -> abcd and 1234

    #outside the loop
    vpcmpeqb    %xmm5, %xmm5, %xmm5   # set to all-1s
    vpsrlw     $8, %xmm5, %xmm5   # every 16b word has low 8b = 0xFF, high 8b = 0.
    
    # loop body
    vmovdqu    (%rsi), %xmm2     # load two src chunks
    vmovdqu    16(%rsi), %xmm3
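    vpand      %xmm2, %xmm5, %xmm0  # mask to keep only the low byte of each 16b word (the a[] bytes)
    vpand      %xmm3, %xmm5, %xmm1
    vpackuswb  %xmm1, %xmm0, %xmm4  # AT&T order: xmm0 (first chunk) -> low half, xmm1 -> high half
    vmovdqu    %xmm4, (%rax)    # store 16B of a[]
    vpsrlw     $8, %xmm2, %xmm6     # shift the high byte of each word (the other array's bytes) down to the low byte
    vpsrlw     $8, %xmm3, %xmm7
    vpackuswb  %xmm7, %xmm6, %xmm4
    vmovdqu    %xmm4, (%rbx)    # store 16B of the second array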
    

    This can of course use a lot fewer registers. I avoided reusing registers for readability, not performance. Hardware register renaming makes reuse a non-issue, as long as you start with something that doesn't depend on the previous value. (e.g. movd, not movss or pinsrd.)

    Deinterleave is so much more work because the pack instructions do signed or unsigned saturation, so the upper 8b of each 16b element has to be zeroed first.
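
The mask-then-pack approach can be sketched with SSE2 intrinsics roughly as follows. The name `deinterleave_bytes` and its signature are hypothetical, and the sketch assumes `n` (the length of each output array) is a multiple of 16. Every 16-bit word is reduced to the range 0..255 before packing, so the saturation in `_mm_packus_epi16` never changes a value.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: a[i] = mixed[2*i], b[i] = mixed[2*i + 1].
   Assumes n is a multiple of 16; a real version needs a scalar tail loop. */
static void deinterleave_bytes(const uint8_t *mixed,
                               uint8_t *a, uint8_t *b, size_t n)
{
    assert(n % 16 == 0);
    /* Every 16-bit word = 0x00FF: keeps the low byte, clears the high byte. */
    const __m128i low_mask = _mm_set1_epi16(0x00FF);
    for (size_t i = 0; i < n; i += 16) {
        /* Two 16-byte chunks hold 16 (a, b) pairs. */
        __m128i v0 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i + 16));
        /* Low byte of each word is an a[] byte; zero the high byte. */
        __m128i a0 = _mm_and_si128(v0, low_mask);
        __m128i a1 = _mm_and_si128(v1, low_mask);
        /* High byte of each word is a b[] byte; shift it down (zero-fills). */
        __m128i b0 = _mm_srli_epi16(v0, 8);
        __m128i b1 = _mm_srli_epi16(v1, 8);
        /* packuswb: words already fit in 0..255, so saturation is a no-op. */
        _mm_storeu_si128((__m128i *)(a + i), _mm_packus_epi16(a0, a1));
        _mm_storeu_si128((__m128i *)(b + i), _mm_packus_epi16(b0, b1));
    }
}
```

The first argument of `_mm_packus_epi16` ends up in the low 8 bytes of the result, so the chunk loaded from the lower address must come first.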

    An alternative would be to use pshufb to pack the odd or even bytes of a single source reg into the low 64 bits of a register. However, outside of the AMD XOP instruction set's VPPERM, there isn't a shuffle that can select bytes from 2 registers at once (like Altivec's much-loved vperm). So with just SSE/AVX, you'd need 2 shuffles for every 128b of interleaved data. And since store-port usage could be the bottleneck, you'd also want a punpck (e.g. punpcklqdq) to combine two 64-bit chunks of a into a single register to set up a 128b store.
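
One possible way to sketch the pshufb idea in C is shown below. It is a variation, not necessarily what the answer had in mind: a single `pshufb` per source register gathers the a bytes into the low 64 bits and the other array's bytes into the high 64 bits, then `punpcklqdq` / `punpckhqdq` combine the halves for full 128b stores. The name `deinterleave_pshufb` is hypothetical, `n` is assumed to be a multiple of 16, and the `target("ssse3")` attribute assumes GCC or Clang (with other compilers, drop it and enable SSSE3 through build flags).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Hypothetical helper: a[i] = mixed[2*i], b[i] = mixed[2*i + 1].
   Assumes n is a multiple of 16 and a CPU with SSSE3. */
__attribute__((target("ssse3")))
static void deinterleave_pshufb(const uint8_t *mixed,
                                uint8_t *a, uint8_t *b, size_t n)
{
    assert(n % 16 == 0);
    /* Control vector: even-indexed source bytes go to the low 8 bytes,
       odd-indexed source bytes go to the high 8 bytes. */
    const __m128i split = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                        1, 3, 5, 7, 9, 11, 13, 15);
    for (size_t i = 0; i < n; i += 16) {
        __m128i v0 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i + 16));
        __m128i s0 = _mm_shuffle_epi8(v0, split);  /* a0..a7  | b0..b7  */
        __m128i s1 = _mm_shuffle_epi8(v1, split);  /* a8..a15 | b8..b15 */
        /* punpcklqdq: low 64 bits of s0, then low 64 bits of s1. */
        _mm_storeu_si128((__m128i *)(a + i), _mm_unpacklo_epi64(s0, s1));
        /* punpckhqdq: high 64 bits of s0, then high 64 bits of s1. */
        _mm_storeu_si128((__m128i *)(b + i), _mm_unpackhi_epi64(s0, s1));
    }
}
```

This still costs two shuffle-type instructions per 128b of interleaved input (one pshufb plus one punpck), but it needs no mask constant beyond the shuffle control and keeps every store 128b wide.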

    With AMD XOP, deinterleave would be 2x128b loads, 2 VPPERM, and 2x128b stores.
