Fastest de-interleave operation in C?

一个人的身影 2021-01-02 00:30
I have a pointer to an array of bytes mixed that contains the interleaved bytes of two distinct arrays array1 and array2. Say mi

      
      
        
6 answers

醉梦人生 (original poster) 2021-01-02 00:56

For x86 SSE, the `pack` and `punpck` instructions are what you need.  The examples below use AVX for the convenience of non-destructive 3-operand instructions.  (They don't use AVX2 256b-wide instructions, because the 256b pack/unpck instructions do two 128b unpacks in the low and high 128b lanes, so you'd need a shuffle to get things in the correct final order.)

An intrinsics version of the following would work the same.  Asm instructions are shorter to type for just writing a quick answer.

Interleave: abcd and 1234 -> a1b2c3d4:

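# loop body (AT&T syntax: the destination is the last operand, and the
# middle operand supplies the first byte of each interleaved pair)
vmovdqu    (%rax), %xmm0  # load 16 bytes of each source
vmovdqu    (%rbx), %xmm1
vpunpcklbw %xmm1, %xmm0, %xmm2  # low  halves -> 128b reg, bytes from (%rax) at even offsets
vpunpckhbw %xmm1, %xmm0, %xmm3  # high halves -> 128b reg
vmovdqu    %xmm2, (%rdi)   # store the results
vmovdqu    %xmm3, 16(%rdi)
# loop structure omitted (advance the pointers, test the counter).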

`punpcklbw` interleaves the bytes in the low 64 bits of the two source `xmm` registers, and `punpckhbw` does the same for the high 64 bits.  There are also `..wd` (word->dword) and `..dq` (dword->qword) versions, which would be useful for 16- or 32-bit elements.
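
Since an intrinsics version works the same way, here is a minimal C sketch of the interleave loop using SSE2 intrinsics. The function name `interleave_bytes`, its parameter order, and the scalar tail for lengths that are not a multiple of 16 are assumptions added for illustration, not part of the asm above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes a0 b0 a1 b1 ... into mixed[0 .. 2n). */
static void interleave_bytes(uint8_t *mixed, const uint8_t *a,
                             const uint8_t *b, size_t n)
{
    assert(n == 0 || (mixed && a && b));
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i lo = _mm_unpacklo_epi8(va, vb);  /* punpcklbw: a0 b0 ... a7 b7   */
        __m128i hi = _mm_unpackhi_epi8(va, vb);  /* punpckhbw: a8 b8 ... a15 b15 */
        _mm_storeu_si128((__m128i *)(mixed + 2 * i), lo);
        _mm_storeu_si128((__m128i *)(mixed + 2 * i + 16), hi);
    }
    for (; i < n; i++) {  /* scalar tail for the last n % 16 elements */
        mixed[2 * i] = a[i];
        mixed[2 * i + 1] = b[i];
    }
}
```

With intrinsics the first argument of `_mm_unpacklo_epi8` supplies the first byte of each pair, so there is no AT&T operand reversal to keep track of.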


De-interleave: a1b2c3d4 -> abcd and 1234

# outside the loop
vpcmpeqb   %xmm5, %xmm5, %xmm5   # set to all-1s
vpsrlw     $8, %xmm5, %xmm5      # every 16b word has low 8b = 0xFF, high 8b = 0.

# loop body
vmovdqu    (%rsi), %xmm2     # load two src chunks
vmovdqu    16(%rsi), %xmm3
vpand      %xmm2, %xmm5, %xmm0  # keep the low byte of each word (bytes at even offsets: a[])
vpand      %xmm3, %xmm5, %xmm1
vpackuswb  %xmm1, %xmm0, %xmm4  # AT&T order: xmm0 (first chunk) fills the low 8 bytes
vmovdqu    %xmm4, (%rax)    # store 16B of a[]
vpsrlw     $8, %xmm2, %xmm6     # high byte of each word (odd offsets: b[]) -> low byte
vpsrlw     $8, %xmm3, %xmm7
vpackuswb  %xmm7, %xmm6, %xmm4  # xmm6 (first chunk) fills the low 8 bytes
vmovdqu    %xmm4, (%rbx)    # store 16B of b[]


This can of course use far fewer registers.  I avoided reusing registers for readability, not performance.  Hardware register renaming makes reuse a non-issue, as long as you start with something that doesn't depend on the previous value.  (e.g. `movd`, which overwrites the whole register, not register-to-register `movss` or `pinsrd`, which merge into the old value.)

Deinterleave is so much more work because the pack instructions do signed or unsigned saturation, so the upper 8b of each 16b element has to be zeroed first.
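
The same mask/shift-then-pack idea can be sketched in C with SSE2 intrinsics. This is an illustration under assumptions: the name `deinterleave_bytes`, the argument order, and the scalar tail are made up here, and `n` is the number of elements per output array (so `mixed` holds `2 * n` bytes).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: splits a0 b0 a1 b1 ... into a[0 .. n) and b[0 .. n). */
static void deinterleave_bytes(uint8_t *a, uint8_t *b,
                               const uint8_t *mixed, size_t n)
{
    assert(n == 0 || (a && b && mixed));
    const __m128i low_byte = _mm_set1_epi16(0x00FF);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v0 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i + 16));
        /* a[]: zero the high byte of each word so every word is 0..255
           and packuswb's unsigned saturation never changes a value. */
        __m128i va = _mm_packus_epi16(_mm_and_si128(v0, low_byte),
                                      _mm_and_si128(v1, low_byte));
        /* b[]: a logical right shift moves the high byte down and
           zeroes the upper 8 bits in one step. */
        __m128i vb = _mm_packus_epi16(_mm_srli_epi16(v0, 8),
                                      _mm_srli_epi16(v1, 8));
        _mm_storeu_si128((__m128i *)(a + i), va);
        _mm_storeu_si128((__m128i *)(b + i), vb);
    }
    for (; i < n; i++) {  /* scalar tail for the last n % 16 elements */
        a[i] = mixed[2 * i];
        b[i] = mixed[2 * i + 1];
    }
}
```

Without the mask, a word such as 0x1234 would saturate to 0xFF instead of packing to 0x34, which is exactly the extra work described above.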

An alternative would be to use `pshufb` to pack the odd or even bytes of a single source reg into the low 64 bits of a register.  However, outside of the AMD XOP instruction set's `VPPERM`, there isn't a shuffle that can select bytes from 2 registers at once (like Altivec's much-loved `vperm`).  So with just SSE/AVX, you'd need 2 shuffles for every 128b of interleaved data.  And since store-port usage could be the bottleneck, you'd want a `punpck` (`punpcklqdq`) to combine two 64-bit chunks of `a` into a single register to set up a 128b store.
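
One way the shuffle-based approach could look in C is sketched below. It is a variant rather than a literal transcription of the paragraph above: a single `pshufb` per source register gathers the `a` bytes into the low 64 bits and the `b` bytes into the high 64 bits, then `punpcklqdq`/`punpckhqdq` build the two 128b stores. The function name `deinterleave_pshufb` is hypothetical, and the GCC/Clang `target("ssse3")` attribute is an assumption so the sketch compiles without `-mssse3`; it still needs an SSSE3-capable CPU at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

#if defined(__GNUC__)
__attribute__((target("ssse3")))
#endif
static void deinterleave_pshufb(uint8_t *a, uint8_t *b,
                                const uint8_t *mixed, size_t n)
{
    assert(n == 0 || (a && b && mixed));
    /* Shuffle control: even-offset bytes go to the low 8 bytes,
       odd-offset bytes go to the high 8 bytes. */
    const __m128i ctrl = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                       1, 3, 5, 7, 9, 11, 13, 15);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v0 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(mixed + 2 * i + 16));
        __m128i s0 = _mm_shuffle_epi8(v0, ctrl);  /* a0..a7  | b0..b7  */
        __m128i s1 = _mm_shuffle_epi8(v1, ctrl);  /* a8..a15 | b8..b15 */
        _mm_storeu_si128((__m128i *)(a + i), _mm_unpacklo_epi64(s0, s1));
        _mm_storeu_si128((__m128i *)(b + i), _mm_unpackhi_epi64(s0, s1));
    }
    for (; i < n; i++) {  /* scalar tail for the last n % 16 elements */
        a[i] = mixed[2 * i];
        b[i] = mixed[2 * i + 1];
    }
}
```

Whether this beats the mask-and-pack version depends on the shuffle and store throughput of the target CPU, so it is worth measuring both.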

With AMD XOP, deinterleave would be 2x128b loads, 2 VPPERM, and 2x128b stores.
    
             
                                                        
            
            
              
                