Packing BCD to DPD: How to improve this amd64 assembly routine?

佛祖请我去吃肉 2020-12-19 02:49

I'm writing a routine to convert between BCD (4 bits per decimal digit) and Densely Packed Decimal (DPD) (10 bits per 3 decimal digits). DPD is further documented […]

2 Answers
  •  醉梦人生
    2020-12-19 03:19

    (I split the BMI2 version into a separate answer, since it could end up totally different)


    After seeing what you're doing with that imul/shr to get a table index, I can see where you could use BMI2 pext to replace the and/imul/shr, or BMI1 bextr to replace just the shr (allowing a 32-bit imul instead of a 16-bit one, since you'd extract only the bits you want rather than needing a shift to clear the upper 16). There are AMD CPUs with BMI1, but even Steamroller lacks BMI2. Intel introduced BMI1 and BMI2 at the same time, with Haswell.

    You could maybe process two or four 16bit words at once with 64bit pext. But not for the whole algorithm: you can't do 4 parallel table lookups. (AVX2 VPGATHERDD is not worth using here.) Actually, you can use pshufb to implement a LUT with indices up to 4 bits; see below.
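
    For example (my own minimal C-intrinsics sketch, not from the original code): pshufb uses the low 4 bits of each byte of its control operand as an index into the 16-byte table in the other operand, so one instruction does sixteen independent 4-bit LUT lookups.

    #include <immintrin.h>          // SSSE3: _mm_shuffle_epi8

    // idx bytes must be 0..15 (or have bit 7 set to force a zero result);
    // each result byte is table16[idx & 0x0f] for that lane.
    static inline __m128i lut4bit(__m128i table16, __m128i idx)
    {
        return _mm_shuffle_epi8(table16, idx);
    }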

    Minor improvement version:

    .section .rodata
      # This won't assemble as written (the spaces in the binary constants);
      # it's written this way for humans to line up with the comments.
    
    extmask_lobits:     .long           0b0000 1111 0111 1111   # every bit except e
    extmask_hibits:     .long           0b0000 1000 1000 1000   # just a, e, i
    
    # pext doesn't have an immediate-operand form, but it can take the mask from a memory operand.
    # Load these into regs if running in a tight loop.
    
    #### TOTALLY UNTESTED #####
    .text
    .p2align 4,,10
    bcd2dpd_bmi2:
    #       mov   %edi,%eax           #   = 0000 abcd efgh iklm
    #       shl   %al                 #   = 0000 abcd fghi klm0
    #       shr   %eax                #   = 0000 0abc dfgh iklm
    
            pext  extmask_lobits, %edi, %eax
                                    #   = 0000 0abc dfgh iklm
            mov   %eax, %esi        # insn scheduling for 4-issue front-end: Fast-path is 4 fused-domain uops
              # And doesn't waste issue capacity when we're taking the slow path.  CPUs with mov-elimination won't waste execution units from issuing an extra mov
            test  $0x880, %edi        # fast path for a = e = 0
            jnz .Lslow_path
            ret
    
    .p2align 4
    .Lslow_path:
            # 8 uops, including the `ret`: can issue in 2 clocks.
    
            # replaces and/imul/shr
            pext  extmask_hibits, %edi, %edi #u= 0000 0000 0000 0aei
            and   $0x66, %esi                # q = 0000 0000 0fg0 0kl0
            imul  tab-8(,%rdi,4), %esi       # v = q * tab[u-2][0]
            and   $0x397, %eax               # r = 0000 00bc d00h 0klm
            xor   %esi, %eax                 # w = r ^ v
            or    tab-6(,%rdi,4), %eax       # x = w | tab[u-2][1]
            and   $0x3ff, %eax               #   = 0000 00xx xxxx xxxx
            ret
    

    Of course, if making this inline asm rather than a stand-alone function, you'd change back to having the fast path branch to the end and the slow path fall through. And you wouldn't waste space with alignment padding mid-function either.

    There might be more scope for using pext and/or pdep for more of the rest of the function.
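
    For reference, here's the same fast/slow path as an untested C-intrinsics sketch (compile with -mbmi2). It assumes the six-row lookup table `tab` from the question (two 16-bit entries per row, indexed by aei-2, exactly what the addressing modes above assume), declared extern here rather than reproduced.

    #include <stdint.h>
    #include <immintrin.h>                  // BMI2: _pext_u32

    extern const uint16_t tab[6][2];        // multiplier / OR-value pairs from the question

    uint16_t bcd2dpd_bmi2_c(uint16_t bcd)   // bcd = 0000 abcd efgh iklm
    {
        uint32_t lo = _pext_u32(bcd, 0x0F7F);   // drop only 'e': 0000 0abc dfgh iklm

        if ((bcd & 0x0880) == 0)                // fast path: a = e = 0
            return (uint16_t)lo;

        uint32_t u = _pext_u32(bcd, 0x0888);    // u = 0000 0000 0000 0aei  (>= 2 here)
        uint32_t q = lo & 0x066;                // q = 0000 0000 0fg0 0kl0
        uint32_t v = q * tab[u - 2][0];         // v = q * tab[u-2][0]
        uint32_t r = lo & 0x397;                // r = 0000 00bc d00h 0klm
        uint32_t x = (r ^ v) | tab[u - 2][1];   // w = r ^ v;  x = w | tab[u-2][1]
        return (uint16_t)(x & 0x3FF);
    }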


    I was thinking about how to do even better with BMI2. I think we could get multiple aei selectors from four shorts packed into 64b, then use pdep to deposit them in the low bits of different bytes. Then movq that to a vector register, where you use it as a shuffle control-mask for pshufb to do multiple 4bit LUT lookups.

    So we could go from 60 BCD bits to 50 DPD bits at a time. (Use shrd to shift bits between registers to handle loads/stores to byte-addressable memory.)

    Actually, 48 BCD bits (4 groups of 12bits each) -> 40 DPD bits is probably a lot easier, because you can unpack that to 4 groups of 16bits in a 64b integer register, using pdep. Dealing with the selectors for 5 groups is fine (you can unpack them with pmovzx), but dealing with the rest of the data would require bit-shuffling in vector registers, and not even the slow AVX2 variable-shift insns would make that easy to do. (Although it might be interesting to consider how to implement this without BMI2 at all, for big speedups on CPUs with just SSSE3 (i.e. every relevant CPU) or maybe SSE4.1.)

    This also means we can put two clusters of 4 groups into the low and high halves of a 128b register, to get even more parallelism.

    As a bonus, 48bits is a whole number of bytes, so reading from a buffer of BCD digits wouldn't require any shrd insns to get the leftover 4 bits from the last 64b into the low 4 for the next, or two offset pext masks to handle the 4 ignored bits being either the low or the high 4 of the 64b. Anyway, I think doing 5 groups at once just isn't worth considering.
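
    To make the 48-bit case concrete, here's a hedged, untested sketch of just that unpacking step, with the masks written straight from the bit layout below: one pdep spreads the four 12-bit groups into four 16-bit words, and a pext/pdep pair gathers the four aei selectors and re-deposits one per word, ready to movq into a vector register.

    #include <stdint.h>
    #include <immintrin.h>          // BMI2: _pdep_u64 / _pext_u64

    // 48 BCD bits = 4 groups of 12 (group 0 in bits 11:0, group 3 in bits 47:36).
    typedef struct { uint64_t groups; uint64_t selectors; } bcd48_unpacked;

    static inline bcd48_unpacked unpack_bcd48(uint64_t bcd48)
    {
        bcd48_unpacked r;
        // Each 12-bit group lands in the low 12 bits of its own 16-bit word.
        r.groups = _pdep_u64(bcd48, 0x0FFF0FFF0FFF0FFFULL);
        // Gather the a/e/i bits of all four groups into 12 contiguous bits...
        uint64_t aei = _pext_u64(bcd48, 0x0000888888888888ULL);
        // ...then give each group's 3-bit selector its own 16-bit word,
        // so it can be used later as a per-lane pshufb LUT index.
        r.selectors = _pdep_u64(aei, 0x0007000700070007ULL);
        return r;
    }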

    Full BMI2 / AVX pshufb LUT version (vectorizable)

    The data movement could be:

    ignored | group 3        | group 2        | group 1        |  group 0
    16bits  | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm
    
             3   2   1 | 0
    pext -> aei|aei|aei|aei  # packed together in the low bits
    
              2  |      1            |        0
    pdep ->  ... |0000 0000 0000 0aei|0000 0000 0000 0aei  # each in a separate 16b word
    
    movq -> xmm vector register.
     (Then pinsrq another group of 4 selectors into the upper 64b of the vector reg).  So the vector part can handle 2 (or AVX2: 4) of this at once
    
    vpshufb xmm2 -> map each byte to another byte (IMUL table)
    vpshufb xmm3 -> map each byte to another byte (OR table)
    
    
    Get the bits other than `aei` from each group of 3 BCD digits unpacked from 48b to 64b, into separate 16b words:
    
                      group 3       | group 2             | group 1             |  group 0
    pdep(src)-> 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm
    
     movq this into a vector reg (xmm1).  (And repeat for the next 48b and pinsrq that to the upper64)
    
    VPAND  xmm1, mask  (to zero aei in each group)
    
     Then use the vector-LUT results:
    VPMULLW xmm1, xmm2 -> packed 16b multiply, keeping only the low16 of the result
    
    VPAND   xmm1,  mask
    VPXOR   xmm1, something
    VPOR    xmm1, xmm3
    
    movq / pextrq back to integer regs
    
    pext to pack the bits back together
      You don't need the AND 0x3ff or equivalent:
      Those bits go away when you pext to pack each 16b down to 10b
    
    shrd or something to pack the 40b results of this into 64b chunks for store to memory.
      Or: 32b store, then shift and store the last 8b, but that seems lame
      Or: just do 64b stores, overlapping with the previous.  So you write 24b of garbage every time.  Take care at the very end of the buffer.
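
    Pulling the vector half of that outline together (continuing from the unpack_bcd48 sketch earlier), here is an untested skeleton for one 64-bit cluster. The two pshufb tables and the per-lane masks are left as extern placeholders because their exact values aren't pinned down above; only the data movement and the q/r/w/x dataflow from the scalar version are spelled out.

    #include <stdint.h>
    #include <immintrin.h>          // SSE2/SSSE3 vector intrinsics + BMI2

    // Hypothetical per-selector byte LUTs and lane masks: not given above; they
    // would be derived from the scalar table and the chosen in-lane bit layout.
    extern const __m128i dpd_mul_lut;    // pshufb table: multiplier per aei selector
    extern const __m128i dpd_or_lut;     // pshufb table: OR value per aei selector
    extern const __m128i dpd_q_mask;     // per-lane mask selecting the 'q' bits
    extern const __m128i dpd_r_mask;     // per-lane mask selecting the 'r' bits (zeroes aei)

    // groups / selectors: the two 64-bit values produced by unpack_bcd48().
    static inline uint64_t dpd40_from_unpacked(uint64_t groups, uint64_t selectors)
    {
        __m128i v   = _mm_cvtsi64_si128((long long)groups);
        __m128i sel = _mm_cvtsi64_si128((long long)selectors);
        // (A second cluster could go into the upper 64 bits here with pinsrq.)

        __m128i mul = _mm_shuffle_epi8(dpd_mul_lut, sel);   // per-lane IMUL-table lookup
        __m128i orv = _mm_shuffle_epi8(dpd_or_lut, sel);    // per-lane OR-table lookup

        __m128i q = _mm_and_si128(v, dpd_q_mask);
        __m128i r = _mm_and_si128(v, dpd_r_mask);
        __m128i w = _mm_xor_si128(r, _mm_mullo_epi16(q, mul));  // packed 16-bit multiply
        __m128i x = _mm_or_si128(w, orv);

        // Back to an integer register; the pext keeps only the low 10 bits of
        // each 16-bit lane, so no separate AND 0x3ff is needed.
        uint64_t lanes = (uint64_t)_mm_cvtsi128_si64(x);
        return _pext_u64(lanes, 0x03FF03FF03FF03FFULL);
    }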
    

    Use the AVX 3-operand versions of the 128b SSE instructions so you don't need a movdqa to avoid overwriting the pshufb table. As long as you never run a 256b AVX instruction, you don't need to mess with vzeroupper. You might as well use the v (VEX) versions of all vector instructions, though, if you use any. Inside a VM you might be running on a virtual CPU with BMI2 but not AVX support, so it's probably still a good idea to check both CPU feature flags, rather than assuming AVX if you see BMI2 (even though that's safe for all physical hardware that currently exists).
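
    A small sketch of that runtime check with the GCC/Clang builtins (my addition): test the two flags independently rather than inferring one from the other.

    #include <stdbool.h>

    static bool have_bmi2_and_avx(void)
    {
        __builtin_cpu_init();       // harmless if CPU detection already ran
        return __builtin_cpu_supports("bmi2") && __builtin_cpu_supports("avx");
    }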


    This is starting to look really efficient. It might be worth doing the mul/xor/and stuff in vector regs even if you don't have BMI2 pext/pdep to do the bit packing/unpacking. I guess you could use code like the existing non-BMI scalar routine to get the selectors, and mask/shift/or could build up the non-selector data into 16b chunks. Or maybe shrd for shifting data from one reg into another?
