x86 Assembly - 2 largest values out of 4 given numbers

隐瞒了意图╮ 2021-01-29 02:28

I'm writing a C subroutine in assembler that needs to find the 2 largest values out of 4 values passed in and multiplies them together. I'm working on finding the largest val

3 answers
  •  日久生厌
    2021-01-29 02:57

    Presumably you weren't looking for a SIMD answer, but I thought it would be interesting to write. And yes, SSE instructions work in 16-bit mode. VEX-encoded instructions don't, so you can't use the AVX 3-operand versions. Fortunately, I was able to write it without any extra MOVDQA instructions anyway, so AVX wouldn't help.

    IDK how to answer this the way you probably want without just doing your homework for you. If you're actually interested in a high performance implementation, rather than just anything that works, please update your question.


    Since you only need to return the product of the two highest numbers, you could just produce all 6 pairwise products and take the max. (4 choose 2 = 6).

    If brute force doesn't work, you aren't using enough :P

    update: I just realized that this will give the wrong answer if the largest pairwise product is from two negative numbers. It will work if you can rule out negative inputs, or otherwise rule out inputs where this is a problem. See below for an SSE4.1 version that finds the max and 2nd-max separately.
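
    To make the caveat concrete, here is a minimal scalar C sketch (not part of the asm). The function names are made up for illustration, and it uses full 32-bit products rather than the truncated 16-bit ones the SSE2 code works with. It compares the brute-force max of the 6 pairwise products against the product of the two largest inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Largest of the six pairwise products (the brute-force approach). */
int32_t max_pairwise_product(int16_t a, int16_t b, int16_t c, int16_t d)
{
    const int16_t v[4] = { a, b, c, d };
    int32_t best = (int32_t)v[0] * v[1];
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int32_t p = (int32_t)v[i] * v[j];
            if (p > best)
                best = p;
        }
    }
    return best;
}

/* Product of the largest and second-largest inputs (what the question asks for). */
int32_t product_of_two_largest(int16_t a, int16_t b, int16_t c, int16_t d)
{
    const int16_t v[4] = { a, b, c, d };
    int16_t first  = v[0] > v[1] ? v[0] : v[1];   /* largest so far */
    int16_t second = v[0] > v[1] ? v[1] : v[0];   /* runner-up so far */
    for (int i = 2; i < 4; i++) {
        if (v[i] > first) {
            second = first;
            first = v[i];
        } else if (v[i] > second) {
            second = v[i];
        }
    }
    assert(first >= second);
    return (int32_t)first * second;
}
```

    For inputs 1, 2, 3, 4 both functions return 12. For -5, -4, 1, 2 the brute-force version returns 20 (from -5 * -4) while the two largest values multiply to 2.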

    The code below computes the max pairwise product with no branching, using SSE2. (You could do the same thing in MMX registers using only SSE1, which added the MMX-register version of PMAXSW). It's just 11 instructions (not counting the prologue/epilogue), and they're all fast, mostly single-uop on most CPUs. (See also the x86 tag wiki for more x86 links)

    ;; untested, but it does assemble (with NASM)
    BITS 16
    
    ;; We only evaluate 16-bit products, and use signed comparisons on them.
    max_product_of_4_args:
       push    bp
       mov     bp, sp
    
       ; load all 4 args into a SIMD vector
       movq    xmm0, [bp+4]              ;xmm0 = [ 0...0 d c b a ] (word elements)
       pshuflw xmm1, xmm0, 0b10010011    ;xmm1 = [ 0..   c b a d ] (rotated left)
       pshufd  xmm2, xmm0, 0b11110001    ;xmm2 = [ 0..   b a d c ] (swapped)
       pmullw  xmm1, xmm0                ; [ 0..  cd bc ab ad ]  (missing ac and bd)                                                                                    
       pmullw  xmm2, xmm0                ; [ 0..  bd ac bd ac ]
    
       ; then find the max word element between the bottom halves of xmm1 and xmm2
       pmaxsw  xmm1, xmm2
       ; now a horizontal max of xmm1
       pshuflw xmm0, xmm1, 0b00001110    ; elements[1:0] = elements[3:2], rest don't care
       pmaxsw  xmm0, xmm1
       pshuflw xmm1, xmm0, 0b00000001
       pmaxsw  xmm0, xmm1
    
       ; maximum product result in the low word of xmm0
       movd    eax, xmm0
       ; AX = the result.  Top half of EAX = garbage.  I'm assuming the caller only looks at a 16-bit return value.                                                     
    
       ; To clear the upper half of EAX, you could use this instead of MOVD:
       ;pextrw  eax, xmm0, 0                                                                                                                                            
       ; or sign extend AX into EAX with CWDE                                                                                                                           
    
    fin:
       pop     bp
       ret
    end  
    

    If you want 32-bit products, PMAXSD is part of SSE4.1. Maybe unpack with zeros (or PMOVZXWD), and use PMADDWD to do 16b * 16b->32b vector multiplies. With the odd elements all zero, the horizontal add part of PMADDWD just gets the result of the signed multiply in the even elements.
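
    As a rough illustration of why the zero-interleaving trick works, here is a scalar C model of PMADDWD. The helper names are hypothetical and this is a sketch of the arithmetic only, not of the real instruction's overflow behaviour. With every odd lane zero, the pair sum collapses to the single signed 16b * 16b -> 32b product in the even lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PMADDWD: multiply signed 16-bit lanes, then add each
 * adjacent pair of 32-bit products into one 32-bit output lane.
 * (The one case where the real instruction wraps, all four inputs of a
 * pair equal to -32768, is not modeled here.) */
void pmaddwd_model(const int16_t *a, const int16_t *b, int32_t *out, size_t n_words)
{
    assert(n_words % 2 == 0);
    for (size_t i = 0; i < n_words / 2; i++) {
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
}

/* Model of unpacking with zeros: each input word lands in an even lane
 * and the odd lane next to it is zero. */
void interleave_with_zeros(const int16_t *in, int16_t *out, size_t n_in)
{
    for (size_t i = 0; i < n_in; i++) {
        out[2 * i] = in[i];
        out[2 * i + 1] = 0;
    }
}
```

    Feeding two interleaved 4-word inputs through pmaddwd_model gives four 32-bit lanes, each holding one full-width signed product.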

    Fun fact: MOVD and pextrw eax, xmm0, 0 don't need an operand-size prefix to write to eax in 16-bit mode. The 66 prefix is already part of the required encoding. pextrw ax, xmm0, 0 doesn't assemble (with NASM).

    Fun fact #2: ndisasm -b16 incorrectly disassembles the MOVQ load as movq xmm0, xmm10:

    $ nasm -fbin 16bit-SSE.asm
    
    $ ndisasm -b16 16bit-SSE
    ...
    00000003  F30F7E4604        movq xmm0,xmm10
    ...
    
    $ objdump -b binary -Mintel -D  -mi8086 16bit-SSE
    ...
    3:   f3 0f 7e 46 04          movq   xmm0,QWORD PTR [bp+0x4]
    ...
    

    Design notes for the 2-shuffle, 2-multiply approach:

    [  d  c  b  a ] ; orig
    [  c  b  a  d ] ; pshuflw
      cd bc ab ad :  missing ac and bd
    
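    [  b  a  d  c ] ; pshufd in the code above (pshuflw can do this swap too).  (Using psrldq to shift in zeros would produce zero, but signed products can be < 0)
     ;; Actually, the max must be >= 0 (before 16-bit truncation): at least two of the four inputs share a sign, so their product is non-negative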
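
    The shuffle immediates are easy to get wrong, so here is a small scalar C model for checking them. It is a sketch only: the function names are invented, and the swap is modeled with a PSHUFLW immediate (0b01001110) instead of the PSHUFD used in the asm, since both produce the same word order. The products are truncated to 16 bits like PMULLW.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSHUFLW on the low four words: output word i is the
 * input word selected by bits [2i+1:2i] of the immediate. */
void pshuflw_model(const int16_t in[4], unsigned imm8, int16_t out[4])
{
    assert(imm8 <= 0xFF);
    for (int i = 0; i < 4; i++)
        out[i] = in[(imm8 >> (2 * i)) & 3];
}

/* Low 16 bits of the product, like one lane of PMULLW. The final
 * conversion wraps on the usual two's-complement compilers. */
int16_t mullo16(int16_t x, int16_t y)
{
    uint16_t low = (uint16_t)((int32_t)x * y);
    return (int16_t)low;
}

/* Max of the six pairwise products using two shuffles and two multiplies. */
int16_t max_product_two_shuffles(int16_t a, int16_t b, int16_t c, int16_t d)
{
    const int16_t v[4] = { a, b, c, d };   /* word 0 = a ... word 3 = d */
    int16_t rot[4], swp[4];
    pshuflw_model(v, 0x93, rot);           /* 0b10010011: [ c b a d ] */
    pshuflw_model(v, 0x4E, swp);           /* 0b01001110: [ b a d c ] */

    int16_t best = INT16_MIN;
    for (int i = 0; i < 4; i++) {
        int16_t p1 = mullo16(v[i], rot[i]);   /* ad ab bc cd */
        int16_t p2 = mullo16(v[i], swp[i]);   /* ac bd ac bd */
        if (p1 > best) best = p1;             /* signed compare, like PMAXSW */
        if (p2 > best) best = p2;
    }
    return best;
}
```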
    

    I looked at trying to only do one PMULLW by creating inputs for it with two shuffles. It would be easy with PSHUFB (with a 16-byte mask constant).

    But I'm trying to limit it to SSE2 (and maybe code that could be adapted to MMX). Here's one idea that didn't pan out.

    [  d  d  c  c  b  b  a  a ]   ; punpcklwd
    [  b  a  b  a  b  a  d  c ]   ; pshufd
      bd ad bc ac bb ab ad ac
    
    : ab ac ad
    :    bc bd
    :       cd(missing)
    :             bb(problem)
    

    I'm not even sure that would be better. It would need an extra shuffle to get the horizontal max. (If our elements were unsigned, maybe we could use SSE4.1 PHMINPOSUW on 0 - vec to find the max in one go, but the OP is using signed compares.)


    SSE4.1 PHMINPOSUW

    We can add 32768 to each element and then use unsigned stuff.

    Given a signed 16-bit val: rangeshift = val + 1<<15 maps the lowest to 0, and the highest to 65535. (add, subtract, or XOR (add-without-carry) are all equivalent for this.)

    Since we only have an instruction to find the horizontal minimum, we can reverse the range with negation. We need to do that first, because 0 stays 0, while 0xFFFF becomes 0x0001, etc.

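    So -val + (1<<15), or mapped = (1<<15) - val, maps our signed values to unsigned in such a way that the lowest unsigned value is the greatest signed value. To reverse this: val = (1<<15) - mapped. (One exception: val = -32768 wraps to 0 in 16-bit arithmetic and would be picked as the max, so this assumes -32768 never appears as an input. Using mapped = 0x7FFF - val instead avoids the wraparound.)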

    Then we can use PHMINPOSUW to find the lowest (unsigned) word element (the max original element), mask that to all-ones, then PHMINPOSUW again to find the second-lowest.
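
    Before the asm, here is the same idea as a scalar C sketch. The function names are made up, min_pos_u16 is only a 4-element stand-in for PHMINPOSUW, and the code assumes no input is -32768 (that value wraps to 0 under this mapping).

```c
#include <assert.h>
#include <stdint.h>

/* Map a signed word to unsigned with the order reversed:
 * mapped = (1<<15) - val, modulo 2^16. */
uint16_t map_reversed(int16_t val)
{
    assert(val != INT16_MIN);   /* -32768 would wrap to 0 and look like the max */
    return (uint16_t)(0x8000u - (uint16_t)val);
}

/* Inverse mapping: val = (1<<15) - mapped, valid for mapped in 1..65535. */
int16_t unmap_reversed(uint16_t mapped)
{
    assert(mapped != 0);
    return (int16_t)(0x8000 - (int32_t)mapped);
}

/* 4-element stand-in for PHMINPOSUW: returns the minimum unsigned word
 * and stores the index of its first occurrence. */
uint16_t min_pos_u16(const uint16_t v[4], int *idx)
{
    *idx = 0;
    for (int i = 1; i < 4; i++)
        if (v[i] < v[*idx])
            *idx = i;
    return v[*idx];
}

int32_t product_of_two_largest_minpos(int16_t a, int16_t b, int16_t c, int16_t d)
{
    const int16_t in[4] = { a, b, c, d };
    uint16_t m[4];
    for (int i = 0; i < 4; i++)
        m[i] = map_reversed(in[i]);

    int idx;
    uint16_t min1 = min_pos_u16(m, &idx);   /* largest input, mapped */
    m[idx] = 0xFFFF;                        /* force it to 65535 so it can't win again */
    uint16_t min2 = min_pos_u16(m, &idx);   /* second-largest input, mapped */

    return (int32_t)unmap_reversed(min1) * unmap_reversed(min2);
}
```

    Unlike the pairwise-max approach, this returns 2 for the inputs -5, -4, 1, 2, because it selects the two largest values before multiplying.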

    push    bp
    mov     bp, sp
    
    pcmpeqw  xmm5, xmm5         ; xmm5 = all-ones (anything compares == itself)
    psllw    xmm5, 15           ; _mm_set1_epi16(1<<15)
    
    movq     xmm0, [bp+4]       ; xmm0 = [ 0...0 d c b a ]
    punpcklqdq xmm0, xmm0       ; xmm0 = [ d c b a d c b a ]: otherwise the zeroed upper words map to 0x8000 and can win the min search
    psubw    xmm5, xmm0         ; map the signed range to unsigned, in reverse order
    
    phminposuw xmm1, xmm5       ; xmm1 = [ 0...  minidx  minval ]  (ties return the lowest index, so minidx is 0..3)
    movd     eax, xmm1          ; ax = minval
    
    psrldq   xmm1, 2            ; xmm1 = [ 0...          minidx ]
    psllw    xmm1, 4            ; xmm1 = [ 0...          minidx * 16 ]
    
    pcmpeqw  xmm2, xmm2         ; all-ones
    psrlq    xmm2, 48           ; xmm2 = _mm_set1_epi64(0xFFFF)
    
    psllq    xmm2, xmm1         ; xmm2 = _mm_set1_epi64(0xFFFF << (minidx*16)), covering both copies of the min element
    ; force the min element to 65535, so we can go again and get the 2nd min (which might be 65535, but we don't care what position it was in)
    por      xmm2, xmm5
    
    phminposuw xmm3, xmm2
    movd     edx, xmm3          ; dx = 2nd min, upper half of edx=garbage (the index)
    
    mov      cx, 1<<15          ; undo the range shift
    neg      ax
    add      ax, cx             ; ax = (1<<15) - minval = largest input
    sub      cx, dx             ; cx = (1<<15) - 2nd min = second-largest input
    
    imul     cx                 ; signed multiply dx:ax = ax * cx
    pop      bp
    ret                         ; return 32-bit result in dx:ax (or caller can look at only the low 16 bits in ax)
    

    This is more instructions. It might not be better than a CMP/CMOV sorting network using integer registers. (See @Terje's comment for a suggestion on what compare-and-swap to use).
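
    For comparison, a sorting-network version might look like the following C sketch. This is my own guess at the shape of it, with made-up function names, and not necessarily the compare-and-swap the comment suggests. Four compare-and-swaps leave the max in a and the min in d, so the second largest is whichever of b and c is bigger.

```c
#include <assert.h>
#include <stdint.h>

/* Compare-and-swap: afterwards *hi >= *lo. Written with ternaries so a
 * compiler is free to use CMOV instead of a branch. */
void cmp_swap(int16_t *hi, int16_t *lo)
{
    int16_t h = *hi > *lo ? *hi : *lo;
    int16_t l = *hi > *lo ? *lo : *hi;
    *hi = h;
    *lo = l;
}

int32_t product_of_two_largest_network(int16_t a, int16_t b, int16_t c, int16_t d)
{
    cmp_swap(&a, &b);   /* a >= b */
    cmp_swap(&c, &d);   /* c >= d */
    cmp_swap(&a, &c);   /* a = overall max */
    cmp_swap(&b, &d);   /* d = overall min */

    /* b and c are now the two middle values, in unknown order. */
    int16_t second = b > c ? b : c;
    assert(a >= second);
    return (int32_t)a * second;   /* full 32-bit signed product */
}
```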
