Complex Mul and Div using SSE Instructions

情歌与酒 2021-02-08 14:17

Is it beneficial to perform complex multiplication and division with SSE instructions? I know that addition and subtraction perform better with SSE. Can someone tell me ho

3 answers
  •  心在旅途
    2021-02-08 14:45

    Well, for c1 = c1a + c1b*i and c2 = c2a + c2b*i, complex multiplication is defined as:

    ((c1a * c2a) - (c1b * c2b)) + ((c1b * c2a) + (c1a * c2b))i
    

    So the real and imaginary components of the product are

    ((c1a * c2a) - (c1b * c2b)) and ((c1b * c2a) + (c1a * c2b))i
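As a scalar baseline for the definition above, here is a minimal sketch in C. The `complexf` struct and the `complex_mul` function are hypothetical names chosen for illustration; they do not appear in the original answer.

```c
/* Hypothetical helper type (not from the original answer):
   a = real part, b = imaginary part. */
typedef struct {
    float a;
    float b;
} complexf;

/* Scalar complex multiply, written term by term from the definition:
   real = c1a*c2a - c1b*c2b, imaginary = c1b*c2a + c1a*c2b. */
complexf complex_mul(complexf c1, complexf c2)
{
    complexf r;
    r.a = (c1.a * c2.a) - (c1.b * c2.b);  /* real part */
    r.b = (c1.b * c2.a) + (c1.a * c2.b);  /* imaginary part */
    return r;
}
```

A version like this is handy as a reference when checking the output of a vectorised implementation.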
    

    So, assuming you are using 8 floats to represent 4 complex numbers (a = real part, b = imaginary part), laid out as two 4-float vectors as follows:

    c1a, c1b, c2a, c2b
    c3a, c3b, c4a, c4b
    

    And you want to simultaneously do (c1 * c3) and (c2 * c4), your SSE code would look "something" like the following:

    (Note: I used MSVC under Windows, but the principle will be the same with other compilers.)

    __declspec( align( 16 ) ) float c1c2[]       = { 1.0f, 2.0f, 3.0f, 4.0f };     // c1a, c1b, c2a, c2b
    __declspec( align( 16 ) ) float c3c4[]       = { 4.0f, 3.0f, 2.0f, 1.0f };     // c3a, c3b, c4a, c4b
    __declspec( align( 16 ) ) float mulfactors[] = { -1.0f, 1.0f, -1.0f, 1.0f };   // Negates lanes 0 and 2
    __declspec( align( 16 ) ) float res[]        = { 0.0f, 0.0f, 0.0f, 0.0f };
    
    __asm 
    {
        movaps xmm0, xmmword ptr [c1c2]         // Load c1 and c2 into xmm0.
        movaps xmm1, xmmword ptr [c3c4]         // Load c3 and c4 into xmm1.
        movaps xmm4, xmmword ptr [mulfactors]   // Load multiplication factors into xmm4.
    
        movaps xmm2, xmm1                       // Copy c3 and c4 into xmm2.
        movaps xmm3, xmm0                       // Copy c1 and c2 into xmm3.
        shufps xmm2, xmm1, 0xA0                 // Change order to c3a c3a c4a c4a and store in xmm2.
        shufps xmm1, xmm1, 0xF5                 // Change order to c3b c3b c4b c4b and store in xmm1.
        shufps xmm3, xmm0, 0xB1                 // Change order to c1b c1a c2b c2a and store in xmm3.
    
        mulps xmm0, xmm2                        // c1a*c3a, c1b*c3a, c2a*c4a, c2b*c4a
        mulps xmm3, xmm1                        // c1b*c3b, c1a*c3b, c2b*c4b, c2a*c4b
        mulps xmm3, xmm4                        // Negate lanes 0 and 2 (the b*b products) so the add works correctly.
    
        addps xmm0, xmm3                        // Add together.
    
        movaps xmmword ptr [res], xmm0          // Store back out.
    }
    
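    // Scalar reference results to check the SSE output against.
    float res1a = (c1c2[0] * c3c4[0]) - (c1c2[1] * c3c4[1]);
    float res1b = (c1c2[1] * c3c4[0]) + (c1c2[0] * c3c4[1]);
    
    float res2a = (c1c2[2] * c3c4[2]) - (c1c2[3] * c3c4[3]);
    float res2b = (c1c2[3] * c3c4[2]) + (c1c2[2] * c3c4[3]);
    
    // Exact comparison works here because these small integer-valued inputs
    // give exactly representable results.
    if ( res1a != res[0] || 
         res1b != res[1] || 
         res2a != res[2] || 
         res2b != res[3] )
    {
        _exit( 1 );
    }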
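MSVC does not support inline `__asm` blocks in x64 builds, so it may help to see the same sequence written with SSE intrinsics from `<xmmintrin.h>`. The sketch below is one possible translation, not the original author's code. The function name `complex_mul2_sse` is hypothetical, and it assumes unaligned loads and stores are acceptable, whereas the assembly above uses aligned `movaps`.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Multiplies two pairs of complex numbers at once (hypothetical helper).
   x holds   { c1a, c1b, c2a, c2b },
   y holds   { c3a, c3b, c4a, c4b },
   out gets  { c1*c3 (real, imag), c2*c4 (real, imag) }.
   Assumption: unaligned access is used so callers need no special alignment. */
void complex_mul2_sse(const float *x, const float *y, float *out)
{
    __m128 v0 = _mm_loadu_ps(x);                       /* c1a c1b c2a c2b */
    __m128 v1 = _mm_loadu_ps(y);                       /* c3a c3b c4a c4b */
    const __m128 signs = _mm_setr_ps(-1.0f, 1.0f, -1.0f, 1.0f);

    /* Same immediates as the shufps instructions in the assembly. */
    __m128 va = _mm_shuffle_ps(v1, v1, 0xA0);          /* c3a c3a c4a c4a */
    __m128 vb = _mm_shuffle_ps(v1, v1, 0xF5);          /* c3b c3b c4b c4b */
    __m128 vs = _mm_shuffle_ps(v0, v0, 0xB1);          /* c1b c1a c2b c2a */

    __m128 p0 = _mm_mul_ps(v0, va);                    /* products with the a parts */
    __m128 p3 = _mm_mul_ps(vs, vb);                    /* products with the b parts */
    p3 = _mm_mul_ps(p3, signs);                        /* negate lanes 0 and 2 */

    _mm_storeu_ps(out, _mm_add_ps(p0, p3));            /* add and store the result */
}
```

Called with the sample data from the listing above, `{ 1, 2, 3, 4 }` and `{ 4, 3, 2, 1 }`, this should produce the same four values that the scalar check computes.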
    

    What I've done above is rearrange the maths a bit so that it maps onto 4-wide vector operations. Assuming the following inputs:

    c1a c1b c2a c2b
    c3a c3b c4a c4b
    

    By rearranging, I end up with the following vectors (numbered by the xmm register that holds them):

    0 => c1a c1b c2a c2b
    1 => c3b c3b c4b c4b
    2 => c3a c3a c4a c4a
    3 => c1b c1a c2b c2a
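To see where the shuffle immediates 0xA0, 0xF5 and 0xB1 come from, the following scalar model of `shufps` may help. It is only an illustrative sketch (the name `shufps_model` is hypothetical): each 2-bit field of the immediate selects a lane, with the two low result lanes taken from the destination operand and the two high result lanes taken from the source operand.

```c
/* Scalar model of "shufps dst, src, imm" (hypothetical helper for illustration).
   Result lanes 0 and 1 are picked from dst, lanes 2 and 3 from src,
   each selected by one 2-bit field of imm, lowest field first. */
void shufps_model(const float dst[4], const float src[4],
                  unsigned imm, float out[4])
{
    /* Read all lanes first so out may alias dst or src. */
    float r0 = dst[(imm >> 0) & 3];
    float r1 = dst[(imm >> 2) & 3];
    float r2 = src[(imm >> 4) & 3];
    float r3 = src[(imm >> 6) & 3];

    out[0] = r0;
    out[1] = r1;
    out[2] = r2;
    out[3] = r3;
}
```

For example, 0xB1 is binary 10 11 00 01, so the selected lanes are 1, 0, 3, 2, which swaps the real and imaginary part within each complex number.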
    

    I then multiply 0 and 2 together to get:

    0 => c1a * c3a, c1b * c3a, c2a * c4a, c2b * c4a
    

    Next I multiply 3 and 1 together to get:

    3 => c1b * c3b, c1a * c3b, c2b * c4b, c2a * c4b
    

    Finally, I flip the signs of lanes 0 and 2 of vector 3 (by multiplying with mulfactors):

    3 => -(c1b * c3b), c1a * c3b, -(c2b * c4b), c2a * c4b
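The answer flips the signs with a multiply by { -1, 1, -1, 1 }. One common alternative, shown here as a sketch rather than as what the answer does, is to XOR the sign bits directly with `_mm_xor_ps`, which avoids a floating-point multiply. The function name `negate_even_lanes` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Negates lanes 0 and 2 of v by flipping their IEEE-754 sign bits
   (hypothetical helper; an alternative to multiplying by -1.0f). */
__m128 negate_even_lanes(__m128 v)
{
    /* -0.0f has only the sign bit set; 0.0f has no bits set. */
    const __m128 sign_mask = _mm_setr_ps(-0.0f, 0.0f, -0.0f, 0.0f);
    return _mm_xor_ps(v, sign_mask);
}
```

Whether this is measurably faster than the multiply depends on the target CPU, so it is worth benchmarking before adopting it.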
    

    So I can add vectors 0 and 3 together and get:

    (c1a * c3a) - (c1b * c3b), (c1b * c3a) + (c1a * c3b), (c2a * c4a) - (c2b * c4b), (c2b * c4a) + (c2a * c4b)
    

    Which is what we were after :)
