How to use the multiply-accumulate intrinsics on ARM Cortex-A8?

长发绾君心 2021-02-05 14:51

How do I use the multiply-accumulate intrinsics provided by GCC?

float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);

Can anyone expl

3 answers
  •  一向 (original poster)
    2021-02-05 15:26

    Simply put, the vmla instruction does the following:

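    /* Conceptual model only: the real float32x4_t comes from <arm_neon.h> */
    typedef struct
    {
      float val[4];
    } float32x4_t;
    
    /* vmla stands in for the vmlaq_f32 intrinsic */
    float32x4_t vmla (float32x4_t a, float32x4_t b, float32x4_t c)
    {
      float32x4_t result;
    
      for (int i=0; i<4; i++)
      {
        /* multiply b and c lane by lane, then add the accumulator a */
        result.val[i] = b.val[i]*c.val[i] + a.val[i];
      }
    
      return result;
    }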
    

    And all this compiles into a single assembler instruction :-)
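
    For comparison with the model above, here is a minimal sketch of calling the real intrinsic from the question, vmlaq_f32. The helper name mla4 and the HAVE_NEON macro are made up for this illustration; the scalar branch is only there so the file also builds on a machine without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, local to this sketch */
#else
#define HAVE_NEON 0
#endif

/* out[i] = acc[i] + x[i] * y[i] for i = 0..3.
   mla4 is a hypothetical helper name, not part of any library. */
void mla4(float out[4], const float acc[4],
          const float x[4], const float y[4])
{
    assert(out != NULL && acc != NULL && x != NULL && y != NULL);
#if HAVE_NEON
    float32x4_t va = vld1q_f32(acc);          /* load four floats each */
    float32x4_t vx = vld1q_f32(x);
    float32x4_t vy = vld1q_f32(y);
    float32x4_t vr = vmlaq_f32(va, vx, vy);   /* va + vx * vy, lane by lane */
    vst1q_f32(out, vr);                       /* store the four results */
#else
    /* Scalar fallback with the same lane-by-lane meaning. */
    for (int i = 0; i < 4; i++)
        out[i] = acc[i] + x[i] * y[i];
#endif
}
```

    Note the argument order: the accumulator comes first, then the two factors. On 32-bit ARM with GCC you typically need something like -mfpu=neon for the NEON branch to be compiled in.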

    You can use this NEON intrinsic, among other things, in the typical 4x4 matrix multiplications of 3D graphics, like this:

    float32x4_t transform (float32x4_t * matrix, float32x4_t vector)
    {
      /* in a perfect world this code would compile into just four instructions */
      /* pseudocode: vmul is a plain lane-wise multiply (VMUL); real code must
         also broadcast each vector component across all four lanes */
      float32x4_t result;
    
      result = vmul (matrix[0], vector);
      result = vmla (result, matrix[1], vector);
      result = vmla (result, matrix[2], vector);
      result = vmla (result, matrix[3], vector);
    
      return result;
    }
    

    This saves a couple of cycles because you don't have to add the results after the multiplication. The addition is used so often that multiply-accumulates have become mainstream these days (even x86 has added them with its FMA instruction set extensions).
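
    The transform pseudocode above multiplies lane by lane, so a real matrix-vector product needs each vector component broadcast across a whole matrix column. A minimal sketch of that, assuming a column-major float[16] matrix and using the scalar-operand intrinsics vmulq_n_f32 and vmlaq_n_f32, could look like this. The name transform4 and the HAVE_NEON macro are hypothetical, and the scalar branch exists only so the code builds without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, local to this sketch */
#else
#define HAVE_NEON 0
#endif

/* out = M * v, where m holds a 4x4 matrix in column-major order
   (m[4*c + r] is row r, column c). out must not alias v.
   transform4 is a hypothetical name used for illustration. */
void transform4(float out[4], const float m[16], const float v[4])
{
    assert(out != NULL && m != NULL && v != NULL);
#if HAVE_NEON
    /* One multiply followed by three multiply-accumulates:
       result = col0*v[0] + col1*v[1] + col2*v[2] + col3*v[3] */
    float32x4_t r = vmulq_n_f32(vld1q_f32(m + 0), v[0]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 4),  v[1]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 8),  v[2]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 12), v[3]);
    vst1q_f32(out, r);
#else
    /* Scalar fallback computing the same sums in the same order. */
    for (int row = 0; row < 4; row++) {
        float acc = m[row] * v[0];
        for (int col = 1; col < 4; col++)
            acc += m[4 * col + row] * v[col];
        out[row] = acc;
    }
#endif
}
```

    Column-major storage is a choice made for this sketch: it lets each column be loaded with a single vld1q_f32, so no transpose is needed before the multiply-accumulate chain.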

    Also worth mentioning: multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast path inside the Cortex-A8 NEON core. This fast path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VMUL or VMLA instruction. I could go into detail, but in a nutshell such an instruction series runs four times faster than a VMUL / VADD / VMUL / VADD series.

    Take a look at my simple matrix-multiply: I did exactly that. Due to this fast path it will run roughly four times faster than an implementation written using VMUL and VADD instead of VMLA.
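
    As a rough illustration of the accumulator-chaining pattern described above (not the matrix-multiply referred to in the answer), here is a sketch of a dot product in which the result of each vmlaq_f32 is the accumulator of the next one. The name dot_f32 and the HAVE_NEON macro are hypothetical. Whether the compiler really emits back-to-back VMLA instructions, and how much faster that runs, depends on the compiler and flags and is not measured here.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, local to this sketch */
#else
#define HAVE_NEON 0
#endif

/* Dot product of a and b; n must be a multiple of 4.
   dot_f32 is a hypothetical name used for illustration. */
float dot_f32(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL && n % 4 == 0);
#if HAVE_NEON
    float32x4_t acc = vdupq_n_f32(0.0f);       /* four partial sums */
    for (size_t i = 0; i < n; i += 4) {
        /* the accumulator of each VMLA is the result of the previous one */
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    /* horizontal add: (l0 + l2) + (l1 + l3) */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    s = vpadd_f32(s, s);
    return vget_lane_f32(s, 0);
#else
    /* Scalar fallback keeping the same four partial sums. */
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4)
        for (int k = 0; k < 4; k++)
            lane[k] += a[i + k] * b[i + k];
    return (lane[0] + lane[2]) + (lane[1] + lane[3]);
#endif
}
```

    The four partial sums are only combined once, after the loop, so the loop body stays a pure load / multiply-accumulate sequence.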
