How to use the multiply-accumulate intrinsics on ARM Cortex-A8?

长发绾君心 2021-02-05 14:51

How do I use the multiply-accumulate intrinsics provided by GCC?

float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);

Can anyone expl

3 Answers
  •  小蘑菇 (original poster)
    2021-02-05 15:02

    result = vmul(matrix[0], vector);
    result = vmla(result, matrix[1], vector);
    result = vmla(result, matrix[2], vector);
    result = vmla(result, matrix[3], vector);
    

    This sequence won't work, though. The problem is that the x component of the result accumulates only vector.x multiplied by the first element of each matrix row, which can be expressed as:

    result.x = vector.x * (matrix[0][0] + matrix[1][0] + matrix[2][0] + matrix[3][0]);
    

    ...
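To make the problem concrete, here is a minimal plain-C sketch of the naive sequence. The `vec4f` type and the `model_mul`, `model_mla`, and `naive_transform` helpers are hypothetical stand-ins introduced only for illustration; they model the lane-wise behaviour of `vmulq_f32` and `vmlaq_f32` (which returns `a + b * c` per lane) and are not NEON code.

```c
#include <stddef.h>

/* Hypothetical 4-lane float vector; a plain-C stand-in for float32x4_t. */
typedef struct { float lane[4]; } vec4f;

/* Models a lane-wise multiply (what vmulq_f32 computes). */
static vec4f model_mul(vec4f b, vec4f c) {
    vec4f r;
    for (size_t i = 0; i < 4; ++i) {
        r.lane[i] = b.lane[i] * c.lane[i];
    }
    return r;
}

/* Models vmlaq_f32(a, b, c): each lane is a + b * c. */
static vec4f model_mla(vec4f a, vec4f b, vec4f c) {
    vec4f r;
    for (size_t i = 0; i < 4; ++i) {
        r.lane[i] = a.lane[i] + b.lane[i] * c.lane[i];
    }
    return r;
}

/* The naive sequence: every step multiplies a row by the unmodified vector,
   so lane i only ever sees vector.lane[i]. */
static vec4f naive_transform(const vec4f matrix[4], vec4f vector) {
    vec4f result = model_mul(matrix[0], vector);
    result = model_mla(result, matrix[1], vector);
    result = model_mla(result, matrix[2], vector);
    result = model_mla(result, matrix[3], vector);
    return result;
}
```

With rows {1,2,3,4}, {5,6,7,8}, {9,10,11,12}, {13,14,15,16} and vector {2,3,4,5}, lane x comes out as 2 * (1 + 5 + 9 + 13), which is the column sum scaled by vector.x rather than a dot product.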

    The correct sequence would be:

    result = vmul(matrix[0], vector.xxxx);
    result = vmla(result, matrix[1], vector.yyyy);
    

    ...
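The broadcast sequence above could be sketched in C roughly as follows. The function name `mat4_mul_vec4` and the flat 16-float row layout are assumptions made for this example. When `arm_neon.h` is available, the sketch uses `vmulq_n_f32` and `vmlaq_n_f32`, which multiply a vector by a single scalar and so play the role of `vector.xxxx`, `vector.yyyy`, and so on; otherwise it falls back to a scalar loop with the same result.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Computes out = v[0]*row0 + v[1]*row1 + v[2]*row2 + v[3]*row3,
   where m holds four rows of four floats each (assumed layout). */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4]) {
#if defined(__ARM_NEON)
    /* mul, then three multiply-accumulates, each with one scalar broadcast. */
    float32x4_t r = vmulq_n_f32(vld1q_f32(m), v[0]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 4), v[1]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 8), v[2]);
    r = vmlaq_n_f32(r, vld1q_f32(m + 12), v[3]);
    vst1q_f32(out, r);
#else
    /* Scalar fallback with the same mul, madd, madd, madd structure. */
    float acc[4];
    for (size_t i = 0; i < 4; ++i) {
        acc[i] = m[i] * v[0];
    }
    for (size_t j = 1; j < 4; ++j) {
        for (size_t i = 0; i < 4; ++i) {
            acc[i] += m[4 * j + i] * v[j];
        }
    }
    for (size_t i = 0; i < 4; ++i) {
        out[i] = acc[i];
    }
#endif
}
```

Each step writes all four lanes of the accumulator at once, so no per-lane shuffling of the result is needed.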

    NEON and SSE don't have a general built-in selection (swizzle) for the fields; this would require 8 bits in the instruction encoding per vector register. GLSL and HLSL, for example, do have this kind of facility, so most GPUs have it as well.

    Alternative way to achieve this would be:

    result.x = dp4(vector, matrix[0]);
    result.y = dp4(vector, matrix[1]);
    

    ... // and of course, the matrix would have to be transposed for this to yield the same result
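For comparison, the dot-product form might look like this in plain C. The helpers `dp4`, `transpose4`, and `transform_dp4` are hypothetical names for this sketch, and the flat 16-float row layout is an assumption. Each output field is produced by a separate four-element dot product against one row of the transposed matrix.

```c
#include <stddef.h>

/* Four-element dot product, as the dp4 operation above is assumed to behave. */
static float dp4(const float a[4], const float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Writes the transpose of the 4x4 matrix m (four rows of four floats) to mt. */
static void transpose4(const float m[16], float mt[16]) {
    for (size_t r = 0; r < 4; ++r) {
        for (size_t c = 0; c < 4; ++c) {
            mt[4 * c + r] = m[4 * r + c];
        }
    }
}

/* out[i] = dp4(v, row i of the transposed matrix mt). */
static void transform_dp4(const float mt[16], const float v[4], float out[4]) {
    float r[4];
    for (size_t i = 0; i < 4; ++i) {
        r[i] = dp4(v, mt + 4 * i);
    }
    for (size_t i = 0; i < 4; ++i) {
        out[i] = r[i];
    }
}
```

Called on `transpose4` of the original matrix, this yields the same vector as the broadcast sequence, but every result field is written individually.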

    The mul, madd, madd, madd sequence is usually preferred, as it does not require a write mask for the target register fields.

    Otherwise the code looks good. =)
