How to use the multiply-accumulate intrinsics provided by GCC?
float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);
Can anyone expl
result = vml (matrix[0], vector);
result = vmla (result, matrix[1], vector);
result = vmla (result, matrix[2], vector);
result = vmla (result, matrix[3], vector);
This sequence won't work, though. The problem is that the x component of the result accumulates only vector.x scaled by the x element of each matrix row, which can be expressed as:
result.x = vector.x * (matrix[0][0] + matrix[1][0] + matrix[2][0] + matrix[3][0]);
...
The correct sequence would be:
result = vml (matrix[0], vector.xxxx);
result = vmla(result, matrix[1], vector.yyyy);
...
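With the real intrinsics, the corrected sequence could look roughly like the sketch below. The function name mat4_mul_vec4 and the flat 16-float layout (matrix[i] is the i-th group of four contiguous floats) are assumptions made for illustration; they are not from the question. The broadcast that the pseudocode writes as vector.xxxx is done here with vdupq_n_f32, and vmlaq_f32(a, b, c) computes a + b * c per lane. A scalar fallback is included so the sketch also compiles on targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Hypothetical helper: out = m[0]*v.x + m[1]*v.y + m[2]*v.z + m[3]*v.w,
   where m[i] means the four floats starting at m + 4*i. */
void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
#if SKETCH_HAVE_NEON
    /* mul: result = matrix[0] * vector.xxxx */
    float32x4_t r = vmulq_f32(vld1q_f32(m + 0), vdupq_n_f32(v[0]));
    /* madd: result += matrix[1] * vector.yyyy, and so on */
    r = vmlaq_f32(r, vld1q_f32(m + 4),  vdupq_n_f32(v[1]));
    r = vmlaq_f32(r, vld1q_f32(m + 8),  vdupq_n_f32(v[2]));
    r = vmlaq_f32(r, vld1q_f32(m + 12), vdupq_n_f32(v[3]));
    vst1q_f32(out, r);
#else
    /* Scalar fallback with the same arithmetic, one output lane at a time. */
    for (int j = 0; j < 4; ++j) {
        out[j] = m[0 + j] * v[0] + m[4 + j] * v[1]
               + m[8 + j] * v[2] + m[12 + j] * v[3];
    }
#endif
}
```

Whether matrix[i] is a row or a column depends on your storage convention: with column-major storage this computes M * v, with row-major storage it computes v * M.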
NEON and SSE don't have built-in selection of the fields (swizzling) on instruction operands; this would require 8 bits in the instruction encoding per vector register. GLSL and HLSL, for example, do have this kind of facility, and so do most GPUs.
An alternative way to achieve this would be:
result.x = dp4(vector, matrix[0]);
result.y = dp4(vector, matrix[1]);
... // and of course, the matrix would have to be transposed for this to yield the same result
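NEON has no single dp4 instruction, so the dot-product variant needs a horizontal add per output component. The sketch below is one possible way to write it, not a tuned implementation. The names dp4 and mat4_mul_vec4_dp4 and the flat layout of the transposed matrix mt are assumptions for illustration. The NEON path multiplies lane-wise, adds the high and low halves, then uses a pairwise add to reduce to one value; the scalar path is a plain loop.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_DP4_NEON 1
#else
#define SKETCH_DP4_NEON 0
#endif

/* Hypothetical stand-in for a shader-style dp4: a.x*b.x + ... + a.w*b.w */
static float dp4(const float a[4], const float b[4])
{
#if SKETCH_DP4_NEON
    float32x4_t p = vmulq_f32(vld1q_f32(a), vld1q_f32(b)); /* lane-wise products */
    float32x2_t s = vadd_f32(vget_low_f32(p), vget_high_f32(p)); /* {p0+p2, p1+p3} */
    s = vpadd_f32(s, s);                                   /* both lanes hold the sum */
    return vget_lane_f32(s, 0);
#else
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
#endif
}

/* mt is assumed to hold the TRANSPOSE of the matrix used by the
   mul/madd sequence, as four groups of four contiguous floats. */
void mat4_mul_vec4_dp4(const float mt[16], const float v[4], float out[4])
{
    for (int i = 0; i < 4; ++i) {
        out[i] = dp4(v, mt + 4 * i); /* result.x = dp4(vector, matrix[0]), ... */
    }
}
```

Each output component here needs its own reduction and a scalar store, which is part of why the mul/madd form tends to map better onto NEON.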
The mul, madd, madd, madd sequence is usually preferred, as it does not require a write mask for the target register fields.
Otherwise the code looks good. =)