How do I use the multiply-accumulate intrinsics provided by GCC?
float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);
Can anyone expl
Googling for vmlaq_f32 turned up the reference for the RVCT compiler tools. Here's what it says:
Vector multiply accumulate: vmla -> Vr[i] := Va[i] + Vb[i] * Vc[i]
...
float32x4_t vmlaq_f32 (float32x4_t a, float32x4_t b, float32x4_t c);
AND
The following types are defined to represent vectors. NEON vector data types are named according to the following pattern: <type><size>x<number of lanes>_t. For example, int16x4_t is a vector containing four lanes, each containing a signed 16-bit integer. Table E.1 lists the vector data types.
IOW, the return value from the function will be a vector containing four 32-bit floats, and each element of the vector is calculated by multiplying the corresponding elements of b and c, and adding the corresponding element of a.
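As a minimal sketch of such a call, assuming the three inputs live in plain float arrays: mla4 is a hypothetical helper name, not part of any library. On an ARM build with NEON enabled it uses vld1q_f32, vmlaq_f32 and vst1q_f32 from arm_neon.h; on any other target it falls back to a scalar loop with the same lane-wise meaning.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = acc[i] + x[i] * y[i] for four floats. */
void mla4(float out[4], const float acc[4], const float x[4], const float y[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t a = vld1q_f32(acc);      /* accumulator lanes */
    float32x4_t b = vld1q_f32(x);        /* first multiplicand */
    float32x4_t c = vld1q_f32(y);        /* second multiplicand */
    vst1q_f32(out, vmlaq_f32(a, b, c));  /* a + b * c, lane by lane */
#else
    /* Scalar fallback for non-NEON targets. */
    for (int i = 0; i < 4; i++)
        out[i] = acc[i] + x[i] * y[i];
#endif
}
```

With acc = {1, 2, 3, 4}, x = {2, 2, 2, 2} and y = {10, 20, 30, 40}, the result is {21, 42, 63, 84}.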
HTH
result = vml (matrix[0], vector);
result = vmla (result, matrix[1], vector);
result = vmla (result, matrix[2], vector);
result = vmla (result, matrix[3], vector);
This sequence won't work, though. The problem is that the x component of the result accumulates only vector.x multiplied by the first element of each matrix row, which can be expressed as:
result.x = vector.x * (matrix[0][0] + matrix[1][0] + matrix[2][0] + matrix[3][0]);
...
The correct sequence would be:
result = vml (matrix[0], vector.xxxx);
result = vmla(result, matrix[1], vector.yyyy);
...
NEON and SSE don't have general built-in selection (swizzling) of the fields; this would require 8 bits in the instruction encoding per vector register. GLSL and HLSL, for example, do have this kind of facility, so most GPUs have it too.
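For the specific broadcast used in the corrected sequence (vector.xxxx, vector.yyyy, ...), one option on NEON is the scalar-operand intrinsics vmulq_n_f32 and vmlaq_n_f32, which multiply every lane by a single float. The sketch below assumes the matrix is stored as 16 consecutive floats, with m[4*i] .. m[4*i+3] playing the role of matrix[i]; transform4 is a hypothetical name, and the scalar branch exists only so the code also builds on non-NEON targets.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper:
   out = matrix[0]*v[0] + matrix[1]*v[1] + matrix[2]*v[2] + matrix[3]*v[3],
   where matrix[i] is the group of four floats starting at m[4*i]. */
void transform4(float out[4], const float m[16], const float v[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t r = vmulq_n_f32(vld1q_f32(m), v[0]);   /* matrix[0] * vector.xxxx */
    r = vmlaq_n_f32(r, vld1q_f32(m + 4), v[1]);        /* += matrix[1] * vector.yyyy */
    r = vmlaq_n_f32(r, vld1q_f32(m + 8), v[2]);        /* += matrix[2] * vector.zzzz */
    r = vmlaq_n_f32(r, vld1q_f32(m + 12), v[3]);       /* += matrix[3] * vector.wwww */
    vst1q_f32(out, r);
#else
    /* Scalar fallback: same sum, one output lane at a time. */
    for (int lane = 0; lane < 4; lane++) {
        float r = m[lane] * v[0];
        for (int i = 1; i < 4; i++)
            r += m[4 * i + lane] * v[i];
        out[lane] = r;
    }
#endif
}
```

For a matrix whose first three groups are the identity and whose last group is {10, 20, 30, 1}, transforming {1, 2, 3, 1} gives {11, 22, 33, 1}.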
An alternative way to achieve this would be:
result.x = dp4(vector, matrix[0]);
result.y = dp4(vector, matrix[1]);
... // and of course, the matrix would have to be transposed for this to yield the same result
The mul, madd, madd, madd sequence is usually preferred, as it does not require a write mask for the target register fields.
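To make the dp4 alternative concrete, here is a plain scalar C sketch. The names dp4, transpose4 and transform4_dp are hypothetical, and the matrix is assumed to be 16 consecutive floats with matrix[i] starting at index 4*i. Each output component is a dot product of the vector with one group of the transposed matrix.

```c
/* Hypothetical helper: four-component dot product. */
float dp4(const float a[4], const float b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Hypothetical helper: t = transpose of m, both stored as 16 floats. */
void transpose4(float t[16], const float m[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            t[4 * c + r] = m[4 * r + c];
}

/* out[k] = dp4(v, t[k]), where t is the transposed matrix. */
void transform4_dp(float out[4], const float t[16], const float v[4])
{
    for (int k = 0; k < 4; k++)
        out[k] = dp4(v, t + 4 * k);
}
```

Each assignment to out[k] writes a single component, which is the write-mask requirement the mul, madd, madd, madd form avoids.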
Otherwise the code looks good. =)
Simply put, the vmla instruction does the following:
typedef struct
{
    float val[4];
} float32x4_t;
/* Scalar model of vmla: each lane computes a + b * c independently. */
float32x4_t vmla (float32x4_t a, float32x4_t b, float32x4_t c)
{
    float32x4_t result;
    for (int i = 0; i < 4; i++)
    {
        result.val[i] = b.val[i] * c.val[i] + a.val[i];
    }
    return result;
}
And all this compiles into a single assembler instruction :-)
You can use this NEON intrinsic, among other things, in typical 4x4 matrix multiplications for 3D graphics like this:
float32x4_t transform (float32x4_t * matrix, float32x4_t vector)
{
/* in a perfect world this code would compile into just four instructions */
float32x4_t result;
result = vml (matrix[0], vector);
result = vmla (result, matrix[1], vector);
result = vmla (result, matrix[2], vector);
result = vmla (result, matrix[3], vector);
return result;
}
This saves a couple of cycles because you don't have to add the results after multiplication. The addition is needed so often that multiply-accumulate instructions have become mainstream these days (even x86 has added them, in the form of the FMA extensions).
Also worth mentioning: multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast path inside the Cortex-A8 NEON core. This fast path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VMUL or VMLA instruction. I could go into detail, but in a nutshell such an instruction series runs four times faster than a VMUL / VADD / VMUL / VADD series.
Take a look at my simple matrix-multiply: I did exactly that. Due to this fast path it will run roughly four times faster than an implementation written using VMUL and VADD instead of VMLA.
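To show the two instruction shapes being compared, here is a rough sketch that sums four lane-wise products both ways. sum_mul_add and sum_mla are hypothetical names, and the inputs are assumed to be 16 floats each. The sketch says nothing about timing, and a compiler may well merge the separate multiply and add into a multiply-accumulate on its own. The scalar branch is only there for non-NEON builds.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Separate multiply and add for every term after the first. */
void sum_mul_add(float out[4], const float a[16], const float b[16])
{
    float32x4_t r = vmulq_f32(vld1q_f32(a), vld1q_f32(b));
    for (int i = 1; i < 4; i++) {
        float32x4_t p = vmulq_f32(vld1q_f32(a + 4 * i), vld1q_f32(b + 4 * i));
        r = vaddq_f32(r, p);
    }
    vst1q_f32(out, r);
}

/* Chained multiply-accumulate: each vmlaq_f32 takes the previous result
   as its accumulator argument. */
void sum_mla(float out[4], const float a[16], const float b[16])
{
    float32x4_t r = vmulq_f32(vld1q_f32(a), vld1q_f32(b));
    for (int i = 1; i < 4; i++)
        r = vmlaq_f32(r, vld1q_f32(a + 4 * i), vld1q_f32(b + 4 * i));
    vst1q_f32(out, r);
}
#else
/* Scalar stand-ins with the same results, for non-NEON builds. */
void sum_mul_add(float out[4], const float a[16], const float b[16])
{
    for (int lane = 0; lane < 4; lane++) {
        float r = a[lane] * b[lane];
        for (int i = 1; i < 4; i++) {
            float p = a[4 * i + lane] * b[4 * i + lane];  /* multiply */
            r = r + p;                                    /* then add */
        }
        out[lane] = r;
    }
}

void sum_mla(float out[4], const float a[16], const float b[16])
{
    for (int lane = 0; lane < 4; lane++) {
        float r = a[lane] * b[lane];
        for (int i = 1; i < 4; i++)
            r += a[4 * i + lane] * b[4 * i + lane];       /* multiply-accumulate */
        out[lane] = r;
    }
}
#endif
```

Both functions compute the same values; only the second has the shape where every accumulator comes straight from the preceding multiply or multiply-accumulate.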