How to use the multiply and accumulate intrinsics in ARM Cortex-A8?

长发绾君心 2021-02-05 14:51

How do I use the multiply-accumulate intrinsics provided by GCC?

float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);

Can anyone expl

3 answers

小蘑菇 (original poster)

2021-02-05 15:02
```
result = vml (matrix[0], vector);
result = vmla (result, matrix[1], vector);
result = vmla (result, matrix[2], vector);
result = vmla (result, matrix[3], vector);
```
This sequence won't work, though. The problem is that the x component accumulates only vector.x, scaled by the first element of each matrix row, which can be expressed as:
```
result.x = vector.x * (matrix[0][0] + matrix[1][0] + matrix[2][0] + matrix[3][0]);
```
...
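To make the failure concrete, here is a minimal scalar sketch in C. The `vec4` type and the helpers `mul4` and `mla4` are hypothetical stand-ins, not GCC intrinsics: `mla4(a, b, c)` models the per-lane behaviour of `vmlaq_f32`, which computes `a + b * c` in each lane, and `mul4` models a plain per-lane multiply.

```c
#include <assert.h>

/* Hypothetical scalar stand-in for a 4-lane float register. */
typedef struct { float lane[4]; } vec4;

/* Per-lane multiply (models a plain vector multiply). */
static vec4 mul4(vec4 b, vec4 c) {
    vec4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = b.lane[i] * c.lane[i];
    return r;
}

/* Per-lane multiply-accumulate: models vmlaq_f32(a, b, c) = a + b * c. */
static vec4 mla4(vec4 a, vec4 b, vec4 c) {
    vec4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i] * c.lane[i];
    return r;
}

/* The naive sequence: every step pairs lane i of the row with lane i of v. */
static vec4 naive_transform(const vec4 m[4], vec4 v) {
    vec4 r = mul4(m[0], v);
    r = mla4(r, m[1], v);
    r = mla4(r, m[2], v);
    r = mla4(r, m[3], v);
    return r;
}

/* Self-check: lane 0 only ever sees v.x, lane 1 only ever sees v.y. */
static void check_naive(void) {
    const vec4 m[4] = {
        {{1, 2, 3, 4}}, {{5, 6, 7, 8}}, {{9, 10, 11, 12}}, {{13, 14, 15, 16}}
    };
    const vec4 v = {{2, 3, 4, 5}};
    vec4 r = naive_transform(m, v);
    assert(r.lane[0] == 2.0f * (1 + 5 + 9 + 13));
    assert(r.lane[1] == 3.0f * (2 + 6 + 10 + 14));
}
```

Calling `check_naive()` confirms that y, z and w of the vector never contribute to the x lane, so this is not a matrix-vector product.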

The correct sequence would be:
```
result = vml (matrix[0], vector.xxxx);
result = vmla(result, matrix[1], vector.yyyy);
```
...
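With GCC's `arm_neon.h`, one way to get the `.xxxx` / `.yyyy` broadcast is to use the scalar variants `vmulq_n_f32` and `vmlaq_n_f32`, which multiply a whole vector by a single float. The sketch below is an assumption about how the sequence above could be written, not the only option. The function name `transform` and the array layout (`matrix[k]` holds the four floats multiplied by component k of the vector) are chosen for illustration. On a Cortex-A8 the NEON path typically needs `-mfpu=neon`; a scalar fallback is included so the code also builds elsewhere.

```c
#include <assert.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* out = matrix[0]*v[0] + matrix[1]*v[1] + matrix[2]*v[2] + matrix[3]*v[3] */
static void transform(const float matrix[4][4], const float v[4], float out[4]) {
    float32x4_t r = vmulq_n_f32(vld1q_f32(matrix[0]), v[0]); /* mul  by v.xxxx */
    r = vmlaq_n_f32(r, vld1q_f32(matrix[1]), v[1]);          /* madd by v.yyyy */
    r = vmlaq_n_f32(r, vld1q_f32(matrix[2]), v[2]);          /* madd by v.zzzz */
    r = vmlaq_n_f32(r, vld1q_f32(matrix[3]), v[3]);          /* madd by v.wwww */
    vst1q_f32(out, r);
}
#else
/* Scalar fallback with the same lane-by-lane behaviour. */
static void transform(const float matrix[4][4], const float v[4], float out[4]) {
    for (int i = 0; i < 4; ++i) out[i] = matrix[0][i] * v[0];
    for (int k = 1; k < 4; ++k)
        for (int i = 0; i < 4; ++i) out[i] += matrix[k][i] * v[k];
}
#endif

/* Self-check with small integer values, so the float results are exact. */
static void check_transform(void) {
    const float m[4][4] = {
        {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}
    };
    const float v[4] = {2, 3, 4, 5};
    float out[4];
    transform(m, v, out);
    assert(out[0] == 118.0f);  /* 1*2 + 5*3 + 9*4 + 13*5 */
    assert(out[1] == 132.0f);  /* 2*2 + 6*3 + 10*4 + 14*5 */
}
```

Here every component of the vector contributes to every output lane, which is what the naive sequence was missing.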

NEON and SSE don't have built-in selection for the fields (this would require 8 bits in the instruction encoding per vector register). GLSL and HLSL, for example, do have this kind of facility, and so do most GPUs.

An alternative way to achieve this would be:
```
result.x = dp4(vector, matrix[0]);
result.y = dp4(vector, matrix[1]);
```
... // and of course, the matrix would have to be transposed for this to yield the same result
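As a rough illustration of the dot-product formulation, here is a plain C sketch. `dp4` is written out as an ordinary helper function (it is a shader-style name, not a GCC intrinsic), and `transform_dp4` is a hypothetical name. The matrix passed in is the transpose of the one used by the mul/madd sequence, so `rows[i]` holds the four coefficients that produce output component i.

```c
#include <assert.h>

/* Four-component dot product, written as an ordinary C helper. */
static float dp4(const float a[4], const float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* One dot product per output component; rows[] is the transposed matrix. */
static void transform_dp4(const float rows[4][4], const float v[4], float out[4]) {
    for (int i = 0; i < 4; ++i) out[i] = dp4(v, rows[i]);
}

/* Self-check with small integer values, so the float results are exact. */
static void check_dp4(void) {
    const float rows[4][4] = {
        {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}, {4, 8, 12, 16}
    };
    const float v[4] = {2, 3, 4, 5};
    float out[4];
    transform_dp4(rows, v, out);
    assert(out[0] == 118.0f);  /* 2*1 + 3*5 + 4*9 + 5*13 */
    assert(out[1] == 132.0f);  /* 2*2 + 3*6 + 4*10 + 5*14 */
}
```

Each call to `dp4` yields a single scalar that has to be placed into one field of the result, which is where the write mask mentioned below comes in on vector hardware.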

The mul, madd, madd, madd sequence is usually preferred, as it does not require a write mask for the target register fields.

Otherwise the code looks good. =)