Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled

不知归路 2021-01-21 17:23

I am trying to understand the benefit of using SIMD vectorization and wrote a simple demonstrator code to see what would be the speed gain of an algorithm leveraging vectorizati

3 answers

生来不讨喜 (original poster)

2021-01-21 17:24

That is most likely instruction latency (a read-after-write, or RAW, dependency). Scalar ALU instructions have little to no latency, i.e. their results can be used as operands of the next instruction with almost no delay, whereas SIMD instructions tend to have longer latencies before their results are available, even for simple ones such as add.

Extend the arrays to 16 or even 32 entries, spanning 4 or 8 SIMD vectors, and you will see huge differences thanks to instruction scheduling: adds on different vectors do not depend on each other, so the next one can be issued while the previous one is still in flight.

Now (one vector): add v, latency, add v, latency, ...

4-vector rotation: add v1, add v2, add v3, add v4, add v1, add v2, ...
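As a rough illustration of the rotation above, here is a minimal sketch in portable C. It does not use real SIMD intrinsics: `vec4`, `vec_load`, `vec_add`, `vec_hsum`, `sum_one_chain` and `sum_four_chains` are hypothetical names made up for this example, and `vec4` only models a 4-lane vector register. What it shows is the dependency structure: one accumulator forms a single RAW chain, while four accumulators give four independent chains. Whether this turns into a measurable speedup depends on the CPU and on how the compiler schedules the code, so no timing is claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };  /* models one 128-bit vector of 32-bit lanes */

/* Hypothetical stand-in for a SIMD register (not a real intrinsic type). */
typedef struct { uint32_t lane[LANES]; } vec4;

static vec4 vec_load(const uint32_t *p) {
    vec4 v;
    for (int i = 0; i < LANES; ++i) v.lane[i] = p[i];
    return v;
}

static vec4 vec_add(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < LANES; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Horizontal sum of the four lanes. */
static uint32_t vec_hsum(vec4 v) {
    return v.lane[0] + v.lane[1] + v.lane[2] + v.lane[3];
}

/* One accumulator: every add must wait for the previous add (RAW chain). */
uint32_t sum_one_chain(const uint32_t *data, size_t n) {
    assert(n % LANES == 0);  /* assumption: n is a multiple of 4 */
    vec4 acc = {{0}};
    for (size_t i = 0; i < n; i += LANES)
        acc = vec_add(acc, vec_load(data + i));
    return vec_hsum(acc);
}

/* Four accumulators in rotation: the four adds in one iteration are
   independent of each other, so they can overlap in the pipeline. */
uint32_t sum_four_chains(const uint32_t *data, size_t n) {
    assert(n % (4 * LANES) == 0);  /* assumption: n is a multiple of 16 */
    vec4 a0 = {{0}}, a1 = {{0}}, a2 = {{0}}, a3 = {{0}};
    for (size_t i = 0; i < n; i += 4 * LANES) {
        a0 = vec_add(a0, vec_load(data + i));
        a1 = vec_add(a1, vec_load(data + i + LANES));
        a2 = vec_add(a2, vec_load(data + i + 2 * LANES));
        a3 = vec_add(a3, vec_load(data + i + 3 * LANES));
    }
    /* Combine the partial sums once, after the loop. */
    return vec_hsum(vec_add(vec_add(a0, a1), vec_add(a2, a3)));
}
```

Both functions return the same value for any input of suitable length, because unsigned addition can be reordered freely. With floating-point data the two versions may differ slightly in rounding.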

Search for "instruction scheduling" and "RAW dependency" for more detailed information.
