Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled

不知归路 2021-01-21 17:23

I am trying to understand the benefit of using SIMD vectorization and wrote a simple demonstrator code to see what would be the speed gain of an algorithm leveraging vectorizati

3 answers
  •  生来不讨喜
    2021-01-21 17:24

    That is most likely instruction latency (a RAW, i.e. read-after-write, dependency). Scalar ALU instructions have little to no latency, i.e. a result can be used as an operand of the next instruction without any delay, whereas SIMD instructions tend to have long latencies before their results are available, even for simple ones like add.

    Extend the arrays to 16 or even 32 entries, spanning 4 or 8 SIMD vectors, and you will see huge differences thanks to instruction scheduling.

    NOW: add v latency add v latency . . .

    4 vector rotation: add v1 add v2 add v3 add v4 add v1 add v2 . . .

    Search for "instruction scheduling" and "RAW dependency" for more detailed information.
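The two schedules in the answer above can be sketched in plain C to show the difference in dependency structure. This is only an illustration under stated assumptions: `vec4`, `vec4_add`, `sum_chained` and `sum_rotated` are hypothetical names that do not come from the question, and `vec4` is a 4-float struct standing in for one SIMD register, so the sketch shows which adds depend on which and does not measure anything.

```c
#include <stddef.h>

/* Hypothetical 4-lane vector standing in for one SIMD register. */
typedef struct { float lane[4]; } vec4;

/* Lane-wise add, the stand-in for a single SIMD add instruction. */
static vec4 vec4_add(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* "add v, latency, add v, latency": one accumulator, so every add
   must wait for the result of the previous one (a RAW chain). */
vec4 sum_chained(const vec4 *src, size_t n) {
    vec4 acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (size_t i = 0; i < n; ++i) acc = vec4_add(acc, src[i]);
    return acc;
}

/* "add v1, add v2, add v3, add v4": four accumulators form four
   independent chains, so consecutive adds do not depend on each other. */
vec4 sum_rotated(const vec4 *src, size_t n) {
    vec4 v1 = {{0.0f, 0.0f, 0.0f, 0.0f}};
    vec4 v2 = v1, v3 = v1, v4 = v1;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v1 = vec4_add(v1, src[i]);
        v2 = vec4_add(v2, src[i + 1]);
        v3 = vec4_add(v3, src[i + 2]);
        v4 = vec4_add(v4, src[i + 3]);
    }
    /* Leftover vectors when n is not a multiple of 4. */
    for (; i < n; ++i) v1 = vec4_add(v1, src[i]);
    /* Combine the four partial sums at the end. */
    return vec4_add(vec4_add(v1, v2), vec4_add(v3, v4));
}
```

Both functions compute the same lane-wise sum when the values are exactly representable. With general floating-point data the rotated version may round slightly differently because the additions are grouped in a different order. In a real test the struct would be replaced by the platform's SIMD type and intrinsics, and whether the rotation pays off depends on the target's add latency and throughput.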
