I am trying to understand the benefit of using SIMD vectorization and wrote a simple demonstrator code to see what would be the speed gain of an algorithm leveraging vectorizati
That must be the instruction latency. (RAW dependency) While the ALU instructions have little to no latency, ie the results can be the operands for the next instruction without any delay, SIMD instructions tend to have long latencies until the results are available even for such simple ones like add.
Extend the arrays to 16 or even 32 entries long spanning over 4 or 8 SIMD vectors, and you will see huge differences thanks to instruction scheduling.
NOW: add v latency add v latency . . .
4 vector rotation: add v1 add v2 add v3 add v4 add v1 add v2 . . .
Google for "instruction scheduling" and "raw dependency" for more detailed infos.