In what situation would the AVX2 gather instructions be faster than individually loading the data?

南笙 2020-12-08 10:09

I have been investigating the use of the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem, where one floating point

2 answers
  • 2020-12-08 10:39

    Newer microarchitectures have shifted the odds in favor of gather instructions. On an Intel Xeon Gold 6138 CPU @ 2.00 GHz (Skylake microarchitecture), your benchmark produces:

    9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 
    9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 
    9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 
    9.383e+09 8.86e+08 2.777e+09 6.915e+09 7.793e+09 8.335e+09 5.386e+09 4.92e+08 6.649e+09 1.421e+09 2.362e+09 2.7e+07 8.69e+09 5.9e+07 7.763e+09 3.926e+09 5.4e+08 3.426e+09 9.172e+09 5.736e+09 
    Array length 10000, function called 1000000 times.
    Gcc version: 6.32353
    Nonvectorized assembly implementation: 6.36922
    Vectorized without gather: 5.53553
    Vectorized with gather: 4.50673
    

    This shows that gathers may now be well worth the effort: the gather version takes about 19% less time than the vectorized version without gather.

  • 2020-12-08 10:44

    Unfortunately, the gathered load instructions are not particularly "smart": they seem to generate one bus cycle per element regardless of the load addresses, so even if the elements happen to be contiguous, there is apparently no internal logic for coalescing the loads. In terms of efficiency, a gathered load is therefore no better than N scalar loads, except that it uses only one instruction.

    The only real benefit of the gather instructions comes when you are implementing SIMD code anyway and need to load non-contiguous data to which you will then apply further SIMD operations. In that case a SIMD gathered load instruction will be a lot more efficient than the bunch of scalar code that would typically be generated by e.g. _mm256_set_xxx() (or a bunch of contiguous loads and permutes, etc., depending on the actual access pattern).
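The comparison in the answer above, a single gather versus a vector assembled from scalar loads with `_mm256_set_ps`, can be sketched as follows. This is a minimal illustration, not the benchmark from the question: the function names `gather8_avx2`, `set8_avx2`, and `load8` are hypothetical, and the code assumes GCC or Clang on x86-64 (it relies on the `target("avx2")` attribute and `__builtin_cpu_supports`). It shows only that both paths load the same eight elements; which one is faster depends on the microarchitecture, as the timings above suggest.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: load base[idx[0..7]] with one AVX2 gather. */
__attribute__((target("avx2")))
static void gather8_avx2(const float *base, const int *idx, float *out)
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 4 because the indices count floats, not bytes */
    __m256 v = _mm256_i32gather_ps(base, vidx, 4);
    _mm256_storeu_ps(out, v);
}

/* Hypothetical helper: same result built from eight scalar loads. */
__attribute__((target("avx2")))
static void set8_avx2(const float *base, const int *idx, float *out)
{
    /* _mm256_set_ps takes its arguments from the highest lane down to lane 0 */
    __m256 v = _mm256_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]],
                             base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
    _mm256_storeu_ps(out, v);
}

/* Dispatcher (assumption: GCC/Clang builtin for CPU detection).
 * Returns 1 if an AVX2 path ran, 0 if the plain scalar fallback ran. */
static int load8(const float *base, const int *idx, float *out, int use_gather)
{
    assert(base != NULL && idx != NULL && out != NULL);
    if (__builtin_cpu_supports("avx2")) {
        if (use_gather)
            gather8_avx2(base, idx, out);
        else
            set8_avx2(base, idx, out);
        return 1;
    }
    for (int i = 0; i < 8; ++i)
        out[i] = base[idx[i]];
    return 0;
}
```

Calling `load8(table, idx, out, 1)` and `load8(table, idx, out, 0)` with the same `table` and `idx` should fill `out` identically. In real SIMD code the gathered `__m256` would stay in a register and feed further vector operations instead of being stored straight back to memory.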
