AVX 512 vs AVX2 performance for simple array processing loops [closed]


Question


I'm currently working on some optimizations, comparing vectorization possibilities for DSP applications that seem ideal for AVX-512, since these are just simple uncorrelated array-processing loops. But on a new i9 I didn't measure any reasonable improvement using AVX-512 compared to AVX2. Any pointers? Any good results? (By the way, I tried MSVC, Clang, and ICL; no noticeable difference, and the AVX-512 code often actually seems slower.)


Answer 1:


This seems too broad, but there are actually some microarchitectural details worth mentioning.

Note that AVX512VL (Vector Length) lets you use new AVX-512 instructions (like packed uint64_t <-> double conversion, mask registers, etc.) on 128- and 256-bit vectors. Modern compilers typically auto-vectorize with 256-bit vectors when tuning for Skylake-AVX512, aka Skylake-X (e.g. gcc -march=native or gcc -march=skylake-avx512), unless you override the tuning options to set the preferred vector width to 512 for code where the tradeoffs are worth it. See @zam's answer.
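To make the tuning point concrete, here is a minimal sketch of the kind of pure-vertical loop compilers auto-vectorize at the preferred width (the function name is hypothetical; the flags in the comment are the GCC spellings mentioned in this thread):

```c
#include <stddef.h>

/* A simple DSP-style loop. Compiled with e.g.
 *   gcc -O3 -march=skylake-avx512                            (defaults to 256-bit vectors)
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512  (uses 512-bit ZMM vectors)
 * the compiler auto-vectorizes at the preferred width; the scalar source
 * is identical either way. */
void scale_add(float *dst, const float *a, const float *b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * k + b[i];   /* one FMA per element on AVX2 / AVX-512 */
}
```

The same source compiles to XMM, YMM, or ZMM code depending only on the tuning flags, which is what makes the 256-vs-512 comparison a pure compiler-option experiment.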


Some major things with 512-bit vectors (as opposed to 256-bit vectors with AVX-512 instructions like vpxord ymm30, ymm29, ymm10) on Skylake-X are:

  • Aligning your data to the vector width is more important than with AVX2 (every unaligned load crosses a cache-line boundary, instead of every other one while looping over an array). In practice it makes a bigger difference. I don't remember the exact numbers from something I tested a while ago, but maybe a 20% slowdown from misalignment with 512-bit vectors vs. under 5% with 256-bit.

  • Running 512-bit uops shuts down the vector ALU on port 1. (But not the integer execution units on port 1.) Some Skylake-X CPUs (e.g. Xeon Bronze) only have one-per-clock 512-bit FMA throughput, but i7 / i9 Skylake-X CPUs, and the higher-end Xeons, have an extra 512-bit FMA unit on port 5 that powers up for AVX-512 "mode".

    So plan accordingly: you won't get double speed from widening to AVX512, and the bottleneck in your code might now be in the back-end.

  • Running 512-bit uops also limits your max Turbo, so wall-clock speedups can be lower than core-clock-cycle speedups. There are two levels of Turbo reduction: any 512-bit operation at all, and then heavy 512-bit, like sustained FMAs.

  • The FP divide execution unit for vsqrtps/pd zmm and vdivps/pd is not full width; it's only 128 bits wide, so the ratio of div/sqrt to multiply throughput is worse by about another factor of 2. See Floating point division vs floating point multiplication. SKX throughput for vsqrtps xmm/ymm/zmm is one per 3/6/12 cycles. Double precision has the same ratios, but worse throughput and latency.

    Up to 256-bit YMM vectors, the latency is the same as XMM (12 cycles for sqrt), but for 512-bit ZMM the latency goes up to 20 cycles, and it takes 3 uops. (https://agner.org/optimize/ for instruction tables.)

    If you bottleneck on the divider and can't get more other instructions in the mix, VRSQRT14PS is worth considering even if you need a Newton iteration to get enough precision. (Note that AVX-512's approximate 1/sqrt(x) has more guaranteed-accuracy bits than the AVX/SSE version.)

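On the alignment point above: a minimal C11 sketch of allocating buffers aligned to the 64-byte ZMM / cache-line width (alloc_zmm_aligned is a hypothetical helper name):

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate a float buffer aligned to 64 bytes, the ZMM vector / cache-line
 * width, using C11 aligned_alloc. aligned_alloc requires the size to be a
 * multiple of the alignment, so round the byte count up first. */
static float *alloc_zmm_aligned(size_t n_floats)
{
    size_t bytes = n_floats * sizeof(float);
    size_t rounded = (bytes + 63) & ~(size_t)63;  /* round up to 64 */
    return aligned_alloc(64, rounded);
}
```

With 64-byte-aligned buffers, every aligned 512-bit load touches exactly one cache line, which avoids the per-load cache-line splits described above.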

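On the VRSQRT14PS point: the standard refinement is one Newton-Raphson step, y1 = y0 * (1.5 - 0.5 * x * y0 * y0). A scalar sketch of that step (hypothetical function name; the real code would apply it to a vector of vrsqrt14ps results):

```c
/* One Newton-Raphson iteration refining an approximate reciprocal square
 * root y0 of x. Each step roughly doubles the number of accurate bits,
 * which is why one step after a 14-bit AVX-512 approximation is often
 * enough for single precision. */
static float rsqrt_newton_step(float x, float y0)
{
    return y0 * (1.5f - 0.5f * x * y0 * y0);
}
```

The appeal is throughput: multiplies and FMAs run on the fully pipelined FMA units instead of the narrow divide unit.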
As far as auto-vectorization, if there are any shuffles required, compilers might do a worse job with wider vectors. For simple pure-vertical stuff, compilers can do ok with AVX512.

Your previous question had a sin function, and maybe if the compiler / SIMD math library only has a 256-bit version of that it won't auto-vectorize with AVX512.

If AVX512 doesn't help, maybe you're bottlenecked on memory bandwidth. Profile with performance counters and find out. Or try more repeats of smaller buffer sizes and see if it speeds up significantly when your data is hot in cache. If so, try to cache-block your code, or increase computational intensity by doing more in one pass over the data.
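A minimal sketch of raising computational intensity (hypothetical two_pass/one_pass names): fusing two passes over the data into one does the same arithmetic with half the memory traffic:

```c
#include <stddef.h>

/* Two separate passes: the array streams through the memory hierarchy twice. */
void two_pass(float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) x[i] = x[i] * 2.0f;   /* pass 1 */
    for (size_t i = 0; i < n; i++) x[i] = x[i] + 1.0f;   /* pass 2 */
}

/* Fused: identical result, but each element is loaded and stored only once,
 * doubling the arithmetic done per byte of memory traffic. */
void one_pass(float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) x[i] = x[i] * 2.0f + 1.0f;
}
```

For buffers bigger than cache, the fused version is the one that lets wider vectors actually pay off, since the bottleneck moves back toward the execution units.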

AVX-512 does double the theoretical max FMA throughput on an i9 (and integer multiply, and many other things that run on the same execution unit), making the mismatch between DRAM and execution-unit bandwidth twice as big. So there's twice as much to gain from making better use of L2 / L1d cache.

Working with data while it's already loaded in registers is good.




Answer 2:


How did you compile your code (i.e. how did you enable AVX-512) with ICL or GCC? There are two "operating modes" for AVX-512 code:

  1. For a recent Intel Compiler (starting with 18.0 / 17.0.5), using [Qa]xCORE-AVX512 only enables AVX512VL, which basically means the AVX-512 ISA but with 256-bit-wide operands. This also seems to be the default behavior for GCC.
  2. Otherwise, if (a) using an older Intel Compiler, or (b) using [Qa]xCOMMON-AVX512, or (c) using the special new flag [Q/q]opt-zmm-usage=high, you get the full AVX-512 ISA with 512-bit-wide operands. (The sophisticated flag logic is described here.) This mode can also be enabled with -mprefer-vector-width=512 on GCC 8 or newer.
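For concreteness, the two modes map to compile lines roughly like these (flag spellings as given in this answer; exact ICL syntax may vary by compiler version and platform):

```shell
# Mode 1: AVX-512 ISA with 256-bit operands (default tuning)
gcc -O3 -march=skylake-avx512 code.c
icl /O3 /QxCORE-AVX512 code.c

# Mode 2: full 512-bit ZMM operands
gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 code.c
icl /O3 /QxCORE-AVX512 /Qopt-zmm-usage=high code.c
```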

If your code is "AVX-512-friendly" (you have long sequences of well-vectorized code without scalar pieces "interrupting" the stream of vector instructions), mode (2) is much preferable, and you have to enable it explicitly (it is not the default).

Otherwise, if your code is not very AVX-512-friendly (many non-vectorized pieces of code in between the vector code), then because of SKX "frequency throttling", AVX512VL can sometimes be more beneficial (at least until you vectorize more of the code), so you should make sure you are operating in mode (1). The frequency-vs-ISA landscape is described, for example, in Daniel Lemire's blog posts (although the picture given there is a bit overpessimistic compared to reality): https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/ and https://lemire.me/blog/2018/08/13/the-dangers-of-avx-512-throttling-myth-or-reality/



Source: https://stackoverflow.com/questions/52523349/avx-512-vs-avx2-performance-for-simple-array-processing-loops
