Question
I'm currently trying to compile software for use on an HPC cluster using Intel compilers. The login node, where I compile and prepare the computations, uses Intel Xeon Gold 6148 processors, while the compute nodes use either Haswell processors (Intel Xeon E5-2660 v3 / Intel Xeon E5-2680 v3) or Skylake processors (Intel Xeon Gold 6138).
As far as I understand from the links above, my login node supports Intel SSE4.2, Intel AVX, Intel AVX2, and Intel AVX-512, but my compute nodes only support either Intel AVX2 (Haswell) or Intel AVX-512 (Skylake).
If I compile with the option -xHost on the login node, it should automatically use the highest instruction set available. But which one is the highest? And how can I ensure that my program runs on both compute systems with the best performance? Do I have to compile two versions?
Bonus question: Which -march do I have to specify in this case?
Answer 1:
Since you are using the Intel compiler, you can use its "Automatic Processor Dispatch" capability to create "fat" generic binaries that contain SSE-compatible, AVX-compatible, and further code paths side by side. When you run such a fat binary on an SSE-only machine, only the SSE-optimized code path is executed. When you run the same binary on an AVX machine, the AVX-optimized code path is executed. This is a very powerful and not so well-known feature.
You can enable it by combining the -ax and -x Intel compiler flags. The basic idea is that you specify the highest ISA(s) via -ax and the default ("lowest") ISA via -x.
The -ax fat-binary technique is briefly described at https://www.chpc.utah.edu/documentation/software/single-executable.php#submit
More details can be found on page 9 of this nice slide deck: https://www.alcf.anl.gov/files/ken_intel_compiler_optimization.pdf
Finally, I should mention that your description slightly confuses how the ISAs relate to each other. Intel x86 processors with AVX-512 always support AVX2 as well, and AVX2 machines always support SSE. As a heavily oversimplified explanation: AVX-512 is more or less a superset of AVX/AVX2, while AVX/AVX2 can be seen as a superset of SSE (strictly speaking it is not, but SSE is always available on AVX machines, and not vice versa).
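One way to see this relationship on the actual nodes is to compare the CPU flags that Linux reports in /proc/cpuinfo. The following is only a minimal Python sketch: the helper name supported_levels and the two flag strings are illustrative assumptions (the strings are abbreviated and were not dumped from the processors in question).

```python
# ISA levels of interest, ordered from oldest to newest, using the flag
# names that Linux prints in the "flags" line of /proc/cpuinfo.
ISA_ORDER = ["sse4_2", "avx", "avx2", "avx512f"]


def supported_levels(cpu_flags):
    """Return the entries of ISA_ORDER found in a space-separated flag string."""
    flags = set(cpu_flags.split())
    return [isa for isa in ISA_ORDER if isa in flags]


# Abbreviated, illustrative flag strings (assumptions, not real node dumps).
haswell = "sse sse2 sse4_1 sse4_2 avx avx2 fma"
skylake = "sse sse2 sse4_1 sse4_2 avx avx2 fma avx512f avx512cd avx512bw avx512dq avx512vl"

print(supported_levels(haswell))  # ['sse4_2', 'avx', 'avx2']
print(supported_levels(skylake))  # ['sse4_2', 'avx', 'avx2', 'avx512f']
```

On a real node, the "flags" line of /proc/cpuinfo can be passed in place of the sample strings; everything the Haswell string reports also appears in the Skylake string.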
In any case, you mentioned Haswell (an AVX2 machine, so SSE is on board, but naturally no AVX-512) and Skylake (an AVX-512 machine, so AVX2 and SSE are on board). You therefore probably need something like -axCORE-AVX512 -xCORE-AVX2, since your list contains no machines below AVX2 (that is, no SSE-only or AVX(1)-only machines); you seem to have only Skylake and Haswell servers. Note that -xHost on the Skylake login node would target AVX-512, so the resulting binary would not run on the Haswell nodes.
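The rule "lowest common ISA goes to -x, anything higher goes to -ax" can be sketched in a few lines of Python. This is a hedged illustration, not an official tool: the helper names (highest_target, dispatch_options), the simplified mapping from /proc/cpuinfo flags to Intel target names, and the abbreviated flag strings are all assumptions made for this example.

```python
# Intel -x / -ax target names, ordered from lowest to highest, each paired with
# the /proc/cpuinfo flags it is ASSUMED to require (a simplified mapping).
TARGETS = [
    ("SSE4.2", {"sse4_2"}),
    ("AVX", {"avx"}),
    ("CORE-AVX2", {"avx2", "fma"}),
    ("CORE-AVX512", {"avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl"}),
]


def highest_target(cpu_flags):
    """Index of the highest TARGETS entry whose flags are all present, or -1."""
    flags = set(cpu_flags.split())
    best = -1
    for index, (_, needed) in enumerate(TARGETS):
        if needed <= flags:
            best = index
    return best


def dispatch_options(nodes):
    """Suggest -ax/-x options for a dict mapping node names to flag strings."""
    levels = sorted({highest_target(flags) for flags in nodes.values()})
    if levels[0] < 0:
        raise ValueError("a node supports none of the listed targets")
    baseline = TARGETS[levels[0]][0]                 # lowest common level
    extras = [TARGETS[i][0] for i in levels[1:]]     # additional code paths
    options = []
    if extras:
        options.append("-ax" + ",".join(extras))
    options.append("-x" + baseline)
    return " ".join(options)


# Abbreviated, illustrative flag strings (assumptions, not real node dumps).
haswell = "sse4_2 avx avx2 fma"
skylake = "sse4_2 avx avx2 fma avx512f avx512cd avx512bw avx512dq avx512vl"

print(dispatch_options({"haswell": haswell, "skylake": skylake}))
# -axCORE-AVX512 -xCORE-AVX2
```

If every node is of the same generation, the sketch returns only a -x option, since there is no higher code path to add.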
Answer 2:
Take a look at Function Multiversioning. Although it is not a perfect solution for your problem, it seems like a good candidate...
Source: https://stackoverflow.com/questions/62215122/which-avx-and-march-should-be-specified-on-a-cluster-with-different-architecture