Question
What is the best way to implement multiple versions of the same function, so that it uses specific CPU instructions if they are available (tested at run time) and falls back to a slower implementation if not?
For example, x86 BMI2 provides a very useful PDEP instruction. How would I write C code that tests for BMI2 on the executing CPU at startup and then uses one of two implementations: one that calls _pdep_u64 (available with -mbmi2), and another that does the bit manipulation "by hand" in C? Is there any built-in support for such cases? How would I make GCC compile for an older architecture while still providing access to the newer intrinsic? I suspect execution is faster if the function is invoked via a global function pointer rather than an if/else on every call?
Answer 1:
You can declare a function pointer and point it at the correct version at program startup, after calling cpuid to determine which features the current CPU supports.
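As a minimal sketch of that function-pointer approach for the PDEP case, assuming GCC or Clang on x86-64: the names pdep_bmi2, pdep_soft, pdep_impl, pdep_init and pdep64 are made up for illustration, and the compiler builtin __builtin_cpu_supports stands in for a hand-written cpuid query. The target attribute enables BMI2 for one function only, so the file can still be compiled for an older baseline without -mbmi2.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>

/* BMI2 is enabled for this one function only, so the rest of the
   translation unit can still target an older baseline. */
__attribute__((target("bmi2")))
static uint64_t pdep_bmi2(uint64_t src, uint64_t mask) {
    return _pdep_u64(src, mask);
}
#endif

/* Portable fallback: scatter the low bits of src into the set bits of mask. */
static uint64_t pdep_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); /* lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                     /* clear that bit */
    }
    return result;
}

/* Global function pointer; starts at the safe fallback. */
static uint64_t (*pdep_impl)(uint64_t, uint64_t) = pdep_soft;

/* Call once at startup, before the first pdep64() call. */
void pdep_init(void) {
#if defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        pdep_impl = pdep_bmi2;
#endif
}

/* Every call goes through the pointer chosen by pdep_init(). */
uint64_t pdep64(uint64_t src, uint64_t mask) {
    return pdep_impl(src, mask);
}
```

Calling pdep_init() first and then pdep64(5, 26) deposits the bits 101 into mask positions 1, 3 and 4 and yields 18 on either path. If pdep_init() is never called, the pointer simply stays on the fallback.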
But it's better to use the support that many modern compilers provide. Intel's ICC has long had automatic function dispatching that selects the optimized version for each architecture. I don't know the details, but it looks like it only applies to Intel's libraries. Besides, it only dispatches to the efficient version on Intel CPUs, which is unfair to other manufacturers. There are many patches and workarounds for that in Agner's CPU blog.
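Later, GCC 4.8 introduced a feature called Function Multiversioning. You declare each version of your function with the target attribute, and a version marked "default" is required as the fallback. In GCC this form relies on function overloading and has been supported in the C++ front end, so the example below is C++:
// Fallback version; the dispatcher needs it for CPUs that match nothing else
__attribute__ ((target ("default")))
int foo() { return 0; }
__attribute__ ((target ("sse4.2")))
int foo() { return 1; }
__attribute__ ((target ("arch=atom")))
int foo() { return 2; }
int main() {
    int (*p)() = &foo;  // taking the address also goes through the dispatcher
    return foo() + p();
}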
That duplicates a lot of code and is cumbersome, so GCC 6 added the target_clones attribute, which tells GCC to compile a single function into multiple clones. For example, __attribute__((target_clones("avx2","arch=atom","default"))) void foo() {} creates three different versions of foo. More information can be found in GCC's documentation on function attributes.
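A small sketch of target_clones in plain C, assuming GCC 6 or later (or a recent Clang) on x86-64 Linux with glibc, since the clones are dispatched through ifunc. The MULTIVERSION macro and sum_u32 are hypothetical names; the macro only keeps the file compiling on other targets. The compiler can auto-vectorize the loop differently in each clone, which is where target_clones helps. It does not turn a hand-written bit loop into PDEP, so the intrinsic case still needs a separate target("bmi2") function.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: ifunc-based cloning is only available on x86-64 Linux here;
   elsewhere the function is compiled once, as usual. */
#if defined(__x86_64__) && defined(__linux__)
#define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSION
#endif

/* One source definition; the compiler emits an AVX2 clone and a baseline
   clone, plus a resolver that picks between them. */
MULTIVERSION
uint64_t sum_u32(const uint32_t *values, size_t count) {
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Callers just call sum_u32(); no startup code or function pointer is visible in the source.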
The syntax was later adopted by Clang and ICC. Performance can even be better than with a global function pointer, because the function symbols are resolved once by the dynamic loader when the process loads (through GNU indirect functions, ifunc) rather than being chosen by the program's own code at run time. It's one of the reasons Intel's Clear Linux runs so fast. ICC may also create multiple versions of a single loop during auto-vectorization.
- Function multi-versioning in GCC 6
- Function Multi-Versioning
- The - surprisingly limited - usefulness of function multiversioning in GCC
- Generate code for multiple SIMD architectures
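The load-time resolution mentioned above can also be written by hand with the GNU ifunc attribute, which is what the multiversioning attributes generate on x86 Linux. This is a sketch under the assumption of GCC or Clang on x86-64 Linux with glibc; pdep_fallback, pdep_hw, resolve_pdep and pdep_ifunc are illustrative names, not part of any library.

```c
#include <stdint.h>

/* Portable fallback, same idea as a hand-written PDEP. */
static uint64_t pdep_fallback(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        if (src & bit)
            result |= mask & (~mask + 1); /* lowest set bit of mask */
        mask &= mask - 1;                 /* clear that bit */
    }
    return result;
}

#if defined(__x86_64__) && defined(__linux__)
#include <immintrin.h>

__attribute__((target("bmi2")))
static uint64_t pdep_hw(uint64_t src, uint64_t mask) {
    return _pdep_u64(src, mask);
}

typedef uint64_t pdep_fn(uint64_t, uint64_t);

/* Resolver: run by the dynamic loader when it processes the symbol.
   It executes very early, so it should do as little as possible. */
static pdep_fn *resolve_pdep(void) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("bmi2") ? pdep_hw : pdep_fallback;
}

/* Calls to pdep_ifunc() go to whichever function the resolver returned. */
uint64_t pdep_ifunc(uint64_t src, uint64_t mask)
    __attribute__((ifunc("resolve_pdep")));
#else
/* Assumption: no ifunc support on this target, so use the fallback. */
uint64_t pdep_ifunc(uint64_t src, uint64_t mask) {
    return pdep_fallback(src, mask);
}
#endif
```

Compared with a global function pointer, there is no initialization call to forget, and the choice is made before main() runs.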
Here's an example from "The one with multi-versioning (Part II)", along with its demo. It's about popcnt, but you get the idea:
#include <stdint.h>  // uint8_t, int64_t, uint64_t
__attribute__((target_clones("popcnt","default")))
int runPopcount64_builtin_multiarch_loop(const uint8_t* bitfield, int64_t size, int repeat) {
    // popcount64_builtin_multiarch_loop() is a helper that is not shown here
    int res = 0;
    const uint64_t* data = (const uint64_t*)bitfield;
    for (int r=0; r<repeat; r++)
        for (int i=0; i<size/8; i++) {
            res += popcount64_builtin_multiarch_loop(data[i]);
        }
    return res;
}
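Note that PDEP and PEXT are very slow on AMD CPUs before Zen 3, where they are microcoded, so on those CPUs the hardware path should not be enabled even though BMI2 is reported as supported. Enabling it only on Intel is the simplest safe rule.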
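The note above means the feature bit alone is not enough to decide. A conservative sketch, assuming GCC or Clang on x86-64, that treats the hardware path as worthwhile only when BMI2 is present and the CPU vendor is Intel. The helper name pdep_hw_is_fast is hypothetical; __builtin_cpu_supports and __builtin_cpu_is are the compiler builtins.

```c
#include <stdbool.h>

/* Hypothetical helper: true when the PDEP/PEXT instructions are both
   available and expected to be fast. Deliberately conservative. */
bool pdep_hw_is_fast(void) {
#if defined(__x86_64__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("bmi2") && __builtin_cpu_is("intel");
#else
    return false; /* assumption: no PDEP outside x86-64 */
#endif
}
```

A dispatcher would call this instead of checking "bmi2" directly. AMD models with fast PDEP could be admitted with an additional model check, which this sketch leaves out.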
Source: https://stackoverflow.com/questions/61005492/building-backward-compatible-binaries-with-newer-cpu-instructions-support