Building backward compatible binaries with newer CPU instructions support

Posted by 那年仲夏 on 2021-01-27 12:41:41

Question


What is the best way to implement multiple versions of the same function, so that it uses specific CPU instructions if they are available (tested at run time) and falls back to a slower implementation if not?

For example, x86 BMI2 provides the very useful PDEP instruction. How would I write C code that tests for BMI2 on the executing CPU at startup and then uses one of two implementations: one that calls _pdep_u64 (available with -mbmi2), and another that does the bit manipulation "by hand" in C? Is there any built-in support for such cases? How would I make GCC compile for an older architecture while still providing access to the newer intrinsic? I suspect execution is faster if the function is invoked through a global function pointer rather than through an if/else on every call.


Answer 1:


You can declare a function pointer and point it at the correct version at program startup, using cpuid to determine which features the current CPU supports.
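As a rough illustration of the function-pointer approach for the PDEP case from the question, here is a minimal sketch in C. The names pdep_soft, pdep_bmi2, pdep and pdep_init are made up for this example, and it assumes GCC or Clang on x86-64, where __builtin_cpu_supports performs the cpuid check. Marking only pdep_bmi2 with target("bmi2") should let the rest of the file be built for an older baseline without -mbmi2.

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1   /* assumption: GCC/Clang builtins are available */
#endif

/* Portable fallback: deposit the low bits of src at the set positions of mask. */
static uint64_t pdep_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & -mask;   /* lowest set bit still left in mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                 /* clear that bit and move on */
    }
    return result;
}

#ifdef HAVE_X86_DISPATCH
/* Only this function is compiled with BMI2 enabled. */
__attribute__((target("bmi2")))
static uint64_t pdep_bmi2(uint64_t src, uint64_t mask) {
    return _pdep_u64(src, mask);
}
#endif

/* Global function pointer; starts out on the safe fallback. */
static uint64_t (*pdep)(uint64_t, uint64_t) = pdep_soft;

/* Call once at program startup to pick the implementation. */
static void pdep_init(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        pdep = pdep_bmi2;
#endif
}
```

Call pdep_init() once at the start of main; after that, pdep(src, mask) goes through the pointer. On other compilers or architectures the sketch simply stays on the portable version.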

But it's better to use the support that many modern compilers provide. Intel's ICC has long had automatic function dispatching that selects the optimized version for each architecture. I don't know the details, but it looks like it only applies to Intel's libraries. Besides, it only dispatches to the efficient version on Intel CPUs, which is unfair to other manufacturers. Many patches and workarounds for that are described on Agner's CPU blog.

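Later, a feature called Function Multiversioning was introduced in GCC 4.8. It adds the target attribute, which you declare on each version of your function. GCC documents this form among its C++ extensions, so the following example is meant to be compiled as C++: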

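// Fallback used when no more specific version matches; GCC expects one to exist.
__attribute__ ((target ("default")))
int foo() { return 0; }

// Chosen on CPUs that support SSE4.2.
__attribute__ ((target ("sse4.2")))
int foo() { return 1; }

// Chosen on Intel Atom CPUs.
__attribute__ ((target ("arch=atom")))
int foo() { return 2; }

int main() {
    int (*p)() = &foo;   // taking the address also goes through the dispatcher
    return foo() + p();
}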

Writing every version by hand duplicates a lot of code and is cumbersome, so GCC 6 added the target_clones attribute, which tells GCC to compile one function into multiple clones. For example, __attribute__((target_clones("avx2","arch=atom","default"))) void foo() {} creates three versions of foo. More information can be found in GCC's documentation on function attributes.

The syntax was later adopted by Clang and ICC. Performance can even be better than with a global function pointer, because the function symbols can be resolved at process load time instead of at run time. It's one of the reasons Intel's Clear Linux runs so fast. ICC may also create multiple versions of a single loop during auto-vectorization.
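To make the load-time resolution more concrete, here is a hedged sketch of the same idea written by hand with the GNU ifunc attribute. The names popcount64, popcount64_soft, popcount64_hw and resolve_popcount64 are invented for this example, and it assumes GCC or Clang on x86-64 Linux with glibc; on anything else it falls back to a plain function.

```c
#include <stdint.h>   /* on glibc this also defines __GLIBC__ */

/* Portable fallback: clear the lowest set bit until none remain. */
static int popcount64_soft(uint64_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__)

/* Compiled with POPCNT enabled for this one function only. */
__attribute__((target("popcnt")))
static int popcount64_hw(uint64_t x) {
    return __builtin_popcountll(x);
}

/* Resolver: the dynamic loader calls it once while binding the symbol.
   It returns the address of the implementation to use from then on. */
static int (*resolve_popcount64(void))(uint64_t) {
    __builtin_cpu_init();   /* must be called explicitly inside a resolver */
    return __builtin_cpu_supports("popcnt") ? popcount64_hw : popcount64_soft;
}

/* Callers see an ordinary function; the symbol is bound via the resolver. */
int popcount64(uint64_t x) __attribute__((ifunc("resolve_popcount64")));

#else

/* Assumption: without ifunc support, just use the portable version. */
int popcount64(uint64_t x) { return popcount64_soft(x); }

#endif
```

Callers simply write popcount64(value). Where ifunc is available, the choice is made once when the symbol is bound, so no per-call check appears in the source.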

  • Function multi-versioning in GCC 6
  • Function Multi-Versioning
  • The - surprisingly limited - usefulness of function multiversioning in GCC
  • Generate code for multiple SIMD architectures

Here's an example from "The one with multi-versioning (Part II)", along with its demo. It's about popcnt, but you get the idea:

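#include <stdint.h>

// popcount64_builtin_multiarch_loop() is a helper from the linked article; it is not shown here.

// GCC emits one clone built with POPCNT enabled and one default clone.
__attribute__((target_clones("popcnt","default")))
int runPopcount64_builtin_multiarch_loop(const uint8_t* bitfield, int64_t size, int repeat) {
    int res = 0;
    const uint64_t* data = (const uint64_t*)bitfield;

    // size is in bytes, so size/8 is the number of 64-bit words
    for (int r=0; r<repeat; r++) {
        for (int i=0; i<size/8; i++) {
            res += popcount64_builtin_multiarch_loop(data[i]);
        }
    }

    return res;
}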

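Note that PDEP and PEXT are very slow on AMD CPUs before Zen 3, where they are implemented in microcode. On those CPUs the fallback is the better choice even though BMI2 is reported as available, so the simplest rule is to enable the hardware version only on Intel.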
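One possible way to apply that rule is to check the vendor as well as the feature bit. The helper name use_hw_pdep is hypothetical, and the sketch assumes the GCC/Clang builtins __builtin_cpu_supports and __builtin_cpu_is on x86-64; anywhere else it just returns false.

```c
#include <stdbool.h>

/* Hypothetical helper: true only when BMI2 is present AND the CPU is Intel.
   This is deliberately conservative and rejects every AMD CPU. */
static bool use_hw_pdep(void) {
#if defined(__x86_64__) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("bmi2") && __builtin_cpu_is("intel");
#else
    return false;   /* assumption: no hardware PDEP path on this platform */
#endif
}
```

A startup routine could consult use_hw_pdep() instead of testing BMI2 alone before switching away from the software fallback.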



Source: https://stackoverflow.com/questions/61005492/building-backward-compatible-binaries-with-newer-cpu-instructions-support
