Is there a fast C or C++ standard library function for double precision inverse square root?

醉话见心 2021-01-18 01:26

I find myself typing

double foo=1.0/sqrt(...);

a lot, and I've heard that modern processors have built-in inverse square root opcodes.

7 Answers
  • 2021-01-18 02:06

    If you find yourself writing the same thing over and over, you should think to yourself "function!":

    double invsqrt(const double x)
    {
        return 1.0 / std::sqrt(x);
    }
    

    Now the code is more self-documenting: readers don't have to deduce that 1.0 / std::sqrt(x) is the inverse square root; the name says so. You can also plug in whatever implementation you want, and every call site automatically uses the updated definition.

    To answer your question: no, there is no C(++) function for it. But now that you've made one, you can substitute your own definition if you find its performance lacking.

  • 2021-01-18 02:11

    Violating constraints 1. and 2. (and it's also not standard), but it still might help someone browsing through...

    I used ASMJIT to just-in-time compile the exact assembly operation you're looking for: RSQRTSS (single precision, ok, but it should be similar with double).

    My code is this (cf. also my answer in a different post):

       typedef float(*JITFunc)();
    
       JITFunc func;
       asmjit::JitRuntime jit_runtime;
       asmjit::CodeHolder code;
       code.init(jit_runtime.getCodeInfo());
    
       asmjit::X86Compiler cc(&code);
       cc.addFunc(asmjit::FuncSignature0<float>());
    
       float value = 2.71; // Some example value.
       asmjit::X86Xmm x = cc.newXmm();
       uint32_t *i = reinterpret_cast<uint32_t*>(&value);
       cc.mov(asmjit::x86::eax, i[0]);
       cc.movd(x, asmjit::x86::eax);
    
       cc.rsqrtss(x, x);   // THE asm function.
    
       cc.ret(x);
    
       cc.endFunc();
       cc.finalize();
    
       jit_runtime.add(&func, &code);
    
       // Now, func() can be used as the result to rsqrt(value).
    

    If you do the JIT compilation only once and call the result later with different values, this should be faster than 1.0/sqrt(...), though less accurate. The loss of accuracy is inherent to the built-in operation you're talking about: RSQRTSS is an approximation with a relative error of at most 1.5 × 2^-12.
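
    For comparison, the same RSQRTSS instruction can also be reached without a JIT through the SSE intrinsic _mm_rsqrt_ss. The following is only a minimal sketch, not the code from this answer: the helper name rsqrt_approx and the HAVE_SSE_RSQRT macro are made up for illustration, and the non-SSE branch simply falls back to 1.0f / std::sqrt(x).

```cpp
#include <cassert>
#include <cmath>

// Assumption: SSE is detected via these predefined compiler macros.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1
#else
#define HAVE_SSE_RSQRT 0
#endif

// Hypothetical helper: approximate 1/sqrt(x) in single precision.
inline float rsqrt_approx(float x)
{
    assert(x > 0.0f);  // precondition: positive input
#if HAVE_SSE_RSQRT
    // _mm_set_ss places x in the lowest lane, _mm_rsqrt_ss applies RSQRTSS
    // to that lane, and _mm_cvtss_f32 reads the lowest lane back.
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    // Portable fallback: exact division, no hardware approximation.
    return 1.0f / std::sqrt(x);
#endif
}
```

    On the SSE path the result is only an estimate (roughly 12 bits), so compare it against the exact value with a tolerance rather than with ==.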

  • 2021-01-18 02:17

    If you're not afraid of using your own functions, try the following:

    template <typename T>
    T invsqrt(T x)
    {
        return T(1) / std::sqrt(x);  // T(1) keeps a float argument from being promoted to double
    }
    

    It should be just as fast as the original 1.0 / std::sqrt(x) with any modern optimizing compiler. It can also be used with doubles or floats.
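
    As a usage sketch of the "works with doubles or floats" point, the variant below checks at compile time that the return type follows the argument type. The name inv_sqrt, the static_assert restriction to floating-point types, and the positive-input precondition are illustrative additions, not part of the answer above.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// Hypothetical variant of the template: the numerator is formed in T,
// so a float argument is divided in float rather than in double.
template <typename T>
T inv_sqrt(T x)
{
    static_assert(std::is_floating_point<T>::value,
                  "inv_sqrt expects float, double, or long double");
    assert(x > T(0));  // precondition: positive input
    return T(1) / std::sqrt(x);
}

// The return type follows the argument type.
static_assert(std::is_same<decltype(inv_sqrt(4.0f)), float>::value,
              "float in, float out");
static_assert(std::is_same<decltype(inv_sqrt(4.0)), double>::value,
              "double in, double out");
```

    Call sites then read inv_sqrt(x) for either type, for example inv_sqrt(4.0) for a double or inv_sqrt(4.0f) for a float.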

  • 2021-01-18 02:20

    You can use this function for a faster inverse square root computation.
    There's an article on Wikipedia explaining how it works: https://en.wikipedia.org/wiki/Fast_inverse_square_root
    Here is a C version of the algorithm (single precision, and only approximate):

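    #include <stdint.h>  /* uint32_t */
    
    /* Approximates 1/sqrt(number) for positive, normal floats. */
    float invSqrt( float number ){
        /* Union type punning: valid in C, undefined behavior in C++. */
        union {
            float f;
            uint32_t i;
        } conv;
    
        float x2;
        const float threehalfs = 1.5F;
    
        x2 = number * 0.5F;
        conv.f  = number;
        /* Initial guess computed from the bit pattern. */
        conv.i  = 0x5f3759df - ( conv.i >> 1 );
        /* One Newton-Raphson step to refine the guess. */
        conv.f  = conv.f * ( threehalfs - ( x2 * conv.f * conv.f ) );
        return conv.f;
    }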
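
    Since the question asks about double precision, here is a hedged sketch of the same trick for double. It is not part of the answer above: the function name inv_sqrt_fast is made up, the 64-bit magic constant is the one quoted in the linked Wikipedia article, and running two Newton-Raphson steps instead of one is an arbitrary choice. std::memcpy replaces the union so the type punning is also well defined in C++.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical double-precision analogue of invSqrt.
// Assumes x is positive and normal, and that double is IEEE 754 binary64.
inline double inv_sqrt_fast(double x)
{
    assert(x > 0.0);
    const double half_x = 0.5 * x;

    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);          // reinterpret the double as an integer
    bits = 0x5FE6EB50C7B537A9ULL - (bits >> 1);   // initial guess from the bit pattern
    double y;
    std::memcpy(&y, &bits, sizeof y);             // reinterpret the integer as a double

    // Two Newton-Raphson steps: y <- y * (1.5 - 0.5 * x * y * y).
    y = y * (1.5 - half_x * y * y);
    y = y * (1.5 - half_x * y * y);
    return y;
}
```

    Even with two steps the result is still an approximation, far from full double precision, so whether it beats 1.0 / sqrt(x) on a given machine has to be measured.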
    
  • 2021-01-18 02:23

    No. No, there isn't. Not in C++. Nope.

  • 2021-01-18 02:24

    I don't know of a standardized C API for this, but that does not mean you cannot use the fast inverse square root instructions, as long as you are willing to write platform-dependent intrinsics.

    Let's take 64-bit x86 with AVX, for example, where you can use _mm256_rsqrt_ps() to approximate the reciprocal of a square root or, more specifically, eight reciprocal square roots in a single go, using SIMD.

    #include <immintrin.h>
    
    ...
    
    float inputs[8] __attribute__ ((aligned (32))) = { ... };  /* the attribute must precede the initializer */
    __m256 input = _mm256_load_ps(inputs);
    __m256 invroot = _mm256_rsqrt_ps(input);
    
    

    Similarly, you can use the intrinsic vrsqrteq_f32 on ARM with NEON. In this case, the SIMD is 4-wide, so it will compute four inverse square roots in a single go.

    #include <arm_neon.h>
    
    ...
    
    float32x4_t sqrt_reciprocal = vrsqrteq_f32(x);
    

    Even if you need just one result per batch, this is still faster than a full square root. Just set the input in all lanes, or in one lane, of the SIMD register; that way you do not have to go through memory with a load operation. On x86, broadcasting a value to all lanes is done via _mm256_set1_ps(x).
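
    The hardware estimate is only good to roughly 12 bits, so a common follow-up is one Newton-Raphson step on the result. The sketch below is an illustration under stated assumptions, not part of the answer above: it uses the 4-wide SSE intrinsic _mm_rsqrt_ps instead of the AVX one so that it builds on x86-64 without extra compiler flags, the names rsqrt_refined and HAVE_SSE_RSQRT are made up, and any leftover elements (or a non-SSE target) fall back to 1.0f / std::sqrt(x).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumption: SSE is detected via these predefined compiler macros.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1
#else
#define HAVE_SSE_RSQRT 0
#endif

// Hypothetical helper: out[i] = approx. 1/sqrt(in[i]) for n positive floats.
inline void rsqrt_refined(const float* in, float* out, std::size_t n)
{
    std::size_t i = 0;
#if HAVE_SSE_RSQRT
    const __m128 half = _mm_set1_ps(0.5f);          // broadcast constants to all lanes
    const __m128 three_halves = _mm_set1_ps(1.5f);
    for (; i + 4 <= n; i += 4) {
        const __m128 x = _mm_loadu_ps(in + i);      // unaligned load of four floats
        __m128 y = _mm_rsqrt_ps(x);                 // four hardware estimates
        // One Newton-Raphson step per lane: y <- y * (1.5 - 0.5 * x * y * y)
        const __m128 xyy = _mm_mul_ps(_mm_mul_ps(half, x), _mm_mul_ps(y, y));
        y = _mm_mul_ps(y, _mm_sub_ps(three_halves, xyy));
        _mm_storeu_ps(out + i, y);
    }
#endif
    for (; i < n; ++i) {                            // scalar tail and non-SSE fallback
        assert(in[i] > 0.0f);
        out[i] = 1.0f / std::sqrt(in[i]);
    }
}
```

    The refinement costs a few extra multiplies per vector, so whether estimate plus refinement still beats a plain square root and divide depends on the processor and is worth benchmarking.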
