Why don't C++ compilers do better constant folding?

迷失自我 2021-01-30 16:18

I'm investigating ways to speed up a large section of C++ code, which has automatic derivatives for computing jacobians. This involves doing some amount of work in the actual r

3 answers
  • 2021-01-30 16:29

    "I was disappointed to find that, without fast-math enabled, neither GCC 8.2, Clang 6 or MSVC 19 were able to make any optimizations at all over the naive dot-product with a matrix full of 0s."

    Unfortunately, they have no other choice. Since IEEE floats have signed zeros, adding 0.0 is not an identity operation:

    -0.0 + 0.0 = 0.0 // Not -0.0!
    

    Similarly, multiplying by zero does not always yield zero:

    0.0 * Infinity = NaN // Not 0.0!
    

    So the compilers simply cannot perform these constant folds in the dot product while retaining IEEE float compliance - for all they know, your input might contain signed zeros and/or infinities.
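
The two counterexamples above can be checked directly. The following is a minimal sketch, assuming the code is compiled without -ffast-math (or any of its sub-flags); the helper names add_zero, times_zero and add_zero_changes_sign are hypothetical and exist only for this illustration.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helpers for illustration only.

// Adds +0.0 to x. Under strict IEEE 754 rules this is NOT a no-op:
// for x = -0.0 the result is +0.0.
double add_zero(double x) { return x + 0.0; }

// Multiplies x by 0.0. Under strict IEEE 754 rules this is NOT always zero:
// for x = +/-infinity (or NaN) the result is NaN.
double times_zero(double x) { return x * 0.0; }

// True if replacing `x + 0.0` by `x` would change the sign bit of the result.
bool add_zero_changes_sign(double x) {
    return std::signbit(add_zero(x)) != std::signbit(x);
}
```

Calling add_zero_changes_sign(-0.0) and times_zero(std::numeric_limits<double>::infinity()) shows the two cases a compiler has to preserve when it is not allowed to assume finite, unsigned-zero arithmetic.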

    You will have to use -ffast-math to get these folds, but that may have undesired consequences. You can get more fine-grained control with specific flags (from http://gcc.gnu.org/wiki/FloatingPointMath). According to the above explanation, adding the following two flags should allow the constant folding:
    -ffinite-math-only, -fno-signed-zeros

    Indeed, you get the same assembly as with -ffast-math this way: https://godbolt.org/z/vGULLA. You only give up the signed zeros (probably irrelevant), NaNs and the infinities. Presumably, if you were to still produce them in your code, you would get undefined behavior, so weigh your options.


    As for why your example is not optimized better even with -ffast-math: that is down to Eigen. Presumably it explicitly vectorizes its matrix operations, which is much harder for compilers to see through. A simple loop is properly optimized with these options: https://godbolt.org/z/OppEhY
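
For reference, here is a sketch of the kind of plain scalar loop meant here. It is not necessarily the code behind the link; the names kSize, Vec, dot_loop and first_component are assumptions made for this example, and the size 12 is picked to match the 12-element case discussed in this thread.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kSize = 12;  // assumed size for this sketch
using Vec = std::array<double, kSize>;

// Plain scalar dot product: no intrinsics, so the optimizer sees every
// multiplication and addition individually.
double dot_loop(const Vec& x, const Vec& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < kSize; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Dot product with a constant vector {1, 0, 0, ...}. With
// -ffinite-math-only -fno-signed-zeros the compiler is permitted to fold
// the zero terms away; without those flags it must keep them.
double first_component(const Vec& b) {
    const Vec a{1.0};  // remaining elements are value-initialized to 0.0
    return dot_loop(a, b);
}
```

Whether a given compiler version actually reduces first_component to a single load is best confirmed by inspecting the assembly, for instance on Compiler Explorer.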

  • 2021-01-30 16:37

    This is because Eigen explicitly vectorizes your code as 3 vmulpd, 2 vaddpd and 1 horizontal reduction within the remaining 4 component registers (this assumes AVX; with SSE only, you'll get 6 mulpd and 5 addpd). With -ffast-math, GCC and clang are allowed to remove the last 2 vmulpd and vaddpd (and this is what they do), but they cannot really replace the remaining vmulpd and horizontal reduction that have been explicitly generated by Eigen.

    So what if you disable Eigen's explicit vectorization by defining EIGEN_DONT_VECTORIZE? Then you get what you expected (https://godbolt.org/z/UQsoeH) but other pieces of code might become much slower.

    If you want to locally disable explicit vectorization and are not afraid of messing with Eigen's internals, you can introduce a DontVectorize option to Matrix and disable vectorization by specializing traits<> for this Matrix type:

    static const int DontVectorize = 0x80000000;
    
    namespace Eigen {
    namespace internal {
    
    template<typename _Scalar, int _Rows, int _Cols, int _MaxRows, int _MaxCols>
    struct traits<Matrix<_Scalar, _Rows, _Cols, DontVectorize, _MaxRows, _MaxCols> >
    : traits<Matrix<_Scalar, _Rows, _Cols> >
    {
      typedef traits<Matrix<_Scalar, _Rows, _Cols> > Base;
      enum {
        EvaluatorFlags = Base::EvaluatorFlags & ~PacketAccessBit
      };
    };
    
    }
    }
    
    using ArrayS12d = Eigen::Matrix<double,12,1,DontVectorize>;
    

    Full example here: https://godbolt.org/z/bOEyzv

  • 2021-01-30 16:40

    One way to force a compiler to optimize multiplications by 0's and 1's is to manually unroll the loop. For simplicity let's use

    #include <array>
    #include <cstddef>
    constexpr std::size_t n = 12;
    using Array = std::array<double, n>;
    

    Then we can implement a simple dot function using fold expressions (or recursion if they are not available):

    #include <utility>
    template<std::size_t... is>
    double dot(const Array& x, const Array& y, std::index_sequence<is...>)
    {
        return ((x[is] * y[is]) + ...);
    }
    
    double dot(const Array& x, const Array& y)
    {
        return dot(x, y, std::make_index_sequence<n>{});
    }
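
If fold expressions (C++17) are not available, the recursive variant mentioned above could look roughly like the following. This is a sketch only: DotFrom, dot_recursive, kLen and Vec12 are hypothetical names chosen so they do not clash with n, Array and dot from this answer.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLen = 12;  // same size as n in the answer
using Vec12 = std::array<double, kLen>;

// Compile-time recursion over the index I: each instantiation adds one
// product and delegates the rest to the next index, so the whole sum is
// spelled out as straight-line code with no runtime loop.
template <std::size_t I>
struct DotFrom {
    static double apply(const Vec12& x, const Vec12& y) {
        return x[I] * y[I] + DotFrom<I + 1>::apply(x, y);
    }
};

// Recursion terminator: the last element has nothing left to add.
template <>
struct DotFrom<kLen - 1> {
    static double apply(const Vec12& x, const Vec12& y) {
        return x[kLen - 1] * y[kLen - 1];
    }
};

double dot_recursive(const Vec12& x, const Vec12& y) {
    return DotFrom<0>::apply(x, y);
}
```

The additions associate to the right, in the same order as the unary right fold ((x[is] * y[is]) + ...), so the rounding behaviour should match the fold-expression version.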
    

    Now let's take a look at your function

    double test(const Array& b)
    {
        const Array a{1};    // = {1, 0, ...}
        return dot(a, b);
    }
    

    With -ffast-math gcc 8.2 produces:

    test(std::array<double, 12ul> const&):
      movsd xmm0, QWORD PTR [rdi]
      ret
    

    clang 6.0.0 goes along the same lines:

    test(std::array<double, 12ul> const&): # @test(std::array<double, 12ul> const&)
      movsd xmm0, qword ptr [rdi] # xmm0 = mem[0],zero
      ret
    

    As another example, for

    double test(const Array& b)
    {
        const Array a{1, 1};    // = {1, 1, 0...}
        return dot(a, b);
    }
    

    we get

    test(std::array<double, 12ul> const&):
      movsd xmm0, QWORD PTR [rdi]
      addsd xmm0, QWORD PTR [rdi+8]
      ret
    

    Addendum: Clang unrolls a plain for (std::size_t i = 0; i < n; ++i) ... loop without any of these fold-expression tricks; gcc doesn't and needs some help.
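
One other form of "help" that might be worth trying, and that is not part of the original answer, is the #pragma GCC unroll directive available since GCC 8. The sketch below assumes that full unrolling is what GCC needs before it can fold the constant terms; I have not verified the generated assembly, so treat it as an experiment. The names kCount, VecN and dot_unrolled are hypothetical.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kCount = 12;  // assumed size, as elsewhere in this thread
using VecN = std::array<double, kCount>;

// Same plain loop as before, with a request to unroll it 12 times.
// Compilers that do not know this pragma ignore it (possibly with a
// warning), so the function stays correct either way.
double dot_unrolled(const VecN& x, const VecN& y) {
    double sum = 0.0;
#pragma GCC unroll 12
    for (std::size_t i = 0; i < kCount; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Unlike the fold-expression version, this sums from left to right, which only matters for rounding when -ffast-math is off.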
