Performance of coefficient-wise array operations of the Eigen library with MKL backend

Submitted by 懵懂的女人 on 2019-12-24 04:33:07

Question


I am porting a Matlab algorithm with lots of coefficient-wise array operations to C++, which look like this example, but are often much more complex:

Eigen::Array<double, Eigen::Dynamic, 1> tx2(6);
tx2 << 1,2,3,4,5,6;
Eigen::Array<double, Eigen::Dynamic, 1> tx1(6);
tx1 << 7,8,9,10,11,12;
Eigen::Array<double, Eigen::Dynamic, 1> x = (tx1 + tx2) / 2;

The C++ code turned out to be significantly slower than Matlab (around 20%). As a next step I tried enabling Eigen's Intel MKL backend, which made no measurable difference to performance. Is it possible that MKL does not improve coefficient-wise vector operations? Is there a way to test whether I linked MKL successfully? Are there faster alternatives to the Eigen vector classes? Thanks in advance!

Edit: I'm using VS 2013 on an i7-3820 running Windows 7 64-bit. A longer example would be:

    Array<double, Dynamic, 1> ts = (k2 / (6 * b.pow(3)) + k / b - b / 2) - (k2 / (6 * a.pow(3)) + k / a - a / 2);
    Array<double, Dynamic, 1> tp1 = -2 * r2*(b - a)/ (rp.pow(2));
    Array<double, Dynamic, 1> tp2 = -2 * r2*rp*log(b / a) / rm2;
    Array<double, Dynamic, 1> tp3 = r2*(b.pow(-1) - a.pow (-1)) / 2;
    Array<double, Dynamic, 1> tp4 = 16 * r2.pow(2)*(r2.pow(2) + 1)*log((2 * rp*b - rm2) / (2 * rp*a - rm2)) / (rp.pow(3)*rm2);
    Array<double, Dynamic, 1> tp5 = 16 * r2.pow(3)*((2 * rp*b - rm2).pow(-1) - (2 * rp*a - rm2).pow(-1)) / rp.pow(3);
    Array<double, Dynamic, 1> tp = tp1 + tp2 + tp3 + tp4 + tp5;
    Array<double, Dynamic, 1> f = (ts + tp) / (2 * ds*ds);

Relevant part of CMakeLists.txt:

    set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
    set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
    target_link_libraries(MK ${VTK_LIBRARIES} ${Boost_LIBRARIES} mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib)

So far I have only defined EIGEN_USE_MKL_ALL.


Answer 1:


In short, if you have Intel's C++ compiler, use that.

I constructed an MCVE to test a few of the assumptions made here. We want to test

  1. Linking of MKL
  2. Eigen's vectorization of
    1. Addition
    2. Multiplication
    3. pow(double)
  3. Compilers' effects

with Visual Studio 2013.

#include <iostream>

//#define EIGEN_DONT_VECTORIZE

// SSE>2 doesn't affect these tests
#ifndef EIGEN_DONT_VECTORIZE // Not needed with Intel C++ Compiler XE 15.0
    #define EIGEN_VECTORIZE_SSE4_2
    #define EIGEN_VECTORIZE_SSE4_1
    #define EIGEN_VECTORIZE_SSSE3
    #define EIGEN_VECTORIZE_SSE3
#endif

#define EIGEN_USE_MKL_ALL 

#include <Eigen/Core>
#include <cstdlib>
#include <ctime>
#include <chrono>

#include <mkl.h>

int main(int argc, char* argv[])
{
    srand(time(NULL));
    std::cout << Eigen::SimdInstructionSetsInUse() << "\n";

    int sz = 32 * 1024 * 1024;

    double dummyAdd, dummyMult, dummyPow;

    // Quick test to show linking worked
    {
        float a[16] = {23.54f};
        float r[16] = {0.f};
        float b = 2.f;

        vsPowx(4, a, b, r);
        std::cout << r[0] << "\n";
    }

    Eigen::ArrayXd v1 = Eigen::ArrayXd::Random(sz);
    Eigen::ArrayXd v2 = Eigen::ArrayXd::Random(sz);
    Eigen::ArrayXd v3 = Eigen::ArrayXd::Random(sz);

    auto startTime = std::chrono::high_resolution_clock::now();
    {
        v3 = v1 + v2;
        dummyAdd = v3.sum();
    }
    auto endTime = std::chrono::high_resolution_clock::now();

    std::cout << "Total Time (addition) " << dummyAdd << " = " <<
        std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count()
        << " milliseconds.\n";

    startTime = std::chrono::high_resolution_clock::now();
    {
        v1 = v3 * v2;   // coefficient-wise product
        dummyMult = v1.sum();
    }
    endTime = std::chrono::high_resolution_clock::now();

    std::cout << "Total Time (multiplication) " << dummyMult << " = " <<
        std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count()
        << " milliseconds.\n";

    startTime = std::chrono::high_resolution_clock::now();
    {
        v3 = v1.pow(3.5);   // coefficient-wise power; negative inputs yield NaN
        dummyPow = v3.sum();
    }
    endTime = std::chrono::high_resolution_clock::now();

    std::cout << "Total Time (pow(3.5)) " << dummyPow << " = " <<
        std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count()
        << " milliseconds.\n";

    return 0;

}
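Each timing in the program above comes from a single run. If noise is a concern, one option is to repeat each measurement and keep the fastest run. The following is a minimal sketch under that assumption; `best_time_ms` is a hypothetical helper, not part of Eigen or MKL, and it was not used to produce the numbers reported in this answer.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper (not from the original answer): calls `work` exactly
// `repeats` times and returns the duration of the fastest call in milliseconds.
template <typename Work>
double best_time_ms(Work&& work, int repeats)
{
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        // Convert the elapsed time to fractional milliseconds.
        const std::chrono::duration<double, std::milli> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

For instance, `best_time_ms([&] { v3 = v1 + v2; }, 10)` would time the addition from the MCVE ten times and report the quickest run.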

I then compiled with cl (Visual Studio's compiler) and with Intel C++ Compiler XE 15.0, both with and without EIGEN_DONT_VECTORIZE and EIGEN_USE_MKL_ALL. I compiled without OpenMP for these tests. The results (on an i5-3470) were interesting. For cl, I saw no difference whether or not MKL was linked, but a slight improvement with vectorization enabled. Without and with vectorization, respectively:

None
554.132
Total Time (addition) -2006.37 = 130 milliseconds.
Total Time (multiplication) 1.11832e+007 = 137 milliseconds.
Total Time (pow(3.5)) -1.#IND = 1730 milliseconds.

and

SSE, SSE2
554.132
Total Time (addition) -689.959 = 86 milliseconds.
Total Time (multiplication) 1.1175e+007 = 87 milliseconds.
Total Time (pow(3.5)) -1.#IND = 1695 milliseconds.

So we see that the addition and multiplication appear to be vectorized, but pow is not affected by MKL.
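A side note on the `-1.#IND` printed for the pow test: that is how the Visual C++ runtime of that era prints a NaN. `ArrayXd::Random` fills the array with values in [-1, 1], and raising a negative number to a non-integer power such as 3.5 has no real result, so the sum becomes NaN. The sketch below reproduces the effect with plain `std::pow`, without Eigen; `sum_of_pow` and `sum_of_abs_pow` are hypothetical helper names, and taking the absolute value first is just one assumed way to keep the benchmark input inside pow's real domain.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sums pow(v, exponent) over all values. A single negative value combined
// with a non-integer exponent makes the whole sum NaN.
double sum_of_pow(const std::vector<double>& values, double exponent)
{
    assert(!values.empty());
    double total = 0.0;
    for (double v : values)
        total += std::pow(v, exponent);
    return total;
}

// Same sum, but on |v|, so every term stays a real number.
// (An assumption for illustration, not what the original benchmark did.)
double sum_of_abs_pow(const std::vector<double>& values, double exponent)
{
    assert(!values.empty());
    double total = 0.0;
    for (double v : values)
        total += std::pow(std::fabs(v), exponent);
    return total;
}
```

Because the NaN results may also influence how long pow takes, the absolute pow timings are probably best read as rough indications.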

The Intel compiler showed similar behavior, but did better with pow.

None
554.132
Total Time (addition) 7594.98 = 96 milliseconds.
Total Time (multiplication) 1.11818e+007 = 94 milliseconds.
Total Time (pow(3.5)) -1.#IND = 921 milliseconds.

and

SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
554.132
Total Time (addition) -1953.37 = 87 milliseconds.
Total Time (multiplication) 1.11796e+007 = 87 milliseconds.
Total Time (pow(3.5)) -1.#IND = 838 milliseconds.

without EIGEN_USE_MKL_ALL and

SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
554.132
Total Time (addition) 1512.55 = 87 milliseconds.
Total Time (multiplication) 1.11759e+007 = 89 milliseconds.
Total Time (pow(3.5)) -1.#IND = 843 milliseconds.

with EIGEN_USE_MKL_ALL.

I can understand the Intel compiler's tendency to optimize code aggressively, to the point of matching MKL's performance, but I would have expected to see some difference in the cl results. Bottom line: use the Intel C++ compiler if you need better performance.




Answer 2:


Replace calls to pow(2), pow(3), and the like with square() and cube(). The same goes for pow(-1), which is better replaced by a division. I hope Matlab is able to do all these kinds of optimizations for you, but in C++ such compile-time optimizations would only be possible by working at the compiler level.
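As a rough illustration of that advice, here is a sketch in plain C++ (no Eigen) based on the `tp3` line from the question, `r2*(b.pow(-1) - a.pow(-1)) / 2`. The function names are hypothetical, and `std::vector` stands in for the Eigen arrays; with Eigen arrays the corresponding members would be `square()`, `cube()`, and `inverse()` or a plain `1 / x`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference form: calls the general-purpose std::pow for every element,
// mirroring r2 * (b.pow(-1) - a.pow(-1)) / 2 from the question.
std::vector<double> tp3_with_pow(const std::vector<double>& r2,
                                 const std::vector<double>& a,
                                 const std::vector<double>& b)
{
    assert(r2.size() == a.size() && a.size() == b.size());
    std::vector<double> out(r2.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = r2[i] * (std::pow(b[i], -1.0) - std::pow(a[i], -1.0)) / 2.0;
    return out;
}

// Rewritten form: pow(-1) is replaced by a division.
std::vector<double> tp3_with_division(const std::vector<double>& r2,
                                      const std::vector<double>& a,
                                      const std::vector<double>& b)
{
    assert(r2.size() == a.size() && a.size() == b.size());
    std::vector<double> out(r2.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = r2[i] * (1.0 / b[i] - 1.0 / a[i]) / 2.0;
    return out;
}

// Small integer powers written as multiplications instead of pow(x, 2) / pow(x, 3).
inline double square_of(double x) { return x * x; }
inline double cube_of(double x) { return x * x * x; }
```

Both forms compute the same values up to rounding; whether the rewrite is measurably faster depends on the compiler and math library, so it is worth timing on the target setup.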



Source: https://stackoverflow.com/questions/32092971/performance-of-coefficient-wise-array-operations-of-the-eigen-library-with-mkl-b
