Is Eigen slow at multiplying small matrices?

半阙折子戏 2021-02-06 08:19

I wrote a function that multiplies Eigen matrices of dimension 10x10 together. Then I wrote a naive multiply function CustomMultiply which was surprisingly 2x faster.

2 Answers
  • 2021-02-06 09:06

    I've rewritten your code using a proper benchmark library, namely Google Benchmark, and cannot reproduce your measurements.

    My results for -O0, where the second template parameter is the matrix dimension:

    Running ./benchmark
    Run on (12 X 2900 MHz CPU s)
    CPU Caches:
      L1 Data 32K (x6)
      L1 Instruction 32K (x6)
      L2 Unified 262K (x6)
      L3 Unified 12582K (x1)
    ---------------------------------------------------------------------
    Benchmark                              Time           CPU Iterations
    ---------------------------------------------------------------------
    BM_CustomMultiply<double, 3>        5391 ns       5389 ns     105066
    BM_CustomMultiply<double, 4>        9365 ns       9364 ns      73649
    BM_CustomMultiply<double, 5>       15349 ns      15349 ns      44008
    BM_CustomMultiply<double, 6>       20953 ns      20947 ns      32230
    BM_CustomMultiply<double, 7>       33328 ns      33318 ns      21584
    BM_CustomMultiply<double, 8>       44237 ns      44230 ns      15500
    BM_CustomMultiply<double, 9>       57142 ns      57140 ns      11953
    BM_CustomMultiply<double, 10>      69382 ns      69382 ns       9998
    BM_EigenMultiply<double, 3>         2335 ns       2335 ns     295458
    BM_EigenMultiply<double, 4>         1613 ns       1613 ns     457382
    BM_EigenMultiply<double, 5>         4791 ns       4791 ns     142992
    BM_EigenMultiply<double, 6>         3471 ns       3469 ns     206002
    BM_EigenMultiply<double, 7>         9052 ns       9051 ns      78135
    BM_EigenMultiply<double, 8>         8655 ns       8655 ns      81717
    BM_EigenMultiply<double, 9>        11446 ns      11399 ns      67001
    BM_EigenMultiply<double, 10>       15092 ns      15053 ns      46924
    

    As you can see, the number of iterations Google Benchmark uses is orders of magnitude higher than in your benchmark. Micro-benchmarking is extremely hard, especially when you are dealing with execution times of a few hundred nanoseconds.

    To be fair, calling your custom function involves a copy; manually inlining it gains a few nanoseconds, but still doesn't beat Eigen.
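    The iteration-count point applies to any hand-rolled timing loop: to resolve an operation that takes well under a microsecond, the timer has to be amortized over many calls. A self-contained sketch of the idea with `std::chrono` (the `CustomMultiply` body here is an illustrative stand-in, since the original code is not shown; Google Benchmark additionally picks the iteration count automatically and guards results against being optimized away):

    ```cpp
    #include <array>
    #include <chrono>
    #include <cstdio>

    // Naive fixed-size multiply in the spirit of the question's CustomMultiply
    // (illustrative stand-in; the original implementation is not shown here).
    template <typename T, int N>
    std::array<T, N * N> CustomMultiply(const std::array<T, N * N>& a,
                                        const std::array<T, N * N>& b) {
        std::array<T, N * N> c{};
        for (int i = 0; i < N; ++i)
            for (int k = 0; k < N; ++k)
                for (int j = 0; j < N; ++j)
                    c[i * N + j] += a[i * N + k] * b[k * N + j];
        return c;
    }

    int main() {
        constexpr int N = 10;
        std::array<double, N * N> a{}, b{};
        for (int i = 0; i < N * N; ++i) {
            a[i] = i * 0.5;
            b[i] = (i % N == i / N);  // b = identity matrix
        }

        // Keep the result observable so the loop is less likely to be elided.
        volatile double sink = 0;

        constexpr int kIterations = 10'000;  // amortize the timer over many calls
        auto start = std::chrono::steady_clock::now();
        for (int it = 0; it < kIterations; ++it)
            sink = CustomMultiply<double, N>(a, b)[0];
        auto elapsed = std::chrono::steady_clock::now() - start;

        std::printf("%.1f ns per multiply\n",
                    std::chrono::duration<double, std::nano>(elapsed).count() /
                        kIterations);
        (void)sink;
    }
    ```

    Even this sketch is fragile: a sufficiently smart optimizer can hoist the pure call out of the loop, which is exactly why a library with `benchmark::DoNotOptimize`-style barriers is preferable for measurements this small.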

    Measurement with manually inlined CustomMultiply and -O2 -DNDEBUG -march=native:

    Running ./benchmark
    Run on (12 X 2900 MHz CPU s)
    CPU Caches:
      L1 Data 32K (x6)
      L1 Instruction 32K (x6)
      L2 Unified 262K (x6)
      L3 Unified 12582K (x1)
    ---------------------------------------------------------------------
    Benchmark                              Time           CPU Iterations
    ---------------------------------------------------------------------
    BM_CustomMultiply<double, 3>          51 ns         51 ns   11108114
    BM_CustomMultiply<double, 4>          88 ns         88 ns    7683611
    BM_CustomMultiply<double, 5>         147 ns        147 ns    4642341
    BM_CustomMultiply<double, 6>         213 ns        213 ns    3205627
    BM_CustomMultiply<double, 7>         308 ns        308 ns    2246391
    BM_CustomMultiply<double, 8>         365 ns        365 ns    1904860
    BM_CustomMultiply<double, 9>         556 ns        556 ns    1254953
    BM_CustomMultiply<double, 10>        661 ns        661 ns    1027825
    BM_EigenMultiply<double, 3>           39 ns         39 ns   17918807
    BM_EigenMultiply<double, 4>           69 ns         69 ns    9931755
    BM_EigenMultiply<double, 5>          119 ns        119 ns    5801185
    BM_EigenMultiply<double, 6>          178 ns        178 ns    3838772
    BM_EigenMultiply<double, 7>          256 ns        256 ns    2692898
    BM_EigenMultiply<double, 8>          385 ns        385 ns    1826598
    BM_EigenMultiply<double, 9>          546 ns        546 ns    1271687
    BM_EigenMultiply<double, 10>         644 ns        644 ns    1104798
    
  • 2021-02-06 09:15
    (gdb) bt
    #0  0x00005555555679e3 in Eigen::internal::gemm_pack_rhs<double, long, Eigen::internal::const_blas_data_mapper<double, long, 0>, 4, 0, false, false>::operator()(double*, Eigen::internal::const_blas_data_mapper<double, long, 0> const&, long, long, long, long) ()
    #1  0x0000555555566654 in Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>::run(long, long, long, double const*, long, double const*, long, double*, long, double, Eigen::internal::level3_blocking<double, double>&, Eigen::internal::GemmParallelInfo<long>*) ()
    #2  0x0000555555565822 in BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State&) ()
    #3  0x000055555556d571 in benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::Benchmark::Instance const*, unsigned long, int, benchmark::internal::ThreadManager*) ()
    #4  0x000055555556b469 in benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) ()
    #5  0x000055555556a450 in main ()
    

    From the stack trace, Eigen's matrix multiplication is using a generic multiply routine that loops over a dynamic matrix size. For the custom implementation, clang aggressively vectorizes it and unrolls the loops, so there's much less branching.

    Maybe there's some flag/option for Eigen to generate code specialized for this particular size.

    However, if the matrix size is bigger, the Eigen version will perform much better than the custom one.
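    The compile-time-size point can be sketched without Eigen: when the dimension is a template parameter, the optimizer sees every trip count and can fully unroll and vectorize, which is what clang does to CustomMultiply; with a runtime dimension it must emit generic, branchy loops like the gemm path in the backtrace. A minimal illustration (hypothetical names, same multiply written both ways):

    ```cpp
    #include <cstddef>

    // Dimension as a compile-time constant: trip counts are known, so the
    // compiler can fully unroll and vectorize the loops.
    template <typename T, int N>
    void MultiplyFixed(const T* a, const T* b, T* c) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                T sum = 0;
                for (int k = 0; k < N; ++k) sum += a[i * N + k] * b[k * N + j];
                c[i * N + j] = sum;
            }
    }

    // Dimension as a runtime value: identical loops, but the compiler has to
    // keep the loop-bound branches, closer to a generic dynamic-size product.
    template <typename T>
    void MultiplyDynamic(const T* a, const T* b, T* c, int n) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                T sum = 0;
                for (int k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }
    ```

    In Eigen terms, the analogous choice is a fixed-size type such as `Eigen::Matrix<double, 10, 10>` over `Eigen::MatrixXd`: for sizes known at compile time Eigen can select a coefficient-based product (also reachable explicitly via `a.lazyProduct(b)`) instead of the cache-blocked GEMM visible in the backtrace, which pays off for large matrices but adds overhead for tiny ones.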
