Is Eigen slow at multiplying small matrices?

半阙折子戏 2021-02-06 08:19

I wrote a function that multiplies Eigen matrices of dimension 10x10 together. Then I wrote a naive multiply function, CustomMultiply, which was surprisingly 2x faster.

2 answers
  •  被撕碎了的回忆
    2021-02-06 09:06

    I've rewritten your code using a proper benchmark library, namely Google Benchmark, and I cannot reproduce your measurements.

    My results with -O0, where the second template parameter is the matrix dimension:

    Running ./benchmark
    Run on (12 X 2900 MHz CPU s)
    CPU Caches:
      L1 Data 32K (x6)
      L1 Instruction 32K (x6)
      L2 Unified 262K (x6)
      L3 Unified 12582K (x1)
    ---------------------------------------------------------------------
    Benchmark                              Time           CPU Iterations
    ---------------------------------------------------------------------
    BM_CustomMultiply        5391 ns       5389 ns     105066
    BM_CustomMultiply        9365 ns       9364 ns      73649
    BM_CustomMultiply       15349 ns      15349 ns      44008
    BM_CustomMultiply       20953 ns      20947 ns      32230
    BM_CustomMultiply       33328 ns      33318 ns      21584
    BM_CustomMultiply       44237 ns      44230 ns      15500
    BM_CustomMultiply       57142 ns      57140 ns      11953
    BM_CustomMultiply       69382 ns      69382 ns       9998
    BM_EigenMultiply         2335 ns       2335 ns     295458
    BM_EigenMultiply         1613 ns       1613 ns     457382
    BM_EigenMultiply         4791 ns       4791 ns     142992
    BM_EigenMultiply         3471 ns       3469 ns     206002
    BM_EigenMultiply         9052 ns       9051 ns      78135
    BM_EigenMultiply         8655 ns       8655 ns      81717
    BM_EigenMultiply        11446 ns      11399 ns      67001
    BM_EigenMultiply        15092 ns      15053 ns      46924
    

    As you can see, the number of iterations Google Benchmark uses is orders of magnitude higher than in your benchmark. Micro-benchmarking is extremely hard, especially when you are dealing with execution times of a few hundred nanoseconds.
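
    To illustrate why the iteration count matters, here is a minimal hand-rolled timing loop using only the standard library. It is a sketch, not the code behind the numbers above: `Matrix`, `NaiveMultiply`, `TimingResult` and `TimeMultiply` are hypothetical names, and the matrix type is a plain `std::array` rather than an Eigen type. The loop averages over many calls and folds each result into a checksum so the work is not trivially dead code. An optimizer may still hoist the loop-invariant product, which is one reason a library such as Google Benchmark (with `benchmark::DoNotOptimize`) is the safer choice.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical fixed-size square matrix; not the asker's type and not Eigen's.
template <std::size_t N>
using Matrix = std::array<std::array<double, N>, N>;

// Naive triple-loop product: returns a * b.
template <std::size_t N>
Matrix<N> NaiveMultiply(const Matrix<N>& a, const Matrix<N>& b) {
    Matrix<N> c{};  // value-initialized, so every entry starts at zero
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k)
            for (std::size_t j = 0; j < N; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

struct TimingResult {
    double ns_per_call;  // mean wall-clock time of one multiply, in nanoseconds
    double checksum;     // sum of one diagonal entry per iteration; keeps results live
};

// Times `iterations` multiplies and reports the mean time per call.
template <std::size_t N>
TimingResult TimeMultiply(const Matrix<N>& a, const Matrix<N>& b,
                          std::size_t iterations) {
    assert(iterations > 0);
    double checksum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t it = 0; it < iterations; ++it) {
        const Matrix<N> c = NaiveMultiply(a, b);
        checksum += c[it % N][it % N];  // consume part of every result
    }
    const auto stop = std::chrono::steady_clock::now();
    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {total_ns / static_cast<double>(iterations), checksum};
}
```

    Calling `TimeMultiply` with a handful of iterations and again with a few million typically shows how unstable the small-sample mean is: clock resolution, cold caches and frequency scaling dominate when the whole run lasts only microseconds.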

    To be fair, calling your custom function involves a copy, and manually inlining it saves a few nanoseconds, but it still does not beat Eigen.
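
    The asker's function is not reproduced here, so exactly where the copy happens is not known. One common source is taking the operands by value, and the sketch below makes that cost visible under that assumption. Everything in it is hypothetical: `CountedMatrix` is a stand-in 10x10 type whose copy constructor increments `g_copy_count`, and `MultiplyByValue` and `MultiplyInto` are illustrative names, not functions from the question or from Eigen.

```cpp
#include <cstddef>

constexpr std::size_t kDim = 10;

// Number of CountedMatrix copy constructions so far (illustration only).
int g_copy_count = 0;

// Hypothetical 10x10 matrix whose copy constructor records each copy.
struct CountedMatrix {
    double v[kDim][kDim] = {};

    CountedMatrix() = default;
    CountedMatrix(const CountedMatrix& other) {
        for (std::size_t i = 0; i < kDim; ++i)
            for (std::size_t j = 0; j < kDim; ++j)
                v[i][j] = other.v[i][j];
        ++g_copy_count;
    }
};

// Operands by const reference: no matrix is copied.
// `out` must not alias `a` or `b`.
void MultiplyInto(const CountedMatrix& a, const CountedMatrix& b,
                  CountedMatrix& out) {
    for (std::size_t i = 0; i < kDim; ++i) {
        for (std::size_t j = 0; j < kDim; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < kDim; ++k)
                sum += a.v[i][k] * b.v[k][j];
            out.v[i][j] = sum;
        }
    }
}

// Operands by value: passing lvalues copies both matrices on every call.
void MultiplyByValue(CountedMatrix a, CountedMatrix b, CountedMatrix& out) {
    MultiplyInto(a, b, out);
}
```

    With lvalue arguments, each call to `MultiplyByValue` raises `g_copy_count` by two, while `MultiplyInto` leaves it unchanged. For a 10x10 product that takes a few hundred nanoseconds, copying 800 bytes per operand is a measurable share of the total.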

    Measurement with manually inlined CustomMultiply and -O2 -DNDEBUG -march=native:

    Running ./benchmark
    Run on (12 X 2900 MHz CPU s)
    CPU Caches:
      L1 Data 32K (x6)
      L1 Instruction 32K (x6)
      L2 Unified 262K (x6)
      L3 Unified 12582K (x1)
    ---------------------------------------------------------------------
    Benchmark                              Time           CPU Iterations
    ---------------------------------------------------------------------
    BM_CustomMultiply          51 ns         51 ns   11108114
    BM_CustomMultiply          88 ns         88 ns    7683611
    BM_CustomMultiply         147 ns        147 ns    4642341
    BM_CustomMultiply         213 ns        213 ns    3205627
    BM_CustomMultiply         308 ns        308 ns    2246391
    BM_CustomMultiply         365 ns        365 ns    1904860
    BM_CustomMultiply         556 ns        556 ns    1254953
    BM_CustomMultiply         661 ns        661 ns    1027825
    BM_EigenMultiply           39 ns         39 ns   17918807
    BM_EigenMultiply           69 ns         69 ns    9931755
    BM_EigenMultiply          119 ns        119 ns    5801185
    BM_EigenMultiply          178 ns        178 ns    3838772
    BM_EigenMultiply          256 ns        256 ns    2692898
    BM_EigenMultiply          385 ns        385 ns    1826598
    BM_EigenMultiply          546 ns        546 ns    1271687
    BM_EigenMultiply          644 ns        644 ns    1104798
    
