I wrote a function that multiplies 10x10 Eigen matrices together. Then I wrote a naive multiply function, CustomMultiply,
which was surprisingly 2x faster.
I've rewritten your code using a proper benchmark library, namely Google Benchmark, and I cannot reproduce your measurements.
My results for -O0, where the second template parameter is the matrix dimension:
Running ./benchmark
Run on (12 X 2900 MHz CPU s)
CPU Caches:
L1 Data 32K (x6)
L1 Instruction 32K (x6)
L2 Unified 262K (x6)
L3 Unified 12582K (x1)
---------------------------------------------------------------------
Benchmark                  Time        CPU Iterations
---------------------------------------------------------------------
BM_CustomMultiply       5391 ns    5389 ns     105066
BM_CustomMultiply       9365 ns    9364 ns      73649
BM_CustomMultiply      15349 ns   15349 ns      44008
BM_CustomMultiply      20953 ns   20947 ns      32230
BM_CustomMultiply      33328 ns   33318 ns      21584
BM_CustomMultiply      44237 ns   44230 ns      15500
BM_CustomMultiply      57142 ns   57140 ns      11953
BM_CustomMultiply      69382 ns   69382 ns       9998
BM_EigenMultiply        2335 ns    2335 ns     295458
BM_EigenMultiply        1613 ns    1613 ns     457382
BM_EigenMultiply        4791 ns    4791 ns     142992
BM_EigenMultiply        3471 ns    3469 ns     206002
BM_EigenMultiply        9052 ns    9051 ns      78135
BM_EigenMultiply        8655 ns    8655 ns      81717
BM_EigenMultiply       11446 ns   11399 ns      67001
BM_EigenMultiply       15092 ns   15053 ns      46924
As you can see, the number of iterations Google Benchmark uses is orders of magnitude higher than in your benchmark. Micro-benchmarking is extremely hard, especially when you deal with execution times of a few hundred nanoseconds.
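To illustrate why the iteration count matters, here is a minimal sketch of the kind of timing loop a benchmark library runs internally. MeanNanosPerCall is a hypothetical helper written for this answer, not part of Google Benchmark or of your code. It reads the clock only twice and divides by the number of calls, so the cost of the clock reads is spread over all iterations instead of dominating a single short measurement.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the question and not a Google Benchmark API):
// calls `fn` `iterations` times and returns the mean wall-clock time per call
// in nanoseconds. Returns 0.0 when `iterations` is zero.
//
// Note: `fn` must have an observable effect (for example, writing its result
// to a variable the caller reads afterwards), otherwise an optimizing
// compiler is free to remove the work being measured.
template <typename Fn>
double MeanNanosPerCall(Fn&& fn, std::size_t iterations) {
  if (iterations == 0) {
    return 0.0;
  }
  using Clock = std::chrono::steady_clock;
  const auto start = Clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    fn();
  }
  const auto stop = Clock::now();
  // Convert the elapsed time to a floating-point nanosecond count.
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

This is only a sketch of the idea. Google Benchmark additionally picks the iteration count for you and provides benchmark::DoNotOptimize to keep the compiler from discarding the result, which is why I would still use the library rather than a hand-written loop.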
To be fair, calling your custom function involves a copy, and manually inlining it gains a few nanoseconds, but it still does not beat Eigen.
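As an alternative to inlining by hand, the copy can usually be avoided by taking the inputs by const reference and writing into a caller-provided output. The following is a sketch under assumptions: your CustomMultiply is not reproduced here, so SquareMatrix and NaiveMultiply are hypothetical names, and plain std::array is used instead of Eigen types to keep the example self-contained.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size matrix type; stands in for the Eigen matrix type
// used in the question.
template <typename Scalar, std::size_t N>
using SquareMatrix = std::array<std::array<Scalar, N>, N>;

// Hypothetical stand-in for CustomMultiply: a naive triple-loop product.
// Inputs are taken by const reference and the result is written into `out`,
// so no matrix is copied on the way in or out.
// Precondition: `out` must not be the same object as `a` or `b`.
template <typename Scalar, std::size_t N>
void NaiveMultiply(const SquareMatrix<Scalar, N>& a,
                   const SquareMatrix<Scalar, N>& b,
                   SquareMatrix<Scalar, N>& out) {
  for (std::size_t i = 0; i < N; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      Scalar sum{};  // value-initialized to zero for arithmetic types
      for (std::size_t k = 0; k < N; ++k) {
        sum += a[i][k] * b[k][j];
      }
      out[i][j] = sum;
    }
  }
}
```

With the dimension as a template parameter, the loop bounds are compile-time constants, which gives the optimizer the chance to unroll the loops at -O2.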
Measurement with manually inlined CustomMultiply and -O2 -DNDEBUG -march=native:
Running ./benchmark
Run on (12 X 2900 MHz CPU s)
CPU Caches:
L1 Data 32K (x6)
L1 Instruction 32K (x6)
L2 Unified 262K (x6)
L3 Unified 12582K (x1)
---------------------------------------------------------------------
Benchmark                  Time        CPU Iterations
---------------------------------------------------------------------
BM_CustomMultiply         51 ns      51 ns   11108114
BM_CustomMultiply         88 ns      88 ns    7683611
BM_CustomMultiply        147 ns     147 ns    4642341
BM_CustomMultiply        213 ns     213 ns    3205627
BM_CustomMultiply        308 ns     308 ns    2246391
BM_CustomMultiply        365 ns     365 ns    1904860
BM_CustomMultiply        556 ns     556 ns    1254953
BM_CustomMultiply        661 ns     661 ns    1027825
BM_EigenMultiply          39 ns      39 ns   17918807
BM_EigenMultiply          69 ns      69 ns    9931755
BM_EigenMultiply         119 ns     119 ns    5801185
BM_EigenMultiply         178 ns     178 ns    3838772
BM_EigenMultiply         256 ns     256 ns    2692898
BM_EigenMultiply         385 ns     385 ns    1826598
BM_EigenMultiply         546 ns     546 ns    1271687
BM_EigenMultiply         644 ns     644 ns    1104798