I'm studying simple multiplication of two big matrices using the Eigen library. This multiplication appears to be noticeably slower than both Matlab and Python for the same matrices.
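For reference, a minimal benchmark of the kind in question might look like the following; the matrix size n and the timing approach are my assumptions, not the original code:

    // Minimal sketch of an Eigen matrix-product benchmark.
    // Compile with optimizations, e.g.: g++ -O3 -DNDEBUG bench.cpp -o bench
    #include <Eigen/Dense>
    #include <chrono>
    #include <iostream>

    int main() {
        const int n = 2000;  // assumed size; the original post may use a different one
        Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
        Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);

        auto t0 = std::chrono::steady_clock::now();
        Eigen::MatrixXd c = a * b;  // assignment forces evaluation of the product
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        // Print part of the result so the compiler cannot discard the computation.
        std::cout << "c(0,0) = " << c(0, 0) << "\n"
                  << secs << " s, " << 2.0 * n * n * n / secs / 1e9 << " GFLOPS\n";
    }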
The reason Matlab is faster is that it uses the Intel MKL. Eigen can use it too (see Eigen's documentation on using Intel MKL), but you of course need to buy it.
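If you do have MKL, the hookup is a compile-time switch (a sketch; the exact link flags depend on your MKL installation):

    // Route Eigen's dense kernels to Intel MKL (supported since Eigen 3.1).
    // The define must appear before any Eigen header is included, and you
    // must link against MKL, e.g. with -lmkl_rt.
    #define EIGEN_USE_MKL_ALL
    #include <Eigen/Dense>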
That being said, there are a number of reasons Eigen can be slower. To compare Python vs. Matlab vs. Eigen, you'd really need to code three equivalent versions of the same operation in the respective languages. Also note that Matlab caches results, so you'd really need to start from a fresh Matlab session to be sure its magic isn't fooling you.
Also, Matlab's MEX overhead is not negligible. The OP there reports that newer versions "fix" the problem, but I'd be surprised if all the overhead had been cleared completely.
Eigen doesn't take advantage of the AVX instructions introduced with Intel's Sandy Bridge architecture. This probably explains most of the performance difference between Eigen and MATLAB. I found a branch that adds AVX support at https://bitbucket.org/benoitsteiner/eigen but as far as I can tell it has not been merged into the Eigen trunk yet.
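You can check which SIMD instruction sets your Eigen build actually vectorizes with; Eigen exposes this through Eigen::SimdInstructionSetsInUse():

    #include <Eigen/Core>
    #include <iostream>

    int main() {
        // On an AVX-capable CPU built against an Eigen version without AVX
        // support, this still reports only SSE-level instruction sets.
        std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
    }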
First of all, when doing performance comparisons, make sure you disable turbo-boost (TB). On my system, using gcc 4.5 from MacPorts and without turbo-boost, I get 3.5 s, which corresponds to 8.4 GFLOPS, while the theoretical peak of my 2.3 GHz Core i7 is 9.2 GFLOPS, so not too bad.
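To spell out the arithmetic: for an n-by-n double-precision product the usual flop count is 2n^3, so GFLOPS = 2n^3 / (seconds * 1e9). The 9.2 GFLOPS peak presumably comes from one core at 2.3 GHz retiring 4 double-precision flops per cycle with SSE (a 2-wide multiply plus a 2-wide add each cycle): 2.3 * 4 = 9.2.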
MatLab is based on the Intel MKL, and given the reported performance it is clearly using a multithreaded version. It is unlikely that a small library like Eigen can beat Intel on its own CPU!
NumPy can use any BLAS library: ATLAS, MKL, OpenBLAS, eigen-blas, etc. I guess that in your case it was using ATLAS, which is fast too.
Finally, here is how you can get better performance: enable multi-threading in Eigen by compiling with -fopenmp. By default, Eigen uses the number of threads defined by OpenMP. Unfortunately, that number corresponds to the number of logical cores, not physical cores, so make sure hyper-threading is disabled or set the OMP_NUM_THREADS environment variable to the number of physical cores (see the sketch below). Here I get 1.25 s (without TB), and 0.95 s with TB.
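A sketch of the threaded setup; the core count of 4 is my assumption, and Eigen::setNbThreads is an in-code alternative to setting OMP_NUM_THREADS:

    // Compile with: g++ -O3 -fopenmp bench.cpp -o bench
    // or run with:  OMP_NUM_THREADS=4 ./bench
    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        // Pin Eigen to the physical core count (4 is assumed here) instead of
        // OpenMP's default, which counts logical (hyper-threaded) cores.
        Eigen::setNbThreads(4);
        std::cout << "Eigen is using " << Eigen::nbThreads() << " threads\n";

        const int n = 2000;  // assumed size, as in the sketch above
        Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
        Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);
        Eigen::MatrixXd c = a * b;  // the product now runs multi-threaded
        std::cout << "c(0,0) = " << c(0, 0) << "\n";
    }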