UPDATE
Unfortunately, due to my oversight, I had an older version of MKL (11.1) linked against numpy. The newer version of MKL (11.3.1) gives the same performance.
I suspect this is due to unfortunate thread scheduling. I was able to reproduce an effect similar to yours: the Python version ran at ~2.2 s, while the C version showed huge variations, from 1.4 s to 2.2 s.
Applying:
KMP_AFFINITY=scatter,granularity=thread
This ensures that the 28 threads always run on the same processor thread, and reduces both runtimes to a more stable ~1.24 s for C and ~1.26 s for Python.
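As an illustration, a minimal sketch of such a pinned NumPy run could look like the following; the matrix size and setting the variable from inside the script via os.environ are assumptions, not part of the original benchmark:

```python
# Minimal sketch of a pinned NumPy/MKL benchmark (assumed setup, not the
# original benchmark). The affinity variable must be set before MKL's
# OpenMP runtime spawns its threads, i.e. before the first BLAS call.
import os
os.environ.setdefault("KMP_AFFINITY", "scatter,granularity=thread")

import time
import numpy as np

n = 4096                      # assumed problem size
a = np.random.rand(n, n)
b = np.random.rand(n, n)

a @ b                         # warm-up: lets MKL spawn and pin its threads

t0 = time.perf_counter()
a @ b
print(f"dgemm {n}x{n}: {time.perf_counter() - t0:.3f} s")
```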
This is on a 28 core dual socket Xeon E5-2680 v3 system.
Interestingly, on a very similar 24 core dual socket Haswell system, Python and C perform almost identically even without thread affinity / pinning.
Why does Python affect the scheduling? Well, I assume there is more runtime environment around it. The bottom line is: without pinning, your performance results will be non-deterministic.
Also, you need to consider that the Intel OpenMP runtime spawns an extra management thread that can confuse the scheduler. There are more choices for pinning, for instance KMP_AFFINITY=compact - but for some reason that is totally messed up on my system. You can add ,verbose to the variable to see how the runtime is pinning your threads.
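For example, a quick way to check the resulting mapping from a NumPy run (a sketch, assuming the variable is set from inside the script) is:

```python
# Sketch: ask the Intel OpenMP runtime to report its thread binding.
# The runtime prints the affinity mapping to stderr when it initializes,
# which happens at the first MKL-threaded call.
import os
os.environ["KMP_AFFINITY"] = "scatter,granularity=thread,verbose"

import numpy as np

a = np.random.rand(2048, 2048)
a @ a   # triggers OpenMP initialization; the binding report appears on stderr
```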
likwid-pin is a useful alternative providing more convenient control.
In general, single precision should be at least as fast as double precision: double precision moves twice as much data and packs only half as many elements into each vector instruction. I would think that once you get rid of the performance anomaly, this will be reflected in your numbers.
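A rough way to check this (a sketch with an assumed matrix size, not your original benchmark) is to time the same product in both precisions:

```python
# Sketch: compare single- vs double-precision GEMM in NumPy/MKL.
# With the threads pinned, float32 should be at least as fast as float64,
# since it moves half the data and packs twice as many elements per vector.
import time
import numpy as np

n = 4096                                  # assumed problem size
for dtype in (np.float32, np.float64):
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    a @ b                                 # warm-up
    t0 = time.perf_counter()
    a @ b
    print(f"{np.dtype(dtype).name}: {time.perf_counter() - t0:.3f} s")
```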
When you scale up the number of threads for MKL/*gemm, keep thread placement in mind as well.
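A scaling sweep could look like the sketch below; it assumes the third-party threadpoolctl package to change the MKL thread count at runtime (alternatively, set MKL_NUM_THREADS before starting the process), and the thread counts and matrix size are just examples:

```python
# Sketch: measure GEMM runtime for increasing thread counts.
# Assumes threadpoolctl is installed; run with KMP_AFFINITY set so the
# thread placement stays deterministic across the sweep.
import time
import numpy as np
from threadpoolctl import threadpool_limits

n = 4096
a = np.random.rand(n, n)
b = np.random.rand(n, n)

for nthreads in (1, 2, 4, 7, 14, 28):     # example counts for a 28-core node
    with threadpool_limits(limits=nthreads):
        a @ b                             # warm-up at this thread count
        t0 = time.perf_counter()
        a @ b
        print(f"{nthreads:2d} threads: {time.perf_counter() - t0:.3f} s")
```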
I don't think there is a really simple way to measure how much your application is affected by bad scheduling. You can expose it with perf trace -e sched:sched_switch, and there is software to visualize the results, but it comes with a steep learning curve. And then again - for parallel performance analysis you should have the threads pinned anyway.