I notice that an answer covering a fairly new solution is missing. If the code uses NumPy, I would advise trying Pythran:
http://pythran.readthedocs.io/
For the functions I tried, Pythran gives extremely good results. The resulting functions are as fast as well-written Fortran code (or only slightly slower) and a little faster than the (quite optimized) Cython solution.
The advantage over Cython is that you simply run Pythran on the Python function as already written for NumPy: you do not have to expand the array operations into explicit loops or add types for every variable in those loops. Pythran takes its time to analyse the code, so it understands the operations on numpy.ndarray.
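To illustrate, here is a minimal sketch of what "using Pythran on the NumPy-style function" looks like. The function name rms_diff and the file name rms.py are made up for this example; the only Pythran-specific part is the export comment giving the argument types.

```python
# rms.py (hypothetical file name)
import numpy as np

# Export annotation read by Pythran; plain CPython treats it as a comment.
#pythran export rms_diff(float64[], float64[])
def rms_diff(a, b):
    """Root-mean-square difference of two 1-D float arrays."""
    # Vectorized NumPy expression: no explicit loop, no per-variable types.
    return np.sqrt(np.mean((a - b) ** 2))
```

Running `pythran rms.py` should then build a native extension module that is imported in place of the pure-Python file. Without compiling, the same file still runs under CPython and NumPy, which is handy for testing.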
It is also a huge advantage over Numba and other projects based on just-in-time compilation, for which (to my knowledge) you have to expand the loops to be really efficient. And then the code with the explicit loops becomes very inefficient when run with only CPython and NumPy...
A drawback of Pythran: no classes! But since only the functions that really need optimizing have to be compiled, this is not very annoying.
Another point: Pythran supports OpenMP parallelism well (and very easily). But I don't think mpi4py is supported...
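As a rough sketch of the OpenMP side: directives are written as `#omp` comments inside the function, so the file remains valid Python. The function dot below is a made-up example, and I am assuming the module is built with something like `pythran -fopenmp`; check the Pythran documentation for the exact options on your platform.

```python
# Export annotation for Pythran; ignored by plain CPython.
#pythran export dot(float list, float list)
def dot(xs, ys):
    """Dot product of two equal-length lists of floats."""
    total = 0.0
    # OpenMP directive as a comment: parallel loop with a sum reduction.
    #omp parallel for reduction(+:total)
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total
```

Under CPython the directive is just a comment and the loop runs serially, so the result can be checked before compiling.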