Basically I have a problem that is pretty much embarrassingly parallel and I think I've hit the limits of how fast I can make it with plain Python & multiprocessing so I'm
This question is from 3 years ago, and Cython now provides functions backed by OpenMP. See for example the documentation here. One very convenient function is prange. This is one example of how a (rather naive) dot function could be implemented using prange.
Don't forget to pass the OpenMP flag to the C compiler when building: "/openmp" for MSVC, or "-fopenmp" for GCC.
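One way to pass that flag is through the build script. The following is a minimal sketch, not the original poster's build setup: the file name mydot.pyx, the module name mydot, and the helpers openmp_flags and make_extension are assumptions made for illustration. It also assumes MSVC on Windows and a GCC-compatible compiler elsewhere; other toolchains may need different flags.

```python
# setup.py (sketch): build a Cython module with OpenMP enabled.
import sys


def openmp_flags(platform=sys.platform):
    """Return (compile_args, link_args) that switch on OpenMP."""
    if platform.startswith("win"):
        # MSVC: the switch is /openmp and no extra linker flag is needed.
        return ["/openmp"], []
    # GCC-compatible compilers take -fopenmp when compiling and when linking.
    return ["-fopenmp"], ["-fopenmp"]


def make_extension(name="mydot", source="mydot.pyx"):
    """Describe the extension module (names are hypothetical)."""
    # Imported lazily so openmp_flags() can be used without a build toolchain.
    import numpy
    from setuptools import Extension

    compile_args, link_args = openmp_flags()
    return Extension(
        name,
        [source],
        include_dirs=[numpy.get_include()],  # needed because the .pyx cimports numpy
        extra_compile_args=compile_args,
        extra_link_args=link_args,
    )


# In a real setup.py you would finish with:
#   from setuptools import setup
#   from Cython.Build import cythonize
#   setup(ext_modules=cythonize([make_extension()]))
```

With those last lines uncommented, the module would typically be built with "python setup.py build_ext --inplace". The Cython source for the example (saved here under the assumed name mydot.pyx) is: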
import numpy as np
cimport numpy as np
import cython
from cython.parallel import prange

ctypedef np.double_t cDOUBLE
DOUBLE = np.float64

def mydot(np.ndarray[cDOUBLE, ndim=2] a, np.ndarray[cDOUBLE, ndim=2] b):
    cdef np.ndarray[cDOUBLE, ndim=2] c
    cdef int i, M, N, K

    c = np.zeros((a.shape[0], b.shape[1]), dtype=DOUBLE)
    M = a.shape[0]
    N = a.shape[1]
    K = b.shape[1]

    # Each row of c is independent, so the rows are split across OpenMP threads.
    for i in prange(M, nogil=True):
        multiply(&a[i,0], &b[0,0], &c[i,0], N, K)

    return c

@cython.wraparound(False)
@cython.boundscheck(False)
@cython.nonecheck(False)
cdef void multiply(double *a, double *b, double *c, int N, int K) nogil:
    # Computes one output row: c[k] += a[j] * b[j, k] for every j,
    # with b addressed as a flat row-major (C-contiguous) buffer.
    cdef int j, k
    for j in range(N):
        for k in range(K):
            c[k] += a[j]*b[k+j*K]
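To make the pointer arithmetic above easier to follow, here is a pure-Python sketch of the same row-by-row scheme using nested lists. It is only an illustration of what each prange iteration computes, not a replacement for the compiled code; mydot_reference is a name invented for this sketch. It also makes one assumption of the Cython version visible: b is read as a flat row-major buffer, so the arrays passed to mydot should be C-contiguous.

```python
def multiply(a_row, b_flat, c_row, N, K):
    """Accumulate one output row: c_row[k] += a_row[j] * b[j][k]."""
    for j in range(N):
        for k in range(K):
            # b[j][k] lives at offset k + j*K in a row-major flat buffer.
            c_row[k] += a_row[j] * b_flat[k + j * K]


def mydot_reference(a, b):
    """Matrix product of two nested lists, one output row at a time."""
    M, N, K = len(a), len(b), len(b[0])
    b_flat = [x for row in b for x in row]  # row-major flattening of b
    c = [[0.0] * K for _ in range(M)]
    # In the Cython version this loop is the prange: no iteration touches
    # another iteration's output row, so they can run on separate threads.
    for i in range(M):
        multiply(a[i], b_flat, c[i], N, K)
    return c
```

For example, mydot_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]]) gives [[19.0, 22.0], [43.0, 50.0]].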
This YouTube talk by Stefan Behnel, one of the core developers of Cython, gives an excellent intro. Multithreading a loop is covered in the last 30 minutes (the prange section). The code is a zipped set of IPython notebooks downloadable here.
In short: write your optimized unthreaded code, add Cython types, then multithread by replacing range with prange and releasing the GIL.
I have no experience with OpenMP, but you may have luck with ZeroMQ (Python bindings included):
easy_install pyzmq
If somebody stumbles over this question:
There is now direct support for OpenMP in Cython via the cython.parallel module, see http://docs.cython.org/src/userguide/parallelism.html
According to the Cython wiki, the developers have thought about a variety of options, but I don't believe they have implemented anything yet.
If your problem is embarrassingly parallel, and you already have a multiprocessing solution, why not just have each worker process call some Cython code instead of Python code?
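A rough sketch of that arrangement, assuming the work can be split into contiguous row ranges: split_rows, process_chunk and run are names made up for this example, and the body of process_chunk is a stand-in for a call into your own compiled Cython module (for instance a hypothetical fastmodule.process(start, stop)).

```python
from multiprocessing import Pool


def split_rows(n_rows, n_workers):
    """Split range(n_rows) into at most n_workers contiguous (start, stop) chunks."""
    step = max(1, -(-n_rows // n_workers))  # ceiling division, at least 1
    return [(start, min(start + step, n_rows)) for start in range(0, n_rows, step)]


def process_chunk(bounds):
    """Placeholder worker: this is where the compiled Cython function would be called."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def run(n_rows, n_workers=4):
    """Farm the chunks out to worker processes and combine the partial results."""
    with Pool(n_workers) as pool:
        return sum(pool.map(process_chunk, split_rows(n_rows, n_workers)))
```

In a script, call run(...) from under an if __name__ == "__main__": guard so that worker processes can import the module safely. The multiprocessing layer stays as it is; only the per-chunk work moves into compiled code, so no OpenMP flags or GIL handling are needed.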