Minimize overhead in Python multiprocessing.Pool with numpy/scipy

余生分开走 2021-02-04 11:49

I've spent several hours on different attempts to parallelize my number-crunching code, but it only gets slower when I do so. Unfortunately, the problem disappears when I try t

1 answer
  • 2021-02-04 12:35

    Try to reduce interprocess communication. In the multiprocessing module, all (single-computer) interprocess communication is done through Queues, and objects passed through a Queue are pickled. So try to send fewer and/or smaller objects through the Queue.

    • Do not send self, the instance of BigData, through the Queue. It is rather big, and gets bigger as the amount of data in self grows:

      In [6]: import pickle
      In [14]: len(pickle.dumps(BigData(50)))
      Out[14]: 1052187
      

      Every time pool.apply_async(_do_chunk_wrapper, (self, k, xi, yi)) is called, self is pickled in the main process and unpickled in the worker process. The value of len(pickle.dumps(BigData(N))) grows as N increases.

    • Let the data be read from a global variable. On Linux, you can take advantage of Copy-on-Write. As Jan-Philip Gehrcke explains:

      After fork(), parent and child are in an equivalent state. It would be stupid to copy the entire memory of the parent to another place in the RAM. That's [where] the copy-on-write principle [comes] in. As long as the child does not change its memory state, it actually accesses the parent's memory. Only upon modification, the corresponding bits and pieces are copied into the memory space of the child.

      Thus, you can avoid passing instances of BigData through the Queue by simply defining the instance as a global, bd = BigData(n), (as you are already doing) and referring to its values in the worker processes (e.g. _do_chunk_wrapper). It basically amounts to removing self from the call to pool.apply_async:

      p = pool.apply_async(_do_chunk_wrapper, (k_start, k_end, xi, yi))
      

      and accessing bd as a global, and making the necessary attendant changes to _do_chunk_wrapper's call signature.

    • Try to pass longer-running functions, func, to pool.apply_async. If you have many quickly-completing calls to pool.apply_async then the overhead of passing arguments and return values through the Queue becomes a significant part of the overall time. If instead you make fewer calls to pool.apply_async and give each func more work to do before returning a result, then interprocess communication becomes a smaller fraction of the overall time.

      Below, I modified _do_chunk_wrapper to accept k_start and k_end arguments, so that each call to pool.apply_async would compute the sum for many values of k before returning a result.


    import math
    import numpy as np
    import time
    import sys
    import multiprocessing as mp
    import scipy.interpolate as interpolate
    
    _tm=0
    def stopwatch(msg=''):
        tm = time.time()
        global _tm
        if _tm==0: _tm = tm; return
        print("%s: %.2f seconds" % (msg, tm-_tm))
        _tm = tm
    
    class BigData:
        def __init__(self, n):
            z = np.random.uniform(size=n*n*n).reshape((n,n,n))
            self.ff = []
            for i in range(n):
                f = interpolate.RectBivariateSpline(
                    np.arange(n), np.arange(n), z[i], kx=1, ky=1)
                self.ff.append(f)
            self.n = n
    
        def do_chunk(self, k, xi, yi):
            s = np.sum(np.exp(self.ff[k].ev(xi, yi)))
            sys.stderr.write(".")
            return s
    
        def do_chunk_of_chunks(self, k_start, k_end, xi, yi):
            s = sum(np.sum(np.exp(self.ff[k].ev(xi, yi)))
                        for k in range(k_start, k_end))
            sys.stderr.write(".")
            return s
    
        def do_multi(self, numproc, xi, yi):
            procs = []
            pool = mp.Pool(numproc)
            stopwatch('\nPool setup')
            ks = list(map(int, np.linspace(0, self.n, numproc+1)))
            for i in range(len(ks)-1):
                k_start, k_end = ks[i:i+2]
                p = pool.apply_async(_do_chunk_wrapper, (k_start, k_end, xi, yi))
                procs.append(p)
            stopwatch('Jobs queued (%d processes)' % numproc)
            total = 0.0
            for k, p in enumerate(procs):
                total += np.sum(p.get(timeout=30)) # timeout allows ctrl-C interrupt
                if k == 0: stopwatch("\nFirst get() done")
            print(total)
            stopwatch('Jobs done')
            pool.close()
            pool.join()
            return total
    
        def do_single(self, xi, yi):
            total = 0.0
            for k in range(self.n):
                total += self.do_chunk(k, xi, yi)
            stopwatch('\nAll in single process')
            return total
    
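    def _do_chunk_wrapper(k_start, k_end, xi, yi):
        # bd is the module-level BigData instance created under __main__;
        # forked workers inherit it, so it is never pickled.
        return bd.do_chunk_of_chunks(k_start, k_end, xi, yi)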
    
    if __name__ == "__main__":
        stopwatch()
        n = 50
        bd = BigData(n)
        m = 1000*1000
        xi, yi = np.random.uniform(0, n, size=m*2).reshape((2,m))
        stopwatch('Initialized')
        bd.do_multi(2, xi, yi)
        bd.do_multi(3, xi, yi)
        bd.do_single(xi, yi)
    

    yields

    Initialized: 0.15 seconds
    
    Pool setup: 0.06 seconds
    Jobs queued (2 processes): 0.00 seconds
    
    First get() done: 6.56 seconds
    83963796.0404
    Jobs done: 0.55 seconds
    ..
    Pool setup: 0.08 seconds
    Jobs queued (3 processes): 0.00 seconds
    
    First get() done: 5.19 seconds
    83963796.0404
    Jobs done: 1.57 seconds
    ...
    All in single process: 12.13 seconds
    

    compared to the original code:

    Initialized: 0.10 seconds
    Pool setup: 0.03 seconds
    Jobs queued (2 processes): 0.00 seconds
    
    First get() done: 10.47 seconds
    Jobs done: 0.00 seconds
    ..................................................
    Pool setup: 0.12 seconds
    Jobs queued (3 processes): 0.00 seconds
    
    First get() done: 9.21 seconds
    Jobs done: 0.00 seconds
    ..................................................
    All in single process: 12.12 seconds
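
To make the first point above concrete, here is a minimal sketch of how one might measure what a task argument costs to pickle. Payload is a hypothetical stand-in for BigData (plain nested lists instead of spline objects), and pickled_size is a helper name invented for this example; only pickle.dumps comes from the standard library. It compares the pickled size of a growing object with the size of a small (k_start, k_end) pair like the one sent in the modified code.

```python
import pickle


class Payload:
    """Hypothetical stand-in for BigData: holds n rows of n floats."""

    def __init__(self, n):
        self.n = n
        self.rows = [[float(i * n + j) for j in range(n)] for i in range(n)]


def pickled_size(obj):
    """Return the number of bytes pickle produces for obj."""
    return len(pickle.dumps(obj))


# Size of the whole object for a few values of n.
sizes = {n: pickled_size(Payload(n)) for n in (10, 20, 40)}

# Size of a small index pair, e.g. (k_start, k_end).
args_only = pickled_size((0, 25))

for n, size in sizes.items():
    print("Payload(%d): %d bytes" % (n, size))
print("(k_start, k_end) tuple: %d bytes" % args_only)
```

The exact byte counts depend on the Python version and pickle protocol, so none are quoted here. The point is the trend: the object's pickled size grows with n, while the index pair stays tiny, which is why sending indices and reading the data from a global tends to be cheaper.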
    