Can't pickle when using multiprocessing Pool.map()

醉梦人生 2020-11-22 00:19

I'm trying to use multiprocessing's Pool.map() function to divide out work simultaneously. When I use the following code, it works fine:

12 Answers
  • 2020-11-22 00:55

    Steven Bethard's solution has some limitations, though:

    When you register your class method as a function, the destructor of your class is, surprisingly, called every time a method call finishes processing. So if a single instance of your class calls its method n times, members may disappear between two runs, and you may get a message such as malloc: *** error for object 0x...: pointer being freed was not allocated (e.g. for an open member file) or pure virtual method called, terminate called without an active exception (which means that the lifetime of a member object I used was shorter than I thought). I got this when n was greater than the pool size. Here is a short example:

    from multiprocessing import Pool, cpu_count
    from multiprocessing.pool import ApplyResult
    
    # --------- see Steven's solution above -------------
    from copy_reg import pickle
    from types import MethodType
    
    def _pickle_method(method):
        func_name = method.im_func.__name__
        obj = method.im_self
        cls = method.im_class
        return _unpickle_method, (func_name, obj, cls)
    
    def _unpickle_method(func_name, obj, cls):
        for cls in cls.mro():
            try:
                func = cls.__dict__[func_name]
            except KeyError:
                pass
            else:
                break
        return func.__get__(obj, cls)
    
    
    class Myclass(object):
    
        def __init__(self, nobj, workers=cpu_count()):
    
            print "Constructor ..."
            # multi-processing
            pool = Pool(processes=workers)
            async_results = [ pool.apply_async(self.process_obj, (i,)) for i in range(nobj) ]
            pool.close()
            # waiting for all results
            map(ApplyResult.wait, async_results)
            lst_results=[r.get() for r in async_results]
            print lst_results
    
        def __del__(self):
            print "... Destructor"
    
        def process_obj(self, index):
            print "object %d" % index
            return "results"
    
    pickle(MethodType, _pickle_method, _unpickle_method)
    Myclass(nobj=8, workers=3)
    # problem !!! the destructor is called nobj times (instead of once)
    

    Output:

    Constructor ...
    object 0
    object 1
    object 2
    ... Destructor
    object 3
    ... Destructor
    object 4
    ... Destructor
    object 5
    ... Destructor
    object 6
    ... Destructor
    object 7
    ... Destructor
    ... Destructor
    ... Destructor
    ['results', 'results', 'results', 'results', 'results', 'results', 'results', 'results']
    ... Destructor
    

    The __call__ approach is not equivalent either: with the code below, the results come back as [None, ...] (note that __call__ as written does not return the value of process_obj):

    from multiprocessing import Pool, cpu_count
    from multiprocessing.pool import ApplyResult
    
    class Myclass(object):
    
        def __init__(self, nobj, workers=cpu_count()):
    
            print "Constructor ..."
            # multiprocessing
            pool = Pool(processes=workers)
            async_results = [ pool.apply_async(self, (i,)) for i in range(nobj) ]
            pool.close()
            # waiting for all results
            map(ApplyResult.wait, async_results)
            lst_results=[r.get() for r in async_results]
            print lst_results
    
        def __call__(self, i):
            self.process_obj(i)
    
        def __del__(self):
            print "... Destructor"
    
        def process_obj(self, i):
            print "obj %d" % i
            return "result"
    
    Myclass(nobj=8, workers=3)
    # problem !!! the destructor is called nobj times (instead of once), 
    # **and** results are empty !
    

    So neither method is satisfactory...
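
One likely explanation for the repeated destructor calls is that every task carries a pickled copy of the instance, which is rebuilt in the worker and destroyed when the task finishes. The sketch below reproduces that effect without a pool, by round-tripping a bound method through pickle. It assumes Python 3 (where bound methods pickle natively) and CPython's reference counting; the Counter class and its names are hypothetical.

```python
import pickle


class Counter(object):
    """Hypothetical class that tallies how often __del__ runs."""

    deleted = 0  # class-level count of destructor calls

    def __del__(self):
        Counter.deleted += 1

    def work(self, i):
        return i * 2


original = Counter()

# Round-trip the bound method, as a pool would do for each submitted task.
payload = pickle.dumps(original.work)
clone_method = pickle.loads(payload)

# The unpickled method is bound to a fresh copy, not to `original`.
is_copy = clone_method.__self__ is not original
result = clone_method(21)

# Dropping the last reference to the copy runs its destructor
# (immediately under CPython's reference counting).
del clone_method
```

If this is what happens in the pool, a destructor that releases shared resources (files, native handles) will run once per task on a copy, which would match the output shown above.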

  • 2020-11-22 00:57

    All of these solutions are ugly because multiprocessing and pickling are broken and limited unless you jump outside the standard library.

    If you use a fork of multiprocessing called pathos.multiprocessing, you can directly use classes and class methods in multiprocessing's map functions. This is because dill is used instead of pickle or cPickle, and dill can serialize almost anything in Python.

    pathos.multiprocessing also provides an asynchronous map function… and it can map functions with multiple arguments (e.g. map(math.pow, [1,2,3], [4,5,6]))

    See: What can multiprocessing and dill do together?

    and: http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

    >>> import pathos.pools as pp
    >>> p = pp.ProcessPool(4)
    >>> 
    >>> def add(x,y):
    ...   return x+y
    ... 
    >>> x = [0,1,2,3]
    >>> y = [4,5,6,7]
    >>> 
    >>> p.map(add, x, y)
    [4, 6, 8, 10]
    >>> 
    >>> class Test(object):
    ...   def plus(self, x, y): 
    ...     return x+y
    ... 
    >>> t = Test()
    >>> 
    >>> p.map(Test.plus, [t]*4, x, y)
    [4, 6, 8, 10]
    >>> 
    >>> p.map(t.plus, x, y)
    [4, 6, 8, 10]
    

    And just to be explicit, you can do exactly what you wanted to do in the first place, and you can do it from the interpreter if you want to.

    >>> import pathos.pools as pp
    >>> class someClass(object):
    ...   def __init__(self):
    ...     pass
    ...   def f(self, x):
    ...     return x*x
    ...   def go(self):
    ...     pool = pp.ProcessPool(4)
    ...     print pool.map(self.f, range(10))
    ... 
    >>> sc = someClass()
    >>> sc.go()
    [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    >>> 
    

    Get the code here: https://github.com/uqfoundation/pathos

  • 2020-11-22 00:59

    You could also define a __call__() method inside your someClass(), which calls someClass.go(), and then pass an instance of someClass() to the pool. This object is picklable and it works fine (for me)...

  • The solution from parisjohn above works fine for me, and the code looks clean and easy to understand. In my case there are several functions to call using Pool, so I modified parisjohn's code a bit below. I made __call__ able to dispatch to several functions; the function names are passed in the argument dict from go():

    from multiprocessing import Pool
    class someClass(object):
        def __init__(self):
            pass
    
        def f(self, x):
            return x*x
    
        def g(self, x):
            return x*x+1    
    
        def go(self):
            p = Pool(4)
            sc = p.map(self, [{"func": "f", "v": 1}, {"func": "g", "v": 2}])
            print sc
    
        def __call__(self, x):
            if x["func"]=="f":
                return self.f(x["v"])
            if x["func"]=="g":
                return self.g(x["v"])        
    
    sc = someClass()
    sc.go()
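
As a variation on the dispatch above, the if-chain in __call__ could be replaced by a name lookup with getattr. This is only a sketch: the _tasks whitelist and the ValueError are assumptions added for illustration, and the tasks are run directly rather than through a pool so that the dispatch logic can be checked in isolation.

```python
class SomeClass(object):
    # Hypothetical whitelist of method names that __call__ may dispatch to.
    _tasks = ("f", "g")

    def f(self, x):
        return x * x

    def g(self, x):
        return x * x + 1

    def __call__(self, task):
        name, value = task["func"], task["v"]
        if name not in self._tasks:
            raise ValueError("unknown task: %r" % (name,))
        # Look the method up by name instead of testing each name in turn.
        return getattr(self, name)(value)


sc = SomeClass()
tasks = [{"func": "f", "v": 1}, {"func": "g", "v": 2}]
results = [sc(task) for task in tasks]
```

The same instance can be handed to Pool.map(sc, tasks) as in go() above; adding a new task then only requires a new method and a whitelist entry.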
    
  • 2020-11-22 01:05

    The problem is that multiprocessing must pickle things to sling them among processes, and bound methods are not picklable. The workaround (whether you consider it "easy" or not;-) is to add the infrastructure to your program to allow such methods to be pickled, registering it with the copy_reg standard library module.

    For example, Steven Bethard's contribution to this thread (towards the end of the thread) shows one perfectly workable approach to allow method pickling/unpickling via copy_reg.
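
For readers on Python 3, here is a minimal sketch of the same registration idea. It assumes Python 3, where the module is named copyreg and bound methods already pickle by default, so the registration is purely illustrative. The reduce_method helper is hypothetical, and Fraction is used only because it is a picklable standard library class with ordinary Python methods.

```python
import copyreg
import pickle
import types
from fractions import Fraction


def reduce_method(method):
    # Describe a bound method as "getattr(instance, name)" so that
    # unpickling rebuilds it from the unpickled instance.
    return getattr, (method.__self__, method.__func__.__name__)


# Register the reducer for all bound methods. This changes pickling
# behavior process-wide, so treat it as a demonstration only.
copyreg.pickle(types.MethodType, reduce_method)

half = Fraction(1, 2)
restored = pickle.loads(pickle.dumps(half.limit_denominator))
value = restored(10)
```

The restored method is bound to an equal but distinct Fraction, which is the same copy-per-pickle behavior that any method-pickling scheme implies.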

  • 2020-11-22 01:05

    A potentially trivial solution to this is to switch to using multiprocessing.dummy. This is a thread based implementation of the multiprocessing interface that doesn't seem to have this problem in Python 2.7. I don't have a lot of experience here, but this quick import change allowed me to call apply_async on a class method.

    A few good resources on multiprocessing.dummy:

    https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.dummy

    http://chriskiehl.com/article/parallelism-in-one-line/
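
A minimal sketch of that import change, assuming a class shaped like the someClass used elsewhere in this thread (the names here are illustrative). Because multiprocessing.dummy uses threads in a single process, nothing is pickled and the bound method can be passed directly.

```python
from multiprocessing.dummy import Pool as ThreadPool


class SomeClass(object):
    def f(self, x):
        return x * x

    def go(self):
        # Thread-backed pool with the same interface as multiprocessing.Pool.
        pool = ThreadPool(4)
        try:
            return pool.map(self.f, range(10))
        finally:
            pool.close()
            pool.join()


squares = SomeClass().go()
```

Keep in mind that threads in standard CPython share the global interpreter lock, so this mainly helps I/O-bound work rather than CPU-bound work.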
