Spark can't pickle method_descriptor

北恋 2021-01-05 09:37

I get this weird error message:

15/01/26 13:05:12 INFO spark.SparkContext: Created broadcast 0 from wholeTextFiles at NativeMethodAccessorImpl.java:-2
Traceba         


        
2 Answers
  • 2021-01-05 10:16

    If this really is a failure to pickle a MethodDescriptorType, you can register a function that tells the pickler how to handle that type:

    def _getattr(objclass, name, repr_str):
        # hack to grab the reference directly: the repr looks like
        # "<method 'mro' of 'type' objects>", so field 3 is the class name
        try:
            attr = repr_str.split("'")[3]
            return eval(attr+'.__dict__["'+name+'"]')
        except Exception:
            # fall back to a normal attribute lookup on the owning class
            attr = getattr(objclass,name)
            if name == '__dict__':
                attr = attr[name]
            return attr
    
    
    def save_wrapper_descriptor(pickler, obj):
        # pickle the descriptor as a call to _getattr(owner, name, repr)
        pickler.save_reduce(_getattr, (obj.__objclass__, obj.__name__,
                                       obj.__repr__()), obj=obj)
        return
    
    # register the following "type" with:
    #     Pickler.dispatch[MethodDescriptorType] = save_wrapper_descriptor
    MethodDescriptorType = type(type.__dict__['mro'])
    

    Then, if you register the above in the pickling dispatch table that Spark uses (as shown in the comment above, or with copy_reg, which is named copyreg in Python 3), it may get past the pickling error.
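
    The copy_reg route mentioned above can be sketched as follows. This is a minimal illustration for Python 3 (module name copyreg) using only the standard pickle module; the reducer name reduce_method_descriptor is hypothetical, and whether Spark's own serializer consults this table is an assumption that would need checking against the Spark version in use.

```python
import copyreg
import pickle

# The type of C-level methods such as str.upper (same type as type.__dict__['mro']).
MethodDescriptorType = type(str.upper)


def reduce_method_descriptor(descr):
    # Hypothetical reducer: on load, rebuild the descriptor as
    # getattr(owning_class, method_name).
    return getattr, (descr.__objclass__, descr.__name__)


# Register the reducer in the global dispatch table used by the pickle module.
copyreg.pickle(MethodDescriptorType, reduce_method_descriptor)

# Round-trip a method descriptor through pickle.
restored = pickle.loads(pickle.dumps(str.upper))
```

    After the round trip, restored can be called like the original, for example restored("abc").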

  • 2021-01-05 10:18

    Spark tries to serialize the connection object so that it can be used inside the executors. This is bound to fail, because a deserialized database connection cannot carry its read/write access into another scope, let alone another machine. You can reproduce the problem by trying to broadcast the connection object. In this case, the failure came from serializing an I/O object.
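
    The same kind of failure can be seen without Spark. The sketch below uses an in-memory sqlite3 connection purely as a stand-in for the real database connection (an assumption, since the original database is not named) and tries to pickle it directly.

```python
import pickle
import sqlite3

# Stand-in for the real database connection (assumption: any live
# connection object behaves similarly when pickled).
conn = sqlite3.connect(":memory:")

try:
    pickle.dumps(conn)
    pickled_ok = True
except (TypeError, pickle.PicklingError) as exc:
    # Live connections wrap OS-level resources and refuse to be pickled.
    pickled_ok = False
    print("cannot serialize connection:", exc)
finally:
    conn.close()
```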

    The problem was partly solved by connecting to the database inside the map functions. Because that opens one connection per RDD element, I switched to per-partition processing, which cut the number of database connections from about 20k to roughly 8-64 (depending on the number of partitions). Spark developers should consider providing an initialization function or script for the executors to avoid this kind of dead end.
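
    The per-partition approach can be sketched like this. It is an illustration, not the original code: sqlite3 stands in for the real database, open_connection and process_partition are hypothetical names, and two plain lists stand in for RDD partitions so the sketch runs without a cluster. With PySpark the function would be passed to rdd.mapPartitions.

```python
import sqlite3

connections_opened = 0  # illustration only: counts connections in this process


def open_connection():
    """Hypothetical stand-in for the real database connect call."""
    global connections_opened
    connections_opened += 1
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seen (value INTEGER)")
    return conn


def process_partition(rows):
    """Open one connection, reuse it for every row in the partition, then close it."""
    conn = open_connection()
    try:
        for row in rows:
            conn.execute("INSERT INTO seen VALUES (?)", (row,))
            yield row * 2
        conn.commit()
    finally:
        conn.close()


# With Spark this would be: rdd.mapPartitions(process_partition)
# Here two plain lists stand in for two partitions.
partitions = [[1, 2, 3], [4, 5]]
results = [out for part in partitions for out in process_partition(part)]
```

    Because the connection is created inside the function that runs on the executor, nothing connection-related has to be serialized on the driver.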

    Suppose such an init function were executed on every node. Each node would then connect to the database (through a connection pool, or separate ZooKeeper nodes), and because the init function and the map functions would share the same scope, the problem would disappear and the code would be faster than with the workaround I found. At the end of the execution, Spark would free these variables and the program would end.
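
    Until such a hook exists, a rough approximation of "init once, share the scope" is a lazily created module-level connection. This is a sketch under assumptions: get_connection and lookup are hypothetical names, sqlite3 stands in for the real database, and how long a Python worker process lives between tasks depends on the Spark configuration.

```python
import sqlite3

_conn = None  # one connection per Python process, created on first use


def get_connection():
    """Return the process-wide connection, creating it the first time."""
    global _conn
    if _conn is None:
        _conn = sqlite3.connect(":memory:")
    return _conn


def lookup(x):
    # A function like this could be passed to rdd.map(...); every call in the
    # same process reuses the connection instead of opening a new one.
    conn = get_connection()
    return conn.execute("SELECT ? + 1", (x,)).fetchone()[0]


values = [lookup(x) for x in range(3)]
```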
