I get this weird error message:
15/01/26 13:05:12 INFO spark.SparkContext: Created broadcast 0 from wholeTextFiles at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last): ...
If it's really a pickling issue for a MethodDescriptorType, you could register how to pickle that type with something like this:
from pickle import Pickler

# the "type" that is failing to pickle
MethodDescriptorType = type(type.__dict__['mro'])

def _getattr(objclass, name, repr_str):
    # hack to grab the reference directly
    try:
        attr = repr_str.split("'")[3]
        return eval(attr + '.__dict__["' + name + '"]')
    except Exception:
        attr = getattr(objclass, name)
        if name == '__dict__':
            attr = attr[name]
        return attr

def save_wrapper_descriptor(pickler, obj):
    # reduce the descriptor to a lookup that is re-evaluated on unpickling
    pickler.save_reduce(_getattr, (obj.__objclass__, obj.__name__,
                                   obj.__repr__()), obj=obj)
    return

# register the above for MethodDescriptorType with:
# Pickler.dispatch[MethodDescriptorType] = save_wrapper_descriptor
Then, if you register the above in the pickling dispatch table that Spark uses (as shown above, or with copy_reg), it may get past the pickling error.
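For reference, a registration through the copy_reg / copyreg route might look like the sketch below. The reducer simply looks the descriptor up again by name on its defining class, which is a simpler (and less general) reduction than the repr-based hack above; whether Spark's serializer actually consults the copyreg table depends on which pickler it is configured with, so treat this as an assumption to verify:

try:
    import copy_reg as copyreg  # Python 2
except ImportError:
    import copyreg              # Python 3

MethodDescriptorType = type(type.__dict__['mro'])

def _reduce_method_descriptor(obj):
    # recreate the descriptor on unpickling by fetching it from its class
    return getattr, (obj.__objclass__, obj.__name__)

copyreg.pickle(MethodDescriptorType, _reduce_method_descriptor)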
Spark tries to serialize the connection object so it can be used inside the executors, and that is bound to fail: a deserialized database connection object cannot carry its read/write permissions into another scope (or onto another machine). The problem can be reproduced by simply trying to broadcast the connection object; in this instance the failure came from serializing an I/O object.
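A minimal reproduction might look like the following, using sqlite3 purely as a stand-in for whatever client library is actually involved (the connection wraps file/socket handles, so pickling fails the moment sc.broadcast tries to serialize it):

import sqlite3
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-connection-repro")
conn = sqlite3.connect("example.db")   # any live connection object will do

# raises a pickling error: the connection holds unpicklable handles
broadcast_conn = sc.broadcast(conn)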
The problem was partly solved by connecting to the database inside the map function. Since that opens one connection per RDD element, I had to switch to per-partition processing, which cut the number of database connections from about 20k down to roughly 8-64 (depending on the number of partitions). Spark developers should consider providing an initialization function/script for the executors to avoid this kind of dead end.
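A sketch of that per-partition workaround (rdd, the results table and sqlite3 are stand-ins for the real data and client):

import sqlite3

def write_partition(rows):
    # one connection per partition instead of one per element
    conn = sqlite3.connect("example.db")
    try:
        for row in rows:
            conn.execute("INSERT INTO results VALUES (?)", (row,))
        conn.commit()
    finally:
        conn.close()
    return iter([])  # mapPartitions expects an iterator back

rdd.mapPartitions(write_partition).count()  # count() just forces evaluation

When the work is purely a side effect, foreachPartition is arguably the better fit, since it skips building a result RDD.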
Suppose such an init function were executed on every node: each node would then hold its own connection (to a connection pool, or to separate ZooKeeper nodes), the init function and the map functions would share the same scope, the problem would be gone, and the code would be faster than the workaround I found. At the end of the execution Spark would free/unload those variables and the program would end.
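Something close to that behaviour can be approximated today with a lazily created module-level connection, assuming the helper lives in a module that is shipped to the executors (e.g. via --py-files) so the module global survives across tasks in the same worker process; sqlite3 and the module name are again only placeholders:

# db_helper.py -- hypothetical module shipped to the executors
import sqlite3

_conn = None

def get_connection():
    # created once per Python worker process, then reused by every task
    global _conn
    if _conn is None:
        _conn = sqlite3.connect("example.db")
    return _conn

# driver side
from db_helper import get_connection
rdd.map(lambda row: get_connection().execute("SELECT 1").fetchone()).collect()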