PySpark --py-files doesn't work

Backend · Unresolved · 7 answers · 1262 views
轻奢々 asked 2020-12-31 01:14

I am doing this as the documentation suggests: http://spark.apache.org/docs/1.1.1/submitting-applications.html

Spark version 1.1.0

./spark/bin/spark-submit --py-f         


        
7 Answers
  • 2020-12-31 01:43

    I was facing a similar problem: my worker nodes could not detect the modules even though I was using the --py-files switch.

    There were a couple of things I tried. First, I put the import statement after creating the SparkContext (sc) variable, hoping the import would take place after the module had shipped to all nodes, but it still did not work. I then called sc.addFile inside the script itself (instead of passing the module as a command-line argument) and afterwards imported the functions of the module. This did the trick, at least in my case.
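    The worker-side mechanics can be sketched without Spark at all: the module file has to land on the import path before the import statement runs, which is the ordering that sc.addFile followed by a later import arranges. A minimal stdlib-only sketch of that idea (the directory and the module name shipped_helpers are made up for illustration):

    ```python
    import importlib
    import os
    import sys
    import tempfile

    # Simulate "shipping" a module: write shipped_helpers.py into a fresh
    # directory that is not yet on sys.path (like a worker before the
    # dependency arrives). All names here are illustrative.
    shipped_dir = tempfile.mkdtemp()
    with open(os.path.join(shipped_dir, "shipped_helpers.py"), "w") as f:
        f.write("def double(x):\n    return 2 * x\n")

    # Only after the directory is on the path does the import succeed,
    # the same ordering sc.addFile / --py-files must arrange per worker.
    sys.path.insert(0, shipped_dir)
    shipped_helpers = importlib.import_module("shipped_helpers")
    print(shipped_helpers.double(21))  # prints 42
    ```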

  • 2020-12-31 01:49

    You need to package your Python code using tools like setuptools. This lets you create an .egg file, which is similar to a Java jar file. You can then specify the path of this egg file using --py-files:

    spark-submit --py-files path_to_egg_file path_to_spark_driver_file
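    A minimal setup.py for producing such an egg might look like the sketch below (the package name and version are placeholders; build the egg with `python setup.py bdist_egg`, which writes it under `dist/`):

    ```python
    # setup.py -- placeholder name/version; adjust for your project
    from setuptools import setup, find_packages

    setup(
        name="my_spark_deps",
        version="0.1",
        packages=find_packages(),
    )
    ```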

  • 2020-12-31 01:52

    Try to import your custom module from inside the method itself rather than at the top of the driver script, e.g.:

    def parse_record(record):
        import parser
        p = parser.parse(record)
        return p
    

    rather than

    import parser
    def parse_record(record):
        p = parser.parse(record)
        return p
    

    Cloudpickle doesn't seem to recognise when a custom module has been imported, so it apparently tries to pickle the top-level modules along with the other data needed to run the method. In my experience, this means the top-level modules appear to exist on the workers, but they lack usable members, and nested modules can't be used as expected. Once I either imported with from A import * or imported from inside the method (import A.B), the modules worked as expected.
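    The function-local import pattern is easy to try outside Spark; in this sketch the stdlib json module stands in for the custom parser module from the example above:

    ```python
    def parse_record(record):
        # Imported at call time, on the worker, after --py-files has
        # placed the dependency on the path. json stands in for the
        # hypothetical custom "parser" module.
        import json
        return json.loads(record)

    print(parse_record('{"id": 7}'))  # prints {'id': 7}
    ```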

  • 2020-12-31 01:55

    Try this method of SparkContext:

    sc.addPyFile(path)
    

    According to the PySpark documentation:

    Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

    Try uploading your Python module file to public cloud storage (e.g. AWS S3) and passing the URL to that method.

    Here is more comprehensive reading material: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_python.html

  • 2020-12-31 01:57

    PySpark on EMR is configured for Python 2.6 by default, so make sure your modules are not being installed for the Python 2.7 interpreter instead.

  • 2020-12-31 02:03

    Create a zip file (for example, abc.zip) containing all your dependencies.

    When creating the SparkContext, pass the zip file name:

        sc = SparkContext(conf=conf, pyFiles=["abc.zip"])
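    Workers can import straight out of such a zip because Python supports zip archives on sys.path (zipimport). A stdlib-only sketch of what happens after the zip is shipped; abc.zip and the module name depmod are illustrative:

    ```python
    import importlib
    import os
    import sys
    import tempfile
    import zipfile

    # Build a dependency zip, as you would before calling spark-submit.
    # The archive and module names here are made up for illustration.
    workdir = tempfile.mkdtemp()
    zip_path = os.path.join(workdir, "abc.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("depmod.py", "def greet(name):\n    return 'hello ' + name\n")

    # PySpark workers add shipped zips to sys.path; zip archives are
    # directly importable thanks to Python's zipimport support.
    sys.path.insert(0, zip_path)
    depmod = importlib.import_module("depmod")
    print(depmod.greet("spark"))  # prints hello spark
    ```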
    