I just set up an AWS EMR cluster (EMR version 5.18 with Spark 2.3.2). I SSH into the master machine, run spark-shell or pyspark, and get the following error:
In order to fix this issue, you can add a configuration in JSON format during EMR provisioning. We use a configuration like this:
{
  "Classification": "yarn-site",
  "Configurations": [],
  "Properties": {
    "spark.yarn.app.container.log.dir": "/var/log/hadoop-yarn"
  }
}
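One way to apply the snippet above is to save it to a file and pass it when creating the cluster; note that the --configurations file expects a JSON array, so wrap the object above in [ ... ]. The cluster name, instance settings, and file name below are placeholders:

aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-5.18.0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://log-dir-config.json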
If you look into the /etc/spark/conf/log4j.properties file, you'll find a new setup that rolls Spark Streaming logs hourly (probably as suggested here). The problem occurs because the ${spark.yarn.app.container.log.dir} system property is not set in the Spark driver process. The property is eventually set to YARN's container log directory, but that happens later (look here and here).
To fix this error in the Spark driver, add the following to your spark-submit or spark-shell command:
--driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'
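For example, on the master node the same flag works for both shells (pyspark passes it through to spark-submit as well):

spark-shell --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'
pyspark --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'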
Please note that the /mnt/var/log/hadoop/stderr and /mnt/var/log/hadoop/stdout files will be reused by all the (Spark Streaming) processes started on the same node.
We have also run into this issue and hope some AWS or Spark engineers are reading this. I've narrowed it down to the /etc/spark/conf/log4j.properties file and how the loggers are configured using the ${spark.yarn.app.container.log.dir} system property. That value evaluates to null, so the logging directory now resolves to /stdout and /stderr instead of the desired /mnt/var/log/hadoop-yarn/containers/<app_id>/<container_id>/(stdout|stderr), which is how it worked in EMR versions before 5.18.0.
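For reference, the offending section looks roughly like this (a sketch only; the exact appender settings are an assumption and may differ between EMR releases):

log4j.appender.DRFA-stderr=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA-stderr.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.DRFA-stderr.File=${spark.yarn.app.container.log.dir}/stderr

With the property unset, the File value collapses to /stderr (and likewise /stdout for the DRFA-stdout appender), which the process cannot create.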
Workaround #1 (not ideal): If you set that property to a static path which the hadoop user has access to, like /var/log/hadoop-yarn/stderr, things work fine. This probably breaks things like the history server and an unknown number of other things, but spark-shell and pyspark can start without errors.
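One way to set the property without editing files on every node (an untested assumption on our side) is to inject it through a spark-defaults classification at provisioning time, so the driver JVM starts with it already defined:

{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.driver.extraJavaOptions": "-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn"
  }
}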
UPDATE Workaround #2 (revert): Not sure why I didn't do this earlier, but comparing this to a 5.13 cluster, the DRFA-stderr and DRFA-stdout appenders did not exist there at all. If you comment those sections out, delete them, or simply copy the log4j.properties file from the template, this problem also goes away (again, with unknown impact on the rest of the services). I'm not sure where that section originated; the configs in the upstream master repo do not have those appenders, so it appears to be proprietary to the AWS distros.
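A quick way to do the "comment them out" variant on the master node is something like the following (the sed pattern is an assumption about how the appender lines are named; -i.bak keeps a backup copy of the original file):

sudo sed -i.bak -E 's/^(log4j\.appender\.DRFA-std(err|out).*)$/#\1/' /etc/spark/conf/log4j.properties

This prefixes every DRFA-stderr/DRFA-stdout appender definition with # and leaves the rest of the file untouched; if any logger lines still reference those appenders, remove those references as well.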