Can't get a SparkContext in new AWS EMR Cluster

名媛妹妹 2021-02-07 10:19

I just set up an AWS EMR cluster (EMR release 5.18 with Spark 2.3.2). I SSH into the master machine and run spark-shell or pyspark, and both fail with an error while trying to create the SparkContext.

3 Answers
  • 2021-02-07 10:32

    To fix this issue, you can add a configuration in JSON format when provisioning the EMR cluster. We use something like this:

    {
      "Classification": "yarn-site",
      "Configurations": [],
      "Properties": {
        "spark.yarn.app.container.log.dir": "/var/log/hadoop-yarn"
      }
    }
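
    If you launch the cluster with the AWS CLI, a configuration like the one above can be supplied at provisioning time. The sketch below is illustrative only: it assumes the classification is saved (wrapped in a JSON array) in a file named emr-configurations.json, and the release label, instance settings and roles are placeholders to adjust for your environment.

        aws emr create-cluster \
          --name "spark-logging-fix" \
          --release-label emr-5.18.0 \
          --applications Name=Spark \
          --instance-type m4.large \
          --instance-count 3 \
          --use-default-roles \
          --configurations file://./emr-configurations.json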
    
  • 2021-02-07 10:35

    If you look into the /etc/spark/conf/log4j.properties file, you'll find a new setup that rolls Spark Streaming logs hourly (probably as suggested here).

    The problem occurs because the ${spark.yarn.app.container.log.dir} system property is not set in the Spark driver process. The property is eventually set to YARN's container log directory, but that happens later (look here and here).

    To fix this error in the Spark driver, add the following to your spark-submit or spark-shell command: --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'
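
    For example, a sketch of how the flag slots into an interactive shell and a batch submission (com.example.MyApp and my-app.jar below are placeholder names, not from the original post):

        # Interactive shell with the property set for the driver process:
        spark-shell --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'

        # Same idea for a batch job submitted with spark-submit:
        spark-submit \
          --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop' \
          --class com.example.MyApp \
          my-app.jar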

    Please note that the /mnt/var/log/hadoop/stderr and /mnt/var/log/hadoop/stdout files will then be shared by all (Spark Streaming) processes started on the same node.

  • 2021-02-07 10:40

    We have also run into this issue and hope some AWS or Spark engineers are reading this. I've narrowed it down to the /etc/spark/conf/log4j.properties file and how the loggers are configured with the ${spark.yarn.app.container.log.dir} system property. That value evaluates to null, so the logging directory now resolves to /stdout and /stderr instead of the desired /mnt/var/log/hadoop-yarn/containers/<app_id>/<container_id>/(stdout|stderr), which is how it worked in EMR < 5.18.0.

    Workaround #1 (not ideal): If you set that property to a static path that the hadoop user can write to, such as /var/log/hadoop-yarn/stderr, things work fine (see the sketch below). This probably breaks things like the history server and an unknown number of other things, but spark-shell and pyspark can start without errors.
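
    A sketch of that edit in /etc/spark/conf/log4j.properties. The exact property lines are reconstructed from the description here and from the DRFA-stderr / DRFA-stdout appenders mentioned in the next workaround, so verify the names against the file on your cluster before changing anything:

        # As shipped on EMR 5.18 the appender File paths reference a variable that is
        # unset in the driver, roughly: ${spark.yarn.app.container.log.dir}/stderr
        # Hard-coding a path the hadoop user can write to avoids the null directory:
        log4j.appender.DRFA-stderr.File=/var/log/hadoop-yarn/stderr
        log4j.appender.DRFA-stdout.File=/var/log/hadoop-yarn/stdout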

    UPDATE Workaround #2 (revert): Not sure why I didn't try this earlier, but comparing this to a 5.13 cluster, the DRFA-stderr and DRFA-stdout appenders did not exist there at all. If you comment those sections out, delete them, or simply copy the log4j.properties file from the template, this problem also goes away (again, with unknown impact on the rest of the services; see the sketch below). I'm not sure where that section originated; the upstream repo's configs do not have those appenders, so it appears to be specific to the AWS distribution.
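
    A rough way to apply that revert on a running master node, as a sketch only: it comments out every log4j line that configures those appenders. Back up the file first and review the result, since other entries may still reference them.

        sudo cp /etc/spark/conf/log4j.properties /etc/spark/conf/log4j.properties.bak
        # Comment out the lines configuring the DRFA-stderr / DRFA-stdout appenders:
        sudo sed -i 's/^\(log4j\..*DRFA-std\)/# \1/' /etc/spark/conf/log4j.properties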
