I just set up an AWS EMR cluster (EMR version 5.18 with Spark 2.3.2). I SSH into the master machine, run spark-shell or pyspark, and get the following error:
In order to fix this issue, you can add a configuration in JSON format during EMR provisioning. We use a configuration like this:
{
  "Classification": "yarn-site",
  "Configurations": [],
  "Properties": {
    "spark.yarn.app.container.log.dir": "/var/log/hadoop-yarn"
  }
}
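One way to apply the snippet above is to save it to a file and pass it when creating the cluster; note that the --configurations file expects a JSON array, so wrap the object above in [ ... ]. The cluster name, instance settings, and file name below are placeholders:

aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-5.18.0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://log-dir-config.json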
If you look into the /etc/spark/conf/log4j.properties file, you'll find a new setup that rolls Spark Streaming logs hourly (probably as suggested here). The problem occurs because the ${spark.yarn.app.container.log.dir} system property is not set in the Spark driver process. The property is eventually set to YARN's container log directory, but that happens later (look here and here).
To fix this error in the Spark driver, add the following to your spark-submit or spark-shell command:
--driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'
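For example, on the master node the same flag works for both shells (pyspark passes it through to spark-submit as well):

spark-shell --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'
pyspark --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'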
Please note that the /mnt/var/log/hadoop/stderr and /mnt/var/log/hadoop/stdout files will be reused by all the (Spark Streaming) processes started on the same node.
We have also run into this issue and hope some AWS or Spark engineers are reading this. I've narrowed it down to the /etc/spark/conf/log4j.properties file and how the loggers are configured using the ${spark.yarn.app.container.log.dir} system property. That value evaluates to null, so the logging directory now resolves to /stdout and /stderr instead of the desired /mnt/var/log/hadoop-yarn/containers/<app_id>/<container_id>/(stdout|stderr), which is how it worked in EMR versions before 5.18.0.
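For reference, the offending section looks roughly like this (a sketch only; the exact appender settings are an assumption and may differ between EMR releases):

log4j.appender.DRFA-stderr=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA-stderr.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.DRFA-stderr.File=${spark.yarn.app.container.log.dir}/stderr

With the property unset, the File value collapses to /stderr (and likewise /stdout for the DRFA-stdout appender), which the process cannot create.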
Workaround #1 (not ideal): If you set that property to a static path which the hadoop user has access to, like /var/log/hadoop-yarn/stderr, things work fine. This probably breaks things like the history server and an unknown number of other things, but spark-shell and pyspark can start without errors.
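One way to set the property without editing files on every node (an untested assumption on our side) is to inject it through a spark-defaults classification at provisioning time, so the driver JVM starts with it already defined:

{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.driver.extraJavaOptions": "-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn"
  }
}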
UPDATE Workaround #2 (revert): Not sure why I didn't do this earlier, but comparing this to a 5.13 cluster, the DRFA-stderr and DRFA-stdout appenders did not exist there at all. If you comment those sections out, delete them, or simply copy the log4j.properties file from the template, this problem also goes away (again, with unknown impact on the rest of the services). I'm not sure where that section originated; the configs in the upstream master repo do not have those appenders, so it appears to be proprietary to the AWS distros.
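A quick way to do the "comment them out" variant on the master node is something like the following (the sed pattern is an assumption about how the appender lines are named; -i.bak keeps a backup copy of the original file):

sudo sed -i.bak -E 's/^(log4j\.appender\.DRFA-std(err|out).*)$/#\1/' /etc/spark/conf/log4j.properties

This prefixes every DRFA-stderr/DRFA-stdout appender definition with # and leaves the rest of the file untouched; if any logger lines still reference those appenders, remove those references as well.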