I just set up an AWS EMR cluster (EMR version 5.18 with Spark 2.3.2). When I ssh into the master machine and run spark-shell or pyspark, I get the following error:
If you look into the /etc/spark/conf/log4j.properties file, you'll find a new setup that rolls Spark Streaming logs hourly (probably as suggested here).
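For reference, that hourly rolling is typically done with log4j 1.x DailyRollingFileAppender entries roughly along these lines; the appender name below is illustrative, so check the actual file on your cluster. The important part is the File property pointing at ${spark.yarn.app.container.log.dir}:

log4j.appender.DRFA-stderr=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA-stderr.DatePattern=.yyyy-MM-dd-HH
log4j.appender.DRFA-stderr.File=${spark.yarn.app.container.log.dir}/stderr
log4j.appender.DRFA-stderr.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA-stderr.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n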
The problem occurs because the ${spark.yarn.app.container.log.dir} system property is not set in the Spark driver process. The property is eventually set to YARN's container log directory, but that happens later (see here and here).
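If you want to confirm this on your cluster (assuming the shell still comes up despite the log4j errors), you can query the driver JVM's system properties from spark-shell; without the fix below you would expect an empty result:

sys.props.get("spark.yarn.app.container.log.dir")
// expected without the fix: None, i.e. the property is simply not set in the driver JVM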
To fix this error in the Spark driver, add the following to your spark-submit or spark-shell command:
--driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'
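For example, launching spark-shell on the master node with the property set (the path is the one used above; adjust it if your logs live elsewhere):

spark-shell --driver-java-options='-Dspark.yarn.app.container.log.dir=/mnt/var/log/hadoop'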
Please note that the /mnt/var/log/hadoop/stderr and /mnt/var/log/hadoop/stdout files will then be shared by all the (Spark Streaming) processes started on the same node.
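If that sharing is a concern, one possible workaround (an untested sketch; the directory naming is just an example) is to give each session its own log directory before starting the shell:

LOGDIR=/mnt/var/log/hadoop/spark-driver-$$   # $$ = current shell PID, only to get a unique name
mkdir -p "$LOGDIR"
spark-shell --driver-java-options="-Dspark.yarn.app.container.log.dir=$LOGDIR"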