I have a Spark job that reads data from a configuration file. This file is a typesafe config file.
The code that reads the config looks like this:
Config
So with a little digging in the Spark 1.6.1 source code I found the solution.
These are the steps you need to take to get both log4j.xml and application.conf picked up by your application when submitting to YARN in cluster mode:
1. --files "$ROOT_DIR/application.conf,$LOG4J_FULL_PATH/log4j.xml" (separate the files with a comma)
2. --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:log4j.xml" (notice that once you pass a file with --files, you can refer to it by file name alone, without any path)

Note: I haven't tried it, but from what I saw, if you're trying to run in client mode, I think the spark.driver.extraJavaOptions line should be replaced with something like --driver-java-options.
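
To make the difference between the two modes concrete, here is a minimal dry-run sketch. It only prints the options and never calls spark-submit. The function name submit_options and the two paths are made up for illustration. The client-mode branch is an untested assumption, same as the note above: it guesses that the driver needs the full local path because it does not run inside a YARN container.

```shell
# Hypothetical locations; replace them with your own.
ROOT_DIR=/application.conf-path
LOG4J_FULL_PATH=/log4j-path

# Print the config-related spark-submit options for a deploy mode, one per line.
submit_options() {
  mode="$1"
  printf '%s\n' "--master" "yarn" "--deploy-mode" "$mode"
  # Ship both files; the list is comma-separated with no spaces.
  printf '%s\n' "--files" "$ROOT_DIR/application.conf,$LOG4J_FULL_PATH/log4j.xml"
  if [ "$mode" = "cluster" ]; then
    # Cluster mode: the shipped file sits in the driver's working directory,
    # so the bare file name is enough.
    printf '%s\n' "--conf" "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.xml"
  else
    # Client mode (assumption, untested): the driver runs on the submitting
    # machine, so point at the full local path instead.
    printf '%s\n' "--driver-java-options" "-Dlog4j.configuration=file:$LOG4J_FULL_PATH/log4j.xml"
  fi
}

# Dry run for cluster mode.
submit_options cluster
```

Printing the options first is a cheap way to check the quoting and the comma-separated list before handing them to a real spark-submit call.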
That's it. It's so simple, and I wish these things were documented better. I hope this answer helps someone.
Cheers
Even though this question is from a year ago, I had a similar issue with ConfigFactory.
To be able to read the application.conf file, you have to do two things:

1. Pass --files /path/to/file/application.conf to spark-submit. Note that you can read it from HDFS if you wish.
2. Add --packages com.typesafe:config:version.

Since the application.conf file will be in the same temporary directory as the main application jar, you can assume that in your code.
Using the answer given above (https://stackoverflow.com/a/40586476/6615465), the code for this question would be the following:
LOG4J_FULL_PATH=/log4j-path
ROOT_DIR=/application.conf-path
/opt/deploy/spark/bin/spark-submit \
--packages com.typesafe:config:1.3.2 \
--class com.mycompany.Main \
--master yarn \
--deploy-mode cluster \
--files "$ROOT_DIR/application.conf,$LOG4J_FULL_PATH/log4j.xml" \
--conf spark.executor.extraJavaOptions="-Dconfig.file=application.conf" \
--driver-class-path $ROOT_DIR/application.conf \
--verbose \
/opt/deploy/lal-ml.jar