I am using spark 2.4.1 version and java8. I am trying to load external property file while submitting my spark job using spark-submit.
As I am using below TypeSafe to lo
With --files
you should access the resource using SparkFiles.get
as follows:
$ ./bin/spark-shell --files README.md
scala> import org.apache.spark._
import org.apache.spark._
scala> SparkFiles.get("README.md")
res0: String = /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/spark-f0b16df1-fba6-4462-b956-fc14ee6c675a/userFiles-eef6d900-cd79-4364-a4a2-dd177b4841d2/README.md
In other words, Spark will distribute the --files
to executors, but the only way to know the path of the files is to use SparkFiles
utility.
The other option would be to package all resource files into a jar file and bundle it together with the other jar files (either as a single uber-jar or simply as part of CLASSPATH of the Spark app) and use the following trick:
this.getClass.getClassLoader.getResourceAsStream(resourceFile)
With that, regardless of the jar file the resourceFile
is in, as long as it's on the CLASSPATH, it should be available to the application.
I'm pretty sure any decent framework or library that uses resource files for configuration, e.g. Typesafe Config, accepts InputStream
as the way to read resource files.
You could also include the --files
as part of a jar file that is part of the CLASSPATH of the executors, but that'd be obviously less flexible (as every time you'd like to submit your Spark app with a different file, you'd have to recreate the jar).
The proper way to list files for the --files
, --jars
and other similar arguments is via a comma without any spaces (this is a crucial thing, and you see the exception about invalid main class precisely because of this):
--files /local/apps/log4j.properties,/local/apps/applicationNew.properties
If file names themselves have spaces in it, you should use quotes to escape these spaces:
--files "/some/path with/spaces.properties,/another path with/spaces.properties"
Another issue is that you specify the same property twice:
...
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
...
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
...
There is no way for spark-submit to know how to merge these values, therefore only one of them is used. This is the reason why you see null
for the config.file
system property: it's just the second --conf
argument takes priority and overrides the extraJavaOptions
property with a single path to the log4j config file. Thus, the correct way is to specify all these values as one property:
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties"
Note that because of quotes, the entire spark.driver.extraJavaOptions="..."
is one command line argument rather than several, which is very important for spark-submit to pass these arguments to the driver/executor JVM correctly.
(I also changed the log4j.properties
file to use a proper URI instead of a file. I recall that without this path being a URI it might not work, but you can try either way and check for sure.)