How to access external property file in spark-submit job?

有刺的猬 · 2021-01-24 02:37 · 2 answers · 477 views

I am using Spark 2.4.1 and Java 8. I am trying to load an external property file while submitting my Spark job with spark-submit.

I am using Typesafe Config to load the properties.

2 Answers
  • 2021-01-24 03:19

    --files and SparkFiles.get

    With --files you should access the resource using SparkFiles.get as follows:

    $ ./bin/spark-shell --files README.md
    
    scala> import org.apache.spark._
    import org.apache.spark._
    
    scala> SparkFiles.get("README.md")
    res0: String = /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/spark-f0b16df1-fba6-4462-b956-fc14ee6c675a/userFiles-eef6d900-cd79-4364-a4a2-dd177b4841d2/README.md
    

    In other words, Spark distributes the --files to the executors, but the only way to learn the local path of a file is the SparkFiles utility.
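    The same pattern in Java (since the question uses Java 8) might look like the sketch below. The SparkFiles.get call is shown only as a comment because it works solely inside a running Spark application; the file name applicationNew.properties is taken from the question, and the temp file merely simulates a file that spark-submit --files would have distributed:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class Main {
    // In a real Spark job, resolve the local path on the driver/executor with:
    //   String path = org.apache.spark.SparkFiles.get("applicationNew.properties");
    static Properties loadProperties(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a file that --files would have shipped to this node.
        Path tmp = Files.createTempFile("applicationNew", ".properties");
        Files.write(tmp, "db.host=localhost\n".getBytes());
        Properties props = loadProperties(tmp.toString());
        System.out.println(props.getProperty("db.host")); // prints "localhost"
    }
}
```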

    getResourceAsStream(resourceFile) and InputStream

    The other option would be to package all resource files into a jar file and bundle it together with the other jar files (either as a single uber-jar or simply as part of CLASSPATH of the Spark app) and use the following trick:

    this.getClass.getClassLoader.getResourceAsStream(resourceFile)
    

    With that, regardless of the jar file the resourceFile is in, as long as it's on the CLASSPATH, it should be available to the application.

    I'm pretty sure any decent framework or library that uses resource files for configuration, e.g. Typesafe Config, accepts InputStream as the way to read resource files.
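    As a sketch of that approach (plain java.util.Properties standing in for any such library, and the resource name taken from the question; the file must actually be bundled in one of the jars, e.g. under src/main/resources, for the lookup to succeed):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ResourceConfig {
    // Load a properties file from anywhere on the CLASSPATH,
    // no matter which jar it was bundled into.
    static Properties fromClasspath(String resourceFile) throws IOException {
        Properties props = new Properties();
        try (InputStream in = ResourceConfig.class.getClassLoader()
                .getResourceAsStream(resourceFile)) {
            if (in == null) {
                throw new IOException("Resource not found on CLASSPATH: " + resourceFile);
            }
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) {
        try {
            Properties props = fromClasspath("applicationNew.properties");
            System.out.println(props.getProperty("db.host"));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```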


    You could also bundle such files into a jar that is part of the CLASSPATH of the executors, but that would obviously be less flexible (every time you wanted to submit your Spark app with a different file, you would have to rebuild the jar).

  • 2021-01-24 03:24

    The proper way to list files for --files, --jars, and other similar arguments is to separate them with commas and no spaces (this is crucial; the exception about an invalid main class appears precisely because of this):

    --files /local/apps/log4j.properties,/local/apps/applicationNew.properties
    

    If the file names themselves contain spaces, use quotes to escape them:

    --files "/some/path with/spaces.properties,/another path with/spaces.properties"
    

    Another issue is that you specify the same property twice:

    ...
    --conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
    ...
    --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
    ...
    

    There is no way for spark-submit to merge these values, so only one of them is used. This is why you see null for the config.file system property: the second --conf argument simply takes priority and overwrites extraJavaOptions with only the path to the log4j config file. The correct way is to specify all these values as one property:

    --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties"
    

    Note that because of quotes, the entire spark.driver.extraJavaOptions="..." is one command line argument rather than several, which is very important for spark-submit to pass these arguments to the driver/executor JVM correctly.

    (I also changed the log4j.properties file to use a proper URI instead of a file. I recall that without this path being a URI it might not work, but you can try either way and check for sure.)
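    Putting both fixes together (comma-separated --files and a single merged extraJavaOptions), the full command might look like the following. The class name, master, deploy mode, and application jar are placeholders, not from the question; only the property file paths and the --conf value come from it:

```shell
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --files /local/apps/log4j.properties,/local/apps/applicationNew.properties \
  --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties" \
  my-app.jar
```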
