I am running a Spark cluster on Google Cloud and I upload a configuration file with each job. What is the path to a file that is uploaded with a submit command?
In the example below, how can I read the file Configuration.properties before the SparkContext has been initialized? I am using Scala.
gcloud dataproc jobs submit spark --cluster my-cluster --class MyJob --files config/Configuration.properties --jars my.jar
The local path to a file distributed through the SparkFiles mechanism (the --files argument or the SparkContext.addFile method) can be obtained using SparkFiles.get:
org.apache.spark.SparkFiles.get(fileName)
You can also get the path to the root directory using SparkFiles.getRootDirectory:
org.apache.spark.SparkFiles.getRootDirectory
You can use these combined with standard IO utilities to read the files.
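As a minimal sketch of the "standard IO utilities" part, the helper below loads a .properties file from a local path using only java.util.Properties. The names ConfigLoader and loadProperties are hypothetical and chosen for illustration; they are not part of Spark.

```scala
import java.io.FileInputStream
import java.util.Properties

object ConfigLoader {
  // Hypothetical helper: load a .properties file from a local filesystem path.
  def loadProperties(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    // Close the stream even if parsing fails.
    try props.load(in) finally in.close()
    props
  }
}

// Inside a Spark job, once the SparkContext exists, the path would come
// from SparkFiles (assuming the file was shipped with --files):
//   val props = ConfigLoader.loadProperties(
//     org.apache.spark.SparkFiles.get("Configuration.properties"))
```

The helper takes a plain path so it stays independent of Spark; only the call site needs SparkFiles.get to resolve where the file was placed.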
how can I read the file Configuration.properties before the SparkContext has been initialized?
SparkFiles are distributed by the driver, so they cannot be accessed before the context has been initialized, and to be distributed in the first place they have to be accessible from the driver node. This part of the question therefore depends solely on what type of storage you use to expose the file to the driver node.
Source: https://stackoverflow.com/questions/41677897/how-to-get-path-to-the-uploaded-file