I have a Spark job written in Scala. I use spark-shell -i to run the job. I need to pass a command-line argument to the job.
My solution is to use a custom key to define the arguments instead of spark.driver.extraJavaOptions, in case you someday pass in a value that might interfere with the JVM's behavior.
spark-shell -i your_script.scala --conf spark.driver.args="arg1 arg2 arg3"
You can access the arguments from within your scala code like this:
val args = sc.getConf.get("spark.driver.args").split("\\s+")
args: Array[String] = Array(arg1, arg2, arg3)
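If the script should also run when the key is not supplied, SparkConf.getOption lets you fall back to defaults. A minimal sketch, assuming placeholder default values:
// Fall back to placeholder defaults when spark.driver.args was not set on the command line.
val args: Array[String] = sc.getConf
  .getOption("spark.driver.args")   // Option[String]
  .map(_.split("\\s+"))             // split on whitespace when the key is present
  .getOrElse(Array("default1", "default2"))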
I use extraJavaOptions when I have a Scala script that is too simple to go through the build process but I still need to pass arguments to it. It's not pretty, but it works, and you can quickly pass multiple arguments:
spark-shell -i your_script.scala --conf spark.driver.extraJavaOptions="-Darg1,arg2,arg3"
Note that -D does not belong to the arguments, which are arg1, arg2, and arg3. You can then access the arguments from within your Scala code like this:
import org.apache.spark.SparkConf

val sconf = new SparkConf()
// load string
val paramsString = sconf.get("spark.driver.extraJavaOptions")
// cut off `-D`
val paramsSlice = paramsString.slice(2,paramsString.length)
// split the string with `,` as delimiter and save the result to an array
val paramsArray = paramsSlice.split(",")
// access parameters
val arg1 = paramsArray(0)
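A variation on the same idea (with hypothetical key names, not part of the answer above) is to pass name=value pairs after -D and parse them into a Map, which makes the lookups more readable:
import org.apache.spark.SparkConf

// e.g. --conf spark.driver.extraJavaOptions="-Dinput=/tmp/in,date=2024-01-01"
val raw = new SparkConf().get("spark.driver.extraJavaOptions")
// drop the leading "-D", split the pairs, and build a Map
val params: Map[String, String] = raw
  .stripPrefix("-D")
  .split(",")
  .map { kv => val Array(k, v) = kv.split("=", 2); k -> v }
  .toMap
val inputPath = params("input")   // "/tmp/in"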
spark-shell -i <(echo val theDate = $INPUT_DATE ; cat <file-name>)
This solution causes the following line to be added at the beginning of the file before it is passed to spark-shell:
val theDate = ...
thereby defining a new variable. The way this is done (the <( ... ) syntax) is called process substitution. It is available in Bash. See this question for more on it, and for alternatives (e.g. mkfifo) for non-Bash environments.
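To make the mechanism concrete, this is roughly what spark-shell ends up reading (hypothetical value; if $INPUT_DATE is not a plain number, emit the quotes yourself, e.g. echo "val theDate = \"$INPUT_DATE\""):
val theDate = 20240101   // produced by the echo, assuming INPUT_DATE=20240101
// ...followed by the unmodified contents of <file-name>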
Put the code below in a script (e.g. spark-script.sh), and then you can simply use:
./spark-script.sh your_file.scala first_arg second_arg third_arg
and have an Array[String] called args with your arguments.
The file spark-script.sh:
#!/bin/bash
scala_file=$1
shift 1
arguments="$@"
#set +o posix # to enable process substitution when not running on bash
spark-shell --master yarn --deploy-mode client \
--queue default \
--driver-memory 2G --executor-memory 4G \
--num-executors 10 \
-i <(echo 'val args = "'"$arguments"'".split("\\s+")' ; cat "$scala_file")
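Inside your_file.scala you can then destructure args as usual. A minimal sketch with made-up argument names:
// your_file.scala -- `args` is injected by the wrapper's echo line
val Array(inputPath, outputPath, runDate) = args   // assumes exactly three arguments were passed
println(s"Reading $inputPath, writing $outputPath for run date $runDate")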