Passing command line arguments to Spark-shell

耶瑟儿~ 2020-12-05 03:23

I have a Spark job written in Scala. I use

spark-shell -i <file-name>

to run the job. I need to pass a command-line argument to the job.

3 Answers
  • 2020-12-05 03:40

    My solution is to use a customized key to define the arguments instead of spark.driver.extraJavaOptions, in case you someday pass in a value that might interfere with the JVM's behavior.

    spark-shell -i your_script.scala --conf spark.driver.args="arg1 arg2 arg3"
    

    You can access the arguments from within your Scala code like this:

    val args = sc.getConf.get("spark.driver.args").split("\\s+")
    args: Array[String] = Array(arg1, arg2, arg3)
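
    If you want this to degrade gracefully when spark.driver.args is not supplied, a small variant using SparkConf.getOption avoids the NoSuchElementException that get would throw. This is just a sketch, assuming the spark-shell's built-in sc:

    // fall back to an empty array when spark.driver.args was not set via --conf
    val args = sc.getConf.getOption("spark.driver.args")
      .map(_.split("\\s+"))
      .getOrElse(Array.empty[String])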
    
  • 2020-12-05 03:54

    I use extraJavaOptions when I have a Scala script that is too simple to be worth a build process but I still need to pass arguments to it. It's not beautiful, but it works, and you can quickly pass multiple arguments:

    spark-shell -i your_script.scala --conf spark.driver.extraJavaOptions="-Darg1,arg2,arg3"
    

    Note that the -D prefix does not belong to the arguments, which are arg1, arg2, and arg3. You can then access the arguments from within your Scala code like this:

    import org.apache.spark.SparkConf  // may already be in scope inside spark-shell
    val sconf = new SparkConf()
    
    // load the raw option string, e.g. "-Darg1,arg2,arg3"
    val paramsString = sconf.get("spark.driver.extraJavaOptions")
    
    // cut off the leading `-D`
    val paramsSlice = paramsString.slice(2, paramsString.length)
    
    // split the string with `,` as delimiter and save the result to an array
    val paramsArray = paramsSlice.split(",")
    
    // access individual parameters
    val arg1 = paramsArray(0)
    
  • 2020-12-05 04:01

    Short answer:

    spark-shell -i <(echo val theDate = $INPUT_DATE ; cat <file-name>)

    Long answer:

    This solution causes the following line to be added at the beginning of the file before it is passed to spark-shell:

    val theDate = ...

    thereby defining a new variable. The way this is done (the <( ... ) syntax) is called process substitution. It is available in Bash; see existing discussions of process substitution for more details and for alternatives (e.g. mkfifo) in non-Bash environments.

    Making this more systematic:

    Put the code below in a script (e.g. spark-script.sh), and then you can simply use:

    ./spark-script.sh your_file.scala first_arg second_arg third_arg

    and you will have an Array[String] named args containing your arguments.

    The file spark-script.sh:

    scala_file=$1
    
    shift 1
    
    arguments="$@"
    
    #set +o posix  # to enable process substitution when not running on bash
    
    spark-shell --master yarn --deploy-mode client \
        --queue default \
        --driver-memory 2G --executor-memory 4G \
        --num-executors 10 \
        -i <(echo 'val args = "'$arguments'".split("\\s+")' ; cat "$scala_file")
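
    Inside your_file.scala the injected line has already defined args, so the script can use it directly. A small hypothetical example of what such a file might contain:

    // args is the Array[String] injected by spark-script.sh before the file's own code
    println(s"received ${args.length} arguments: ${args.mkString(", ")}")
    val firstArg = args(0)  // e.g. "first_arg" from the invocation shown above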
    