Can I add arguments to Python code when I submit a Spark job?

予麋鹿 2020-12-28 13:24

I'm trying to use spark-submit to execute my Python code on a Spark cluster.

Generally we run spark-submit with Python code like below.
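For example (the script name and master URL here are just placeholders):

    spark-submit --master yarn code.py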

5 Answers
  • 2020-12-28 13:30

    Aniket Kulkarni's spark-submit args.py a b c d e seems to suffice, but it's worth mentioning that we had issues with optional/named args (e.g., --param1).

    It appears that a double dash (--) helps signal that the Python script's optional args follow:

    spark-submit --sparkarg xxx yourscript.py -- --scriptarg 1 arg1 arg2
    
  • 2020-12-28 13:31

    Yes: Put this in a file called args.py

    import sys
    print(sys.argv)
    

    If you run

    spark-submit args.py a b c d e 
    

    You will see:

    ['/spark/args.py', 'a', 'b', 'c', 'd', 'e']
    
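    Note that sys.argv[0] is the path of the script itself; if you only want the submitted arguments, slice it off (a minimal variation of the same args.py):

    import sys

    # Drop sys.argv[0] (the script path) and keep only the submitted arguments
    print(sys.argv[1:])   # ['a', 'b', 'c', 'd', 'e']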
  • 2020-12-28 13:51

    You can pass arguments on the spark-submit command line and access them in your code through sys.argv: sys.argv[1] is the first argument, sys.argv[2] the second, and so on.

    For example, the script below collects the arguments you pass on the spark-submit command line:

    import sys

    # First argument: how many table names follow
    n = int(sys.argv[1])

    # Remaining arguments: the table names themselves
    tables = sys.argv[2:2 + n]
    print(tables)
    

    Save the above file as PysparkArg.py and execute the spark-submit command below:

    spark-submit PysparkArg.py 3 table1 table2 table3
    

    Output:

    ['table1', 'table2', 'table3']
    

    This pattern is useful in PySpark jobs that need to fetch multiple tables from a database, where the number of tables and the table names are supplied by the user on the spark-submit command line.
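    As a rough sketch of how those names could then be used, assuming the tables are registered in the Spark catalog (the session and variable names below are made up for illustration):

    import sys

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PysparkArg").getOrCreate()

    n = int(sys.argv[1])
    tables = sys.argv[2:2 + n]

    # Load each requested table into a DataFrame, keyed by table name
    dataframes = {name: spark.read.table(name) for name in tables}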

  • 2020-12-28 13:54

    Even though sys.argv is a good solution, I still prefer this more proper way of handling command-line args in my PySpark jobs:

    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument("--ngrams", help="some useful description.")
    args = parser.parse_args()
    if args.ngrams:
        ngrams = args.ngrams
    

    This way, you can launch your job as follows:

    spark-submit job.py --ngrams 3
    

    More information about the argparse module can be found in the Argparse Tutorial.
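    Note that spark-submit's own options go before the .py file, while everything after the script is handed to your argparse parser. For example (the master value is just an assumption):

    spark-submit --master yarn job.py --ngrams 3

    If the value should be numeric, argparse can also convert it for you with parser.add_argument("--ngrams", type=int, ...).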

  • 2020-12-28 13:57

    Yes, it's possible: http://caen.github.io/hadoop/user-spark.html

    # Run as a YARN job on your_queue, with 10 executors,
    # 12 GB of memory and 2 CPU cores per executor
    spark-submit \
        --master yarn-client \
        --queue <your_queue> \
        --num-executors 10 \
        --executor-memory 12g \
        --executor-cores 2 \
        job.py ngrams/input ngrams/output
    
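    For completeness, job.py in that command is expected to read the two positional paths itself. A minimal sketch, assuming the job simply reads text from the input path and writes it to the output path (the real ngrams job would do its own processing):

    import sys

    from pyspark.sql import SparkSession

    # The two arguments after job.py on the spark-submit command line
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("job").getOrCreate()

    # Placeholder processing: read the input and write it back out
    df = spark.read.text(input_path)
    df.write.mode("overwrite").text(output_path)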