Can I add arguments to Python code when I submit a Spark job?

予麋鹿 2020-12-28 13:24

I'm trying to use spark-submit to execute my Python code on a Spark cluster.

Generally we run spark-submit with Python code like below.
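For example (the script name and master URL here are just placeholders):

    spark-submit --master yarn code.py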

5 Answers
  • 2020-12-28 13:30

    Aniket Kulkarni's spark-submit args.py a b c d e seems to suffice, but it's worth mentioning that we had issues with optional/named args (e.g., --param1).

    It appears that a double dash (--) helps signal that the Python script's optional args follow:

    spark-submit --sparkarg xxx yourscript.py -- --scriptarg 1 arg1 arg2
    
  • 2020-12-28 13:31

    Yes: Put this in a file called args.py

    import sys
    print(sys.argv)
    

    If you run

    spark-submit args.py a b c d e 
    

    You will see:

    ['/spark/args.py', 'a', 'b', 'c', 'd', 'e']
    
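    Note that sys.argv[0] is the path of the script itself; if you only want the submitted arguments, slice it off (a minimal variation of the same args.py):

    import sys

    # Drop sys.argv[0] (the script path) and keep only the submitted arguments
    print(sys.argv[1:])   # ['a', 'b', 'c', 'd', 'e']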
  • 2020-12-28 13:51

    You can pass arguments on the spark-submit command line and access them in your code through sys.argv: sys.argv[1] is the first argument, sys.argv[2] the second, and so on.

    For example, the script below collects the arguments you pass on the spark-submit command line:

    import sys

    # First argument: how many table names follow
    n = int(sys.argv[1])

    # Remaining arguments: the table names themselves
    tables = sys.argv[2:2 + n]
    print(tables)
    

    Save the above file as PysparkArg.py and execute the spark-submit command below:

    spark-submit PysparkArg.py 3 table1 table2 table3
    

    Output:

    ['table1', 'table2', 'table3']
    

    This pattern is useful in PySpark jobs that need to fetch multiple tables from a database, where the number of tables and the table names are supplied by the user on the spark-submit command line.
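    As a rough sketch of how those names could then be used, assuming the tables are registered in the Spark catalog (the session and variable names below are made up for illustration):

    import sys

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PysparkArg").getOrCreate()

    n = int(sys.argv[1])
    tables = sys.argv[2:2 + n]

    # Load each requested table into a DataFrame, keyed by table name
    dataframes = {name: spark.read.table(name) for name in tables}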

  • 2020-12-28 13:54

    Even though sys.argv is a good solution, I still prefer this more proper way of handling command-line args in my PySpark jobs:

    import argparse
    
    parser = argparse.ArgumentParser()
    parser.add_argument("--ngrams", help="some useful description.")
    args = parser.parse_args()
    if args.ngrams:
        ngrams = args.ngrams
    

    This way, you can launch your job as follows:

    spark-submit job.py --ngrams 3
    

    More information about the argparse module can be found in the Argparse Tutorial.
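    Note that spark-submit's own options go before the .py file, while everything after the script is handed to your argparse parser. For example (the master value is just an assumption):

    spark-submit --master yarn job.py --ngrams 3

    If the value should be numeric, argparse can also convert it for you with parser.add_argument("--ngrams", type=int, ...).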

  • 2020-12-28 13:57

    Yes, it's possible: http://caen.github.io/hadoop/user-spark.html

    # Run as a YARN job on your_queue, with 10 executors,
    # 12 GB of memory and 2 CPU cores per executor
    spark-submit \
        --master yarn-client \
        --queue <your_queue> \
        --num-executors 10 \
        --executor-memory 12g \
        --executor-cores 2 \
        job.py ngrams/input ngrams/output
    
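    For completeness, job.py in that command is expected to read the two positional paths itself. A minimal sketch, assuming the job simply reads text from the input path and writes it to the output path (the real ngrams job would do its own processing):

    import sys

    from pyspark.sql import SparkSession

    # The two arguments after job.py on the spark-submit command line
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("job").getOrCreate()

    # Placeholder processing: read the input and write it back out
    df = spark.read.text(input_path)
    df.write.mode("overwrite").text(output_path)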