If I start up pyspark and then run this command:
import my_script; spark = my_script.Sparker(sc); spark.collapse('./data/')
everything works fine.
If you have built a Spark application, you need to use spark-submit to run it.
The code can be written in either Python or Scala.
The mode can be either local or cluster.
If you just want to test or run a few individual commands, you can use the shell provided by Spark.
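For example, a minimal sketch of the question's my_script.py (the Sparker/collapse internals here are made up purely for illustration) could be written so that it works both inside pyspark, where the shell already provides sc, and under spark-submit, where the script has to create its own context:

# my_script.py -- a minimal sketch; the real Sparker/collapse logic is not shown in the question
import sys
from pyspark.sql import SparkSession

class Sparker:
    def __init__(self, sc):
        self.sc = sc  # reuse whatever SparkContext is handed in (the shell's `sc`, or our own below)

    def collapse(self, path):
        # placeholder logic: count the lines under the given path
        return self.sc.textFile(path).count()

if __name__ == "__main__":
    # Under spark-submit there is no ready-made `sc`, so the script builds its own session.
    spark_session = SparkSession.builder.appName("collapse-job").getOrCreate()
    path = sys.argv[1] if len(sys.argv) > 1 else "./data/"
    print(Sparker(spark_session.sparkContext).collapse(path))
    spark_session.stop()

In the pyspark shell the __main__ block never runs, so importing the module and passing the shell's sc behaves exactly as in the question.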
spark-submit is a utility to submit your Spark program (or job) to a Spark cluster. If you open the spark-submit utility, you will see that it eventually calls a Scala program:
org.apache.spark.deploy.SparkSubmit
On the other hand, pyspark or spark-shell is a REPL (read–eval–print loop) utility which allows the developer to run their Spark code as they write it and evaluate it on the fly.
Eventually, both of them run a job behind the scenes, and the majority of the options are the same, as you can see by comparing the output of the following commands:
spark-submit --help
pyspark --help
spark-shell --help
spark-submit has some additional options to take your Spark program (Scala or Python) as a bundle (a jar, or a zip/egg for Python) or as an individual .py file.
spark-submit --help
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
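For example, a local test submission of the sketch above might look like this (the master URL, file name, and argument are just illustrative):

spark-submit --master local[2] my_script.py ./data/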
Both also provide a Web UI to track Spark job progress and other metrics.
When you kill your shell (pyspark or spark-shell) with Ctrl+C, the Spark session is killed and the Web UI can no longer show details.
If you look into spark-shell, it has one additional option to run a script line by line using -I:
Scala REPL options:
-I <file> preload <file>, enforcing line-by-line interpretation
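For example, assuming a file named init.scala containing the Scala statements you want preloaded:

spark-shell -I init.scala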
The pyspark command is a REPL (read–eval–print loop) used to start an interactive shell to test a few PySpark commands. It is used during development. (We are talking about Python here.)
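For example, a development session in the shell might look roughly like this (using the names from the question; sc is the SparkContext that the pyspark shell creates for you):

$ pyspark
>>> import my_script
>>> sparker = my_script.Sparker(sc)
>>> sparker.collapse('./data/')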
To run a Spark application written in Scala or Python on a cluster or locally, you can use spark-submit.