Question
Is it always the case that the Driver (the program that contains the application's main method) must be on a master node?
For example, if I set up EC2 with one master and two workers, must my code containing the main method be executed from the master EC2 instance?
If the answer is NO, what would be the best way to set up a system where the driver is outside EC2's master node (let's say the Driver runs on my computer, while the Master and Workers are on EC2)? Do I always have to use spark-submit, or can I do it from an IDE such as Eclipse or IntelliJ IDEA?
If the answer is YES, what would be the best reference to learn more about this (since I need to provide some sort of proof)?
Thank you kindly for your answer; references would be highly appreciated!
Answer 1:
No, it doesn't have to be on the master.
Using spark-submit, you can use --deploy-mode to control how your driver is run: either in client mode, on the machine you run spark-submit from (which could be the master or any other machine), or in cluster mode, on one of the workers.
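For instance, here is a hedged sketch of both modes against a standalone cluster; the master URL, class name, and jar path are placeholders, not values from the question:

```shell
# Client mode: the driver runs on the machine where you invoke spark-submit,
# which need not be the Spark master.
spark-submit --master spark://ec2-master:7077 --deploy-mode client \
  --class com.example.MyApp my-app.jar

# Cluster mode: the driver is launched on one of the cluster's worker machines.
spark-submit --master spark://ec2-master:7077 --deploy-mode cluster \
  --class com.example.MyApp my-app.jar
```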
There is network communication between the workers and the driver, so you want the driver 'close' to the workers, never across a WAN.
You can run from inside a REPL (spark-shell), which could be accessed from your IDE. If you're using a dynamic language like Clojure, you can also just create a SparkContext referencing (through master) a local cluster or the cluster you want to submit jobs to, and then code through the REPL. In practice it isn't this easy.
Source: https://stackoverflow.com/questions/30022086/is-it-always-the-case-that-driver-must-be-on-a-master-node-yes-no-apache-spa