In an Apache Spark program, how do we know which part of the code will execute in the driver program and which part will execute on the worker nodes?
It is actually pretty simple. Everything that happens inside the closure created by a transformation happens on a worker. This means that anything passed inside map(...), filter(...), mapPartitions(...), groupBy*(...), or aggregateBy*(...) is executed on the workers. That includes reading data from persistent storage or remote sources.
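For example, here is a minimal sketch (the app name, partition count, and numbers are arbitrary) that makes the split visible: the hostname read inside the map closure is an executor's, while the final println over the collected results runs on the driver:

```scala
import java.net.InetAddress
import org.apache.spark.{SparkConf, SparkContext}

object WhereClosuresRun {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("where-closures-run"))

    // Defined on the driver: this only builds the RDD lineage.
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // The closure passed to map is serialized and shipped to the
    // workers, so the hostname it reports is an executor's.
    val hosts = rdd
      .map(_ => InetAddress.getLocalHost.getHostName)
      .distinct()
      .collect() // results are shipped back to the driver

    hosts.foreach(println) // runs on the driver
    sc.stop()
  }
}
```

On a real cluster this prints the executor hostnames; in local mode the driver and "workers" share one JVM, so you will just see a single name.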
Actions like count, reduce(...), and fold(...) are usually executed on both the driver and the workers. The heavy lifting is performed in parallel by the workers, and some final steps, like merging the partial outputs received from the workers, are performed sequentially on the driver.
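Continuing with the sc from the sketch above, reduce illustrates this two-stage execution:

```scala
// Each executor combines the elements of its own partitions in
// parallel; the per-partition results are then sent to the driver,
// which merges them sequentially into the final value.
val total = sc.parallelize(1L to 1000000L, numSlices = 8).reduce(_ + _)
println(total) // the merged result lives on the driver
```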
Everything else, like triggering an action or defining a transformation, happens on the driver. In particular, this means every operation that requires access to the SparkContext. In PySpark it also means communication with the Py4j gateway.
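A hedged sketch of this distinction (the HDFS path is hypothetical): using the SparkContext on the driver is fine, but referencing it inside a closure is not, because the closure is shipped to workers where no SparkContext exists:

```scala
// Fine: SparkContext is used on the driver to define the job.
val lines = sc.textFile("hdfs:///tmp/example.txt") // hypothetical path

// Broken: this closure would be serialized and sent to the workers,
// but SparkContext is not serializable and RDD operations may only
// be invoked from the driver, so Spark fails at runtime.
// lines.map(line => sc.parallelize(Seq(line)).count())
```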
All the closures passed as arguments to methods of JavaRDD/JavaPairRDD (and similar classes), as well as some methods of those classes themselves, will be executed on the Spark worker nodes. Everything else is driver code.
Source: https://stackoverflow.com/questions/33339200/differentiate-driver-code-and-work-code-in-apache-spark