After reading the documentation at http://spark.apache.org/docs/0.8.0/cluster-overview.html, I have some questions I would like to clarify.
Take this example from Spark:
When you create the SparkContext, each worker starts an executor. This is a separate process (JVM), and it loads your jar too. The executors connect back to your driver program. Now the driver can send them commands, like flatMap, map and reduceByKey in your example. When the driver quits, the executors shut down.
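To make the sequence of commands concrete, here is a plain-Python sketch (not PySpark) of what the driver asks the executors to compute in a word count; the variable names and data are made up, and each step mirrors the corresponding Spark operation:

```python
from collections import Counter
from itertools import chain

# Illustration only: the same flatMap -> map -> reduceByKey pipeline,
# expressed on local Python collections.
lines = ["to be or", "not to be"]

words = list(chain.from_iterable(line.split() for line in lines))  # flatMap
pairs = [(w, 1) for w in words]                                    # map
counts = Counter()                                                 # reduceByKey
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark, each of these steps would run distributed over the executors rather than in a single process.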
RDDs are sort of like big arrays that are split into partitions, and each executor can hold some of these partitions.
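The "big array split into partitions" idea can be sketched like this; the round-robin split is only an illustration, not Spark's actual partitioning scheme:

```python
# Hedged sketch: an RDD behaves like a big array split into partitions,
# and each executor holds some of these partitions.
def split_into_partitions(xs, n):
    parts = [[] for _ in range(n)]
    for i, x in enumerate(xs):
        parts[i % n].append(x)   # distribute elements across partitions
    return parts

data = list(range(10))
partitions = split_into_partitions(data, 3)
print(partitions)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```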
A task is a command sent from the driver to an executor by serializing your Function object. The executor deserializes the command (this is possible because it has loaded your jar), and executes it on a partition.
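The "serialize a function, ship it, run it on a partition" idea can be sketched with Python's pickle module. A top-level function is pickled by its qualified name, so the receiving process can deserialize it only if it has the same code loaded, which is loosely analogous to the executor having loaded your jar:

```python
import pickle

def double(x):               # stand-in for the Function object in a task
    return x * 2

blob = pickle.dumps(double)  # driver side: serialize the command
fn = pickle.loads(blob)      # executor side: deserialize it
partition = [1, 2, 3]        # one partition of the RDD
result = [fn(x) for x in partition]
print(result)  # [2, 4, 6]
```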
(This is a conceptual overview. I am glossing over some details, but I hope it is helpful.)
To answer your specific question: No, a new process is not started for each step. A new process is started on each worker when the SparkContext is constructed.
To get clear insight into how tasks are created and scheduled, we must understand how the execution model works in Spark. Briefly, an application in Spark is executed in three steps:

1. create the RDD graph, i.e. the DAG of RDDs that represents the whole computation
2. create an execution plan from the graph: pipeline transformations where possible and split the graph into stages
3. generate tasks from the stages and schedule them on the workers
In your word-count example, the RDD graph is rather simple; it is something like this:
file -> lines -> words -> per-word count -> global word count -> output
Based on this graph, two stages are created. The stage creation rule is to pipeline as many narrow transformations as possible. In your example, the narrow transformations finish at per-word count, so you get two stages:

file -> lines -> words -> per-word count
global word count -> output
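The stage split can be sketched in plain Python (illustration only, not Spark): the narrow transformations run fused, one partition at a time, and the shuffle boundary at reduceByKey separates the two stages:

```python
from collections import defaultdict

partitions = [["to be or"], ["not to be"]]  # one input partition per file

# Stage 1 (one ShuffleMapTask per partition): file -> lines -> words ->
# per-word count; narrow transformations are pipelined element by element.
def stage1(partition):
    out = defaultdict(int)
    for line in partition:
        for word in line.split():
            out[word] += 1
    return dict(out)

shuffle_outputs = [stage1(p) for p in partitions]

# Stage 2 (ResultTask): global word count -> output, merging the
# shuffle outputs produced by stage 1.
def stage2(outputs):
    total = defaultdict(int)
    for out in outputs:
        for word, n in out.items():
            total[word] += n
    return dict(total)

print(stage2(shuffle_outputs))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```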
Once the stages are figured out, Spark will generate tasks from the stages. The first stage will create ShuffleMapTasks, and the last stage will create ResultTasks, because the last stage includes an action operation that produces the results.
The number of tasks to be generated depends on how your files are distributed. Suppose you have three different files on three different nodes; the first stage will then generate 3 tasks: one task per partition.
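That one-task-per-partition rule can be spelled out in a tiny sketch; the partition and node names below are hypothetical:

```python
# One first-stage task is created per input partition, regardless of
# which node holds it (locality only influences where the task runs).
partition_locations = {"part-0": "node1", "part-1": "node2", "part-2": "node3"}

tasks = [("ShuffleMapTask", part) for part in partition_locations]
print(len(tasks))  # 3
```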
Therefore, you should not map your steps to tasks directly. A task belongs to a stage, and is related to a partition.
Usually, the number of tasks run for a stage is exactly the number of partitions of the final RDD, but since RDDs can be shared (and hence so can ShuffleMapStages), their number varies depending on the RDD/stage sharing. Please refer to How DAG works under the covers in RDD?