Internal Work of Spark

夕颜 2020-12-04 05:36

Spark is gaining ground these days. It uses Scala to load and execute programs, and also supports Python and Java. RDDs are used to store the data. But I can't understand

2 Answers
  • 2020-12-04 05:58

    I have also been searching the web to learn about the internals of Spark. Below is what I have learned so far and thought worth sharing here.

    Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
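
    To make the transformation/action split concrete, here is a minimal sketch. It assumes Spark is on the classpath and uses a local two-thread master purely for experimenting; the names numbers, evens, doubled and total are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master with two threads, only for trying things out.
// Inside spark-shell, getOrCreate simply returns the context that already exists.
val conf = new SparkConf().setAppName("transformations-vs-actions").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val numbers = sc.parallelize(1 to 10)     // base RDD
val evens   = numbers.filter(_ % 2 == 0)  // transformation: returns a new RDD, nothing runs yet
val doubled = evens.map(_ * 2)            // transformation: still only recorded, not executed
val total   = doubled.reduce(_ + _)       // action: runs a job and returns an Int to the driver
```

    Only the last line makes Spark do any work; the two lines before it just describe new datasets in terms of existing ones.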

    Spark translates the RDD transformations into a DAG (Directed Acyclic Graph) and then starts the execution.

    At a high level, when an action is called on an RDD, Spark creates the DAG and submits it to the DAG scheduler.

    • The DAG scheduler divides operators into stages of tasks. A stage consists of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; for example, many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages.

    • The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone, YARN or Mesos). The task scheduler doesn't know about dependencies between the stages.

    • The worker (slave) nodes execute the tasks.

    Let's come to how Spark builds the DAG.

    At a high level, two kinds of transformations can be applied to RDDs: narrow transformations and wide transformations. Wide transformations result in stage boundaries.

    Narrow transformation: doesn't require the data to be shuffled across partitions, for example map and filter.

    Wide transformation: requires the data to be shuffled, for example reduceByKey.
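
    The rule "a wide transformation starts a new stage" can be sketched with a toy model in plain Scala. This is not Spark's API or its actual scheduler code: Op, Narrow, Wide and splitIntoStages are hypothetical names invented for the illustration, and the operators are treated as a simple linear chain.

```scala
// Toy model, not Spark's API: an operator is either narrow or wide (needs a shuffle).
sealed trait Op { def name: String }
case class Narrow(name: String) extends Op
case class Wide(name: String) extends Op

// Walk the operators in order. A narrow operator is pipelined into the current
// stage; a wide operator closes the current stage and opens the next one.
def splitIntoStages(ops: List[Op]): List[List[String]] = {
  val grouped = ops.foldLeft(List(List.empty[String])) {
    case (stages, Narrow(n)) => stages.init :+ (stages.last :+ n)
    case (stages, Wide(n))   => stages :+ List(n)
  }
  grouped.filter(_.nonEmpty) // drop the empty first stage if the chain starts with a wide operator
}

val stages = splitIntoStages(
  List(Narrow("textFile"), Narrow("map"), Narrow("map"), Wide("reduceByKey")))
```

    For this chain the result is two stages: the three narrow operators pipelined together, followed by a stage that starts at reduceByKey.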

    Let's take an example: counting how many log messages appear at each level of severity.

    Here is a log file in which each line starts with the severity level:

    INFO I'm Info message
    WARN I'm a Warn message
    INFO I'm another Info message
    

    The following Scala code computes those counts:

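    // Read the log file as an RDD of lines (sc is the SparkContext provided by spark-shell)
    val input = sc.textFile("log.txt")
    // Key each line by its first word (the severity level), then sum the counts per key
    val splitedLines = input.map(line => line.split(" "))
                            .map(words => (words(0), 1))
                            .reduceByKey{(a,b) => a + b}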
    

    This sequence of commands implicitly defines a DAG of RDD objects (the RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents, along with metadata about the type of relationship it has with each parent. For example, when we call val b = a.map() on an RDD a, the RDD b keeps a reference to its parent a; that reference is the lineage.
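
    The parent pointers can be inspected through the dependencies member of an RDD. The sketch below assumes Spark is on the classpath and a local master; the sample lines are inlined so it does not need log.txt, and the names a, b, c and the three result vals are only for illustration.

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

// Assumption: local master for experimenting; in spark-shell the existing context is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lineage-sketch").setMaster("local[2]"))

val a = sc.parallelize(Seq("INFO I'm Info message", "WARN I'm a Warn message"))
val b = a.map(line => (line.split(" ")(0), 1)) // narrow transformation
val c = b.reduceByKey(_ + _)                   // wide transformation

val parentOfB = b.dependencies.head // describes how b depends on a
val parentOfC = c.dependencies.head // describes how c depends on b

val bPointsToA = parentOfB.rdd eq a                                 // same object as a
val bIsNarrow  = parentOfB.isInstanceOf[OneToOneDependency[_]]      // map: one-to-one
val cIsShuffle = parentOfC.isInstanceOf[ShuffleDependency[_, _, _]] // reduceByKey: shuffle
```

    The dependency type is the metadata mentioned above: a one-to-one dependency for map and a shuffle dependency for reduceByKey.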

    To display the lineage of an RDD, Spark provides the debug method toDebugString(). For example, calling toDebugString() on the splitedLines RDD outputs the following:

    (2) ShuffledRDD[6] at reduceByKey at <console>:25 []
        +-(2) MapPartitionsRDD[5] at map at <console>:24 []
        |  MapPartitionsRDD[4] at map at <console>:23 []
        |  log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
        |  log.txt HadoopRDD[0] at textFile at <console>:21 []
    

    The last line of the output (the first from the bottom) shows the input RDD, which we created by calling sc.textFile(). Below is a more diagrammatic view of the DAG created from the given RDD.

    [Figure: RDD DAG graph]

    Once the DAG is built, the Spark scheduler creates a physical execution plan. As mentioned above, the DAG scheduler splits the graph into multiple stages based on the transformations. Narrow transformations are grouped (pipelined) together into a single stage. For our example, Spark creates a two-stage execution as follows:

    [Figure: Stages]

    The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions in the RDD created by textFile. For example, if we have 4 partitions in this example, then 4 sets of tasks will be created and submitted in parallel, provided there are enough slaves/cores. The diagram below illustrates this in a bit more detail:

    [Figure: Task execution]
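
    The "one task per partition" idea can also be simulated with plain Scala collections. This is only a stand-in for a cluster, not Spark code: the three log lines from above are placed by hand into 4 partitions (one of them empty, which is an assumption made to match the 4-partition example), and partitions, stage1Tasks and counts are made-up names.

```scala
// Plain Scala collections standing in for a cluster; this is not Spark code.
// Four hand-made partitions holding the three sample log lines (the last one is empty).
val partitions: Seq[Seq[String]] = Seq(
  Seq("INFO I'm Info message"),
  Seq("WARN I'm a Warn message"),
  Seq("INFO I'm another Info message"),
  Seq.empty)

// Stage 1: one task per partition. Each task extracts the severity level
// from its own lines and pre-aggregates the counts locally.
val stage1Tasks: Seq[Map[String, Int]] = partitions.map { part =>
  part.map(_.split(" ")(0)).groupBy(identity).map { case (level, xs) => (level, xs.size) }
}

// Shuffle + stage 2: bring partial counts with the same key together and sum them.
val counts: Map[String, Int] =
  stage1Tasks.flatten.groupBy(_._1).map { case (level, kvs) => (level, kvs.map(_._2).sum) }
```

    Here stage1Tasks has 4 entries, one per partition (the empty partition still gets its own task), and counts ends up with 2 for INFO and 1 for WARN.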

    For more detail, I suggest going through the following YouTube videos, in which the Spark creators explain the DAG, the execution plan and the job lifetime in depth.

    1. Advanced Apache Spark- Sameer Farooqui (Databricks)
    2. A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
    3. Introduction to AmpLab Spark Internals
  • 2020-12-04 06:14

    The diagram below shows how Apache Spark works internally:
