Spark + Scala transformations, immutability & memory consumption overheads

星月不相逢 2020-12-09 06:40

I have gone through some videos on YouTube regarding Spark architecture.

Even though lazy evaluation, resilience in case of failures, and good functional programming concepts are reasons for the success of Resilient Distributed Datasets, one worrying factor is the memory overhead: since RDDs are immutable, does a job with, say, 10 transformations end up keeping 10 intermediate copies of the data, multiplying memory consumption accordingly?

2 Answers
  • 2020-12-09 07:25

    The memory requirements of Spark are not 10 times the data size just because you have 10 transformations in your Spark job. When you specify the steps of transformations in a job, Spark builds a DAG (directed acyclic graph) which allows it to execute all the steps in the job. After that it breaks the job down into stages. A stage is a sequence of transformations which Spark can execute on a dataset without shuffling.
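
    To make the stage idea concrete, here is a minimal sketch (assuming a running SparkContext named sc, as in spark-shell, and a hypothetical input.txt): the narrow transformations all stay in one stage, and only the reduceByKey introduces a shuffle.

    // Narrow transformations: no data movement, all pipelined within one stage.
    val words  = sc.textFile("input.txt")            // hypothetical input path
                   .map(_.toLowerCase)
                   .filter(_.nonEmpty)
                   .flatMap(_.split(" "))
                   .map(word => (word, 1))

    // Wide transformation: requires a shuffle, so it starts a new stage.
    val counts = words.reduceByKey(_ + _)

    // toDebugString prints the lineage; the indentation levels mark the
    // stage boundaries, without running the job.
    println(counts.toDebugString)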

    When an action is triggered on the RDD, Spark evaluates the DAG. It simply applies all the transformations in a stage together until it hits the end of the stage, so it is unlikely for the memory pressure to be 10 times the data size unless each transformation leads to a shuffle (in which case it is probably a badly written job).
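
    As an illustration of that "badly written job" case, here is a hedged sketch (again assuming a SparkContext sc and a hypothetical input file) contrasting a chain where every wide step forces another shuffle with a single-shuffle word count:

    val pairs = sc.textFile("input.txt")              // hypothetical input path
                  .flatMap(_.split(" "))
                  .map(word => (word, 1))

    // Three wide operations -> three shuffles and three extra stage boundaries:
    val manyShuffles = pairs
      .groupByKey()          // shuffle 1: moves every (word, 1) record
      .mapValues(_.sum)
      .repartition(100)      // shuffle 2: redistributes the counts
      .sortByKey()           // shuffle 3: range-partitions for the sort

    // A single wide operation -> one shuffle, with map-side combining:
    val oneShuffle = pairs.reduceByKey(_ + _)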

    I would recommend watching this talk and going through the slides.

  • 2020-12-09 07:29

    First off, the lazy execution means that functional composition can occur:

    scala> val rdd = sc.makeRDD(List("This is a test", "This is another test", 
                                     "And yet another test"), 1)
    rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at makeRDD at <console>:27
    
    scala> val counts = rdd.flatMap(line => {println(line);line.split(" ")}).
         | map(word => {println(word);(word,1)}).
         | reduceByKey((x,y) => {println(s"$x+$y");x+y}).
         | collect
    This is a test
    This
    is
    a
    test
    This is another test
    This
    1+1
    is
    1+1
    another
    test
    1+1
    And yet another test
    And
    yet
    another
    1+1
    test
    2+1
    counts: Array[(String, Int)] = Array((And,1), (is,2), (another,2), (a,1), (This,2), (yet,1), (test,3))
    

    First note that I force the parallelism down to 1 so that we can see how this looks on a single worker. Then I add a println to each of the transformations so that we can see how the workflow moves. You can see that it processes the line, then it processes the output of that line, followed by the reduction. So, there are not separate states stored for each transformation as you suggested. Instead, each piece of data is pipelined through the entire chain of transformations until a shuffle is needed, as can be seen in the DAG visualization from the Spark UI.
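
    A plain-Scala analogy for this pipelining (only an illustration of lazy iterator composition, not Spark's internal code): within a partition, chained narrow transformations behave like composed Iterators, so no intermediate collection is built and each element flows through the whole chain before the next one starts.

    // Iterators are lazy: the flatMap/map below only compose, they do not run yet.
    val lines = Iterator("This is a test", "This is another test")

    val pairs = lines
      .flatMap { line => println(line); line.split(" ").iterator }
      .map     { word => println(word); (word, 1) }

    // Only when the iterator is consumed do the printlns fire, and they
    // interleave per element, much like the Spark output above.
    pairs.foreach { pair => () }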

    That is the win from the laziness. As to Spark vs Hadoop MapReduce, there is already a lot written out there (just google it), but the gist is that Spark tends to keep intermediate data in memory rather than writing it to disk between steps, giving it a boost right there. Then, there are a number of performance improvements gained by laziness, especially if a schema is known and you can utilize the DataFrames API.
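
    For example, a rough DataFrame sketch of that last point (assuming a SparkSession named spark and a hypothetical people.parquet file with name and age columns):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("lazy-df").getOrCreate()
    import spark.implicits._

    val people = spark.read.parquet("people.parquet")    // hypothetical file

    // Still lazy: filter and select are just recorded in a logical plan.
    val adults = people.filter($"age" >= 18).select($"name")

    // Because the schema is known, Catalyst can push the filter down and prune
    // columns before any row is read; explain() shows the plan without running it.
    adults.explain()
    adults.show()   // the action that actually executes the optimized plan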

    So, overall, Spark beats MR hands down in just about every regard.
