What are the Spark transformations that cause a Shuffle?

情书的邮戳 2020-11-29 22:55

I have trouble finding, in the Spark documentation, which operations cause a shuffle and which do not. In this list, which ones cause a shuffle and which ones do not?

4 Answers
  • 2020-11-29 23:29

    Here is a list of operations that might cause a shuffle (see the sketch after the list for a way to verify any of them yourself):

    cogroup

    groupWith

    join: hash partition

    leftOuterJoin: hash partition

    rightOuterJoin: hash partition

    groupByKey: hash partition

    reduceByKey: hash partition

    combineByKey: hash partition

    sortByKey: range partition

    distinct

    intersection: hash partition

    repartition

    coalesce

    Source: Big Data Analysis with Spark and Scala, Optimizing with Partitions, Coursera
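
    A quick way to double-check any entry in this list is to look at the resulting RDD's dependencies. The sketch below assumes a spark-shell with a SparkContext named sc; the shuffles helper is just a hypothetical convenience for illustration, not part of the Spark API.

    import org.apache.spark.ShuffleDependency
    import org.apache.spark.rdd.RDD

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val mapped  = pairs.mapValues(_ + 1)    // narrow dependency: no shuffle expected
    val reduced = pairs.reduceByKey(_ + _)  // hash-partitioned: shuffle expected

    // Hypothetical helper: does this RDD sit directly behind a shuffle?
    def shuffles(rdd: RDD[_]): Boolean =
      rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

    println(shuffles(mapped))   // false
    println(shuffles(reduced))  // true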

  • 2020-11-29 23:31

    It is actually quite easy to find this out without the documentation. For any of these functions, just create an RDD and call toDebugString. Here is one example; you can do the rest on your own.

    scala> val a  = sc.parallelize(Array(1,2,3)).distinct
    scala> a.toDebugString
    MappedRDD[5] at distinct at <console>:12 (1 partitions)
      MapPartitionsRDD[4] at distinct at <console>:12 (1 partitions)
        ShuffledRDD[3] at distinct at <console>:12 (1 partitions)
          MapPartitionsRDD[2] at distinct at <console>:12 (1 partitions)
            MappedRDD[1] at distinct at <console>:12 (1 partitions)
              ParallelCollectionRDD[0] at parallelize at <console>:12 (1 partitions)
    

    So, as you can see, distinct creates a shuffle. It is also particularly important to find out this way rather than from the docs, because there are situations where a shuffle will or will not be required for a certain function. For example, join usually requires a shuffle, but if you join two RDDs that are partitioned the same way (for example, because they branch from the same already-partitioned RDD), Spark can sometimes elide the shuffle; see the sketch below.
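
    To make the join case concrete, here is a minimal sketch (again assuming a spark-shell with sc): both sides are pre-partitioned with the same HashPartitioner, so the join itself can reuse the existing layout instead of shuffling again. The toDebugString output should show ShuffledRDDs only for the partitionBy steps, not a new one for the join.

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(4)

    // partitionBy shuffles once up front; caching keeps the partitioned layout.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
    val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()

    // Because both sides share the same partitioner, the join can avoid
    // introducing another shuffle.
    val joined = left.join(right)
    println(joined.toDebugString)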

  • 2020-11-29 23:32

    Here is the generalised statement on shuffling transformations.

    Transformations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.

    source
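
    One nuance behind "repartition operations like repartition and coalesce": coalesce only shuffles when you ask it to, while repartition is coalesce with shuffle = true. A small sketch (assuming a spark-shell with sc):

    val rdd = sc.parallelize(1 to 100, 8)

    val narrowed      = rdd.coalesce(2)                  // shrinks partitions, no shuffle by default
    val reshuffled    = rdd.coalesce(2, shuffle = true)  // explicitly shuffles
    val repartitioned = rdd.repartition(16)              // always shuffles

    println(narrowed.toDebugString)       // no ShuffledRDD in the lineage
    println(repartitioned.toDebugString)  // shows a ShuffledRDD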

  • 2020-11-29 23:42

    This might be helpful: https://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations

    or this: http://www.slideshare.net/SparkSummit/dev-ops-training, starting with slide 208

    from slide 209: "Transformations that use 'numPartitions' like distinct will probably shuffle"
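
    As a small illustration of that heuristic (assuming a spark-shell with sc), these transformations all accept an explicit numPartitions argument, which implies producing a new partition layout and therefore a shuffle:

    val nums  = sc.parallelize(Seq(1, 2, 2, 3))
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    nums.distinct(4)             // numPartitions => shuffle
    pairs.groupByKey(4)          // numPartitions => shuffle
    pairs.reduceByKey(_ + _, 4)  // numPartitions => shuffle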
