Spark - repartition() vs coalesce()

误落风尘 2020-11-22 17:11

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of

14 Answers
  • 2020-11-22 17:15

    One additional point: because Spark RDDs are immutable, repartition and coalesce each create a new RDD. The base RDD continues to exist with its original number of partitions. If the use case requires persisting the RDD in cache, that has to be done on the newly created RDD.

    scala> pairMrkt.repartition(10)
    res16: org.apache.spark.rdd.RDD[(String, Array[String])] =MapPartitionsRDD[11] at repartition at <console>:26
    
    scala> res16.partitions.length
    res17: Int = 10
    
    scala>  pairMrkt.partitions.length
    res20: Int = 2
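
    The caching point can be sketched as follows. This is a minimal example assuming a local SparkSession; the object name `CacheRepartitioned` and the sample data are made up for illustration and are not from the session above.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object CacheRepartitioned {
      // Returns (base partitions, new partitions, base cached?, new cached?).
      def run(): (Int, Int, Boolean, Boolean) = {
        val spark = SparkSession.builder()
          .master("local[2]")              // assumption: local mode, for illustration only
          .appName("cache-repartitioned")
          .getOrCreate()
        try {
          // Hypothetical stand-in for pairMrkt: a base RDD with 2 partitions.
          val base = spark.sparkContext.parallelize(1 to 1000, numSlices = 2)

          // repartition returns a NEW RDD; cache() is called on that new RDD.
          val wide = base.repartition(10).cache()
          wide.count()                     // an action, so the cache is populated

          (base.partitions.length,
           wide.partitions.length,
           base.getStorageLevel != StorageLevel.NONE,   // base RDD was never cached
           wide.getStorageLevel != StorageLevel.NONE)   // only the new RDD is cached
        } finally spark.stop()
      }

      def main(args: Array[String]): Unit = println(run())
    }
    ```

    Caching `base` instead would not help later stages that read from `wide`, since they are two distinct RDDs.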
    
  • 2020-11-22 17:18

    Put simply, coalesce is only for decreasing the number of partitions. By default it does not shuffle the data; it just merges existing partitions.

    repartition can both increase and decrease the number of partitions, but it always shuffles the data.

    Example:

    val rdd = sc.textFile("path",7)
    rdd.repartition(10)
    rdd.repartition(2)
    

    Both work fine.

    We generally use one of these two operations when we need the output gathered in one place.
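
    To see the difference in partition counts, here is a small sketch. It assumes a local SparkSession and uses `parallelize` with made-up data in place of `sc.textFile("path", 7)`; the object name is hypothetical.

    ```scala
    import org.apache.spark.sql.SparkSession

    object CoalesceVsRepartition {
      // Returns the partition counts after
      // coalesce(2), coalesce(10), repartition(10) and repartition(2).
      def partitionCounts(): (Int, Int, Int, Int) = {
        val spark = SparkSession.builder()
          .master("local[2]")              // assumption: local mode, for illustration only
          .appName("coalesce-vs-repartition")
          .getOrCreate()
        try {
          // Stand-in for the 7-partition RDD from the example above.
          val rdd = spark.sparkContext.parallelize(1 to 700, numSlices = 7)

          (rdd.coalesce(2).partitions.length,      // merges 7 partitions into 2, no shuffle
           rdd.coalesce(10).partitions.length,     // cannot grow without a shuffle: stays at 7
           rdd.repartition(10).partitions.length,  // shuffles up to 10
           rdd.repartition(2).partitions.length)   // shuffles down to 2
        } finally spark.stop()
      }

      def main(args: Array[String]): Unit = println(partitionCounts())
    }
    ```

    Note that `coalesce(10)` does not fail; without a shuffle it simply leaves the RDD at its current 7 partitions.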

  • 2020-11-22 17:19

    What follows from the code and code docs is that coalesce(n) is the same as coalesce(n, shuffle = false) and repartition(n) is the same as coalesce(n, shuffle = true)

    Thus, both coalesce (with shuffle = true) and repartition can be used to increase the number of partitions.

    With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large.

    Another important point: if you drastically decrease the number of partitions, you should consider using the shuffled version of coalesce (the same as repartition in that case). This allows your computations to be performed in parallel on the parent partitions (multiple tasks).

    However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
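
    The effect of the shuffle flag can be checked on the RDD lineage. The sketch below assumes a local SparkSession and made-up data; the object name is hypothetical. It applies to the RDD API; as far as I know, `Dataset.coalesce` takes only the partition count and has no `shuffle` parameter.

    ```scala
    import org.apache.spark.sql.SparkSession

    object ShuffledCoalesce {
      // Returns (partitions after coalesce(8, shuffle = true),
      //          partitions after repartition(8),
      //          does coalesce(1) add a shuffle?,
      //          does coalesce(1, shuffle = true) add a shuffle?).
      def run(): (Int, Int, Boolean, Boolean) = {
        val spark = SparkSession.builder()
          .master("local[2]")              // assumption: local mode, for illustration only
          .appName("shuffled-coalesce")
          .getOrCreate()
        try {
          val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 4)

          // With shuffle = true, coalesce can grow the partition count, like repartition.
          val grown = rdd.coalesce(8, shuffle = true)

          // Drastic coalesce to one partition, without and with a shuffle step.
          val narrow = rdd.coalesce(1)                  // upstream work runs in a single task
          val wide   = rdd.coalesce(1, shuffle = true)  // upstream work keeps 4 parallel tasks

          (grown.partitions.length,
           rdd.repartition(8).partitions.length,
           narrow.toDebugString.contains("ShuffledRDD"),
           wide.toDebugString.contains("ShuffledRDD"))
        } finally spark.stop()
      }

      def main(args: Array[String]): Unit = println(run())
    }
    ```

    The lineage printed by `toDebugString` shows a `ShuffledRDD` only in the shuffled variant, which is the stage boundary that lets the upstream partitions be computed in parallel.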

    Please also refer to the related answer here

  • 2020-11-22 17:25

    All the answers are adding some great knowledge into this very often asked question.

    So going by tradition of this question's timeline, here are my 2 cents.

    I found repartition to be faster than coalesce in a very specific case.

    In my application, when the estimated number of files is lower than a certain threshold, repartition works faster.

    Here is what I mean:

    if(numFiles > 20)
        df.coalesce(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
    else
        df.repartition(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)
    

    In the above snippet, when the number of files was below 20, coalesce took forever to finish while repartition was much faster, hence the code above.

    Of course, this number (20) will depend on the number of workers and amount of data.

    Hope that helps.

  • 2020-11-22 17:29

    The repartition algorithm does a full shuffle of the data and creates equal sized partitions of data. coalesce combines existing partitions to avoid a full shuffle.

    Coalesce works well for taking an RDD with a lot of partitions and combining partitions on a single worker node to produce a final RDD with fewer partitions.

    Repartition will reshuffle the data in your RDD to produce the final number of partitions you request. The partitioning of DataFrames seems like a low level implementation detail that should be managed by the framework, but it’s not. When filtering large DataFrames into smaller ones, you should almost always repartition the data. You’ll probably be filtering large DataFrames into smaller ones frequently, so get used to repartitioning.
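
    The filtering case can be sketched like this. It is a minimal example assuming a local SparkSession; the object name, the synthetic `spark.range` data, and the numbers 200 and 4 are made up for illustration.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object FilterThenRepartition {
      // Returns (partitions after filter, partitions after repartition, row count).
      def run(): (Int, Int, Long) = {
        val spark = SparkSession.builder()
          .master("local[2]")              // assumption: local mode, for illustration only
          .appName("filter-then-repartition")
          .getOrCreate()
        try {
          // A "large" DataFrame: one million ids spread over 200 partitions.
          val big = spark.range(0L, 1000000L, 1L, 200)

          // Filtering keeps 1 row in 1000 but leaves the partition count unchanged,
          // so the result is 200 nearly empty partitions.
          val small = big.filter(col("id") % 1000 === 0)

          // Repartitioning packs the remaining rows into a sensible number of partitions.
          val compact = small.repartition(4)

          (small.rdd.getNumPartitions, compact.rdd.getNumPartitions, compact.count())
        } finally spark.stop()
      }

      def main(args: Array[String]): Unit = println(run())
    }
    ```

    The right target partition count depends on the data volume after filtering; 4 here is arbitrary.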

    Read this blog post if you'd like even more details.

  • 2020-11-22 17:30

    Another difference shows up when there is a skewed join and you have to coalesce on top of it. A repartition will resolve the skew in most cases; you can then do the coalesce.

    Another situation: suppose you have a medium or large volume of data in a DataFrame and have to produce it to Kafka in batches. In certain cases a repartition helps before calling collectAsList and producing to Kafka. But when the volume is really high, the repartition is likely to hurt performance seriously. In that case, producing to Kafka directly from the DataFrame would help.

    Side note: coalesce does not avoid data movement altogether; what it avoids is a full movement of data between workers, so it reduces the amount of shuffling. I think that's what the book means.
