According to Learning Spark
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
One additional point to note here: since the basic principle of Spark RDDs is immutability, repartition and coalesce each return a new RDD. The base RDD continues to exist with its original number of partitions. So if the use case demands persisting an RDD in cache, the same must be done for the newly created RDD.
scala> pairMrkt.repartition(10)
res16: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[11] at repartition at <console>:26
scala> res16.partitions.length
res17: Int = 10
scala> pairMrkt.partitions.length
res20: Int = 2
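To illustrate the caching point, here is a minimal spark-shell sketch (the data and names are purely illustrative):

import org.apache.spark.storage.StorageLevel

val base = sc.parallelize(1 to 100, 2)           // toy data with 2 partitions
val repartitioned = base.repartition(10)         // a brand-new RDD with 10 partitions
repartitioned.persist(StorageLevel.MEMORY_ONLY)  // caching base would not cache this one
base.partitions.length                           // 2  -- the base RDD is unchanged
repartitioned.partitions.length                  // 10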
Put simply, COALESCE:- only decreases the number of partitions; no shuffling of data takes place, it just merges existing partitions.
REPARTITION:- can both increase and decrease the number of partitions, but shuffling takes place.
Example:-
val rdd = sc.textFile("path", 7)   // "path" is a placeholder; asks for at least 7 partitions
val rdd10 = rdd.repartition(10)    // increase to 10 partitions (full shuffle)
val rdd2  = rdd.repartition(2)     // decrease to 2 partitions (repartition still shuffles)
Both work fine.
Generally, we reach for these two operations when we need to consolidate the output into fewer partitions, e.g. to see it in one place.
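For instance, a quick spark-shell check (toy data) shows coalesce merging partitions downwards but silently refusing to grow them:

val rdd7 = sc.parallelize(1 to 70, 7)   // 7 partitions
rdd7.coalesce(2).partitions.length      // 2  -- partitions merged, no shuffle
rdd7.coalesce(10).partitions.length     // 7  -- coalesce alone cannot increase
rdd7.repartition(10).partitions.length  // 10 -- repartition shuffles and grows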
What follows from the code and the code docs is that coalesce(n) is the same as coalesce(n, shuffle = false) and repartition(n) is the same as coalesce(n, shuffle = true). Thus, both coalesce and repartition can be used to increase the number of partitions.
With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large.
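For example, a small spark-shell sketch (the partition counts are made up):

val small = sc.parallelize(1 to 1000, 100)              // 100 partitions
small.coalesce(1000).partitions.length                  // 100  -- default shuffle = false cannot grow
small.coalesce(1000, shuffle = true).partitions.length  // 1000 -- equivalent to repartition(1000)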
Another important point to accentuate: if you drastically decrease the number of partitions, you should consider using the shuffled version of coalesce (same as repartition in that case). This will allow your computations to be performed in parallel on the parent partitions (multiple tasks).
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
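As a sketch of that trade-off (spark-shell, illustrative data):

val bigRdd = sc.parallelize(1 to 1000000, 200)
// Narrow coalesce: the map and the coalesce fuse into one stage with a single task.
val narrow = bigRdd.map(_ * 2).coalesce(1)
// Shuffled coalesce (same as repartition(1)): the map still runs as 200 parallel
// tasks, and only the added shuffle step funnels everything into one partition.
val parallel = bigRdd.map(_ * 2).coalesce(1, shuffle = true)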
Please also refer to the related answer here
All the answers add some great knowledge to this very frequently asked question. So, going by the tradition of this question's timeline, here are my 2 cents.
I found repartition to be faster than coalesce in one very specific case.
In my application, when the number of files we estimate is lower than a certain threshold, repartition works faster.
Here is what I mean:
import org.apache.spark.sql.SaveMode

if (numFiles > 20)
  df.coalesce(numFiles).write.mode(SaveMode.Overwrite).parquet(dest)    // many files: cheap merge, no full shuffle
else
  df.repartition(numFiles).write.mode(SaveMode.Overwrite).parquet(dest) // few files: full shuffle, but faster here
In the above snippet, if my files were fewer than 20, coalesce was taking forever to finish while repartition was much faster, hence the above code.
Of course, this number (20) will depend on the number of workers and amount of data.
Hope that helps.
The repartition algorithm does a full shuffle of the data and creates equal-sized partitions of data. coalesce combines existing partitions to avoid a full shuffle.
Coalesce works well for taking an RDD with a lot of partitions and combining partitions on a single worker node to produce a final RDD with fewer partitions.
Repartition will reshuffle the data in your RDD to produce the final number of partitions you request.
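You can see the difference by counting elements per partition, e.g. in spark-shell (toy data; exact sizes will vary):

val rdd = sc.parallelize(1 to 100, 10)
rdd.repartition(4).glom().map(_.length).collect()  // roughly equal, e.g. Array(25, 25, 25, 25)
rdd.coalesce(4).glom().map(_.length).collect()     // merged neighbours; sizes follow the old layout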
The partitioning of DataFrames seems like a low-level implementation detail that should be managed by the framework, but it’s not. When filtering large DataFrames into smaller ones, you should almost always repartition the data.
You’ll probably be filtering large DataFrames into smaller ones frequently, so get used to repartitioning.
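A minimal sketch of that pattern (spark-shell; the data and the target partition count of 8 are assumptions):

val large = spark.range(0, 1000000).toDF("id")  // illustrative wide input
val small = large.filter($"id" % 1000 === 0)    // only ~0.1% of the rows survive
val ready = small.repartition(8)                // compact the survivors into 8 healthy partitions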
Read this blog post if you'd like even more details.
Another difference worth considering is the situation where you have a skewed join and need to coalesce on top of it. A repartition will resolve the skew in most cases, and then you can do the coalesce.
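A hedged sketch of that ordering (spark-shell; the inputs, column name, and partition counts are hypothetical, and whether a plain repartition cures a given skew depends on the data):

val left  = spark.range(0, 100000).withColumn("join_key", $"id" % 10)  // imagine this side is skewed
val right = spark.range(0, 10).withColumnRenamed("id", "join_key")

val joined  = left.join(right, "join_key")  // result partitions may be badly skewed
val leveled = joined.repartition(100)       // round-robin shuffle evens the sizes out
val output  = leveled.coalesce(10)          // then coalesce cheaply before writing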
Another situation: suppose you have saved a medium-to-large volume of data in a DataFrame and have to produce it to Kafka in batches. A repartition helps before calling collectAsList and producing to Kafka in certain cases. But when the volume is really high, the repartition will likely cause a serious performance impact. In that case, producing to Kafka directly from the DataFrame would help.
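For the direct-from-DataFrame route, a sketch using Spark's batch Kafka sink (the servers, topic, and column names are placeholders):

df.selectExpr("CAST(id AS STRING) AS key", "CAST(payload AS STRING) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "events")
  .save()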
Side note: coalesce does not avoid data movement in the sense of avoiding all movement of data between workers; it does reduce the amount of shuffling that happens, though. I think that's what the book means.