Spark - repartition() vs coalesce()

前端未结

关注

 14  1793

According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of


                      
              相关标签:


      
      
        
          14条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  无人及你        
                
              
                            
                2020-11-22 17:31
              
            
            
                                                                       
I would like to add to Justin and Power's answer that - 

repartition will ignore existing partitions and create new ones. So you can use it to fix data skew. You can mention partition keys to define the distribution. Data skew is one of the biggest problems in the 'big data' problem space.

coalesce will work with existing partitions and shuffle a subset of them. It can't fix the data skew as much as repartition does. Therefore even if it is less expensive it might not be the thing you need.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  忘掉有多难        
                
              
                            
                2020-11-22 17:31
              
            
            
                                                                       
For someone who had issues generating a single csv file from PySpark (AWS EMR) as an output and saving it on s3, using repartition helped. The reason being, coalesce cannot do a full shuffle, but repartition can. Essentially, you can increase or decrease the number of partitions using repartition, but can only decrease the number of partitions (but not 1) using coalesce. Here is the code for anyone who is trying to write a csv from AWS EMR to s3:

df.repartition(1).write.format('csv')\
.option("path", "s3a://my.bucket.name/location")\
.save(header = 'true')

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  时光取名叫无心        
                
              
                            
                2020-11-22 17:32
              
            
            
                                                                       
It avoids a full shuffle. If it's known that the number is decreasing then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes, onto the nodes that we kept. 

So, it would go something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12


Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)


Notice that Node 1 and Node 3 did not require its original data to move.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  说谎        
                
              
                            
                2020-11-22 17:36
              
            
            
                                                                       
To all the great answers I would like to add that repartition is one the best option to take advantage of data parallelization. While coalesce gives a cheap option to reduce the partitions and it is very useful when writing data to HDFS or some other sink to take advantage of big writes.

I have found this useful when writing data in parquet format to get full advantage.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  不要未来只要你来        
                
              
                            
                2020-11-22 17:39
              
            
            
                                                                       
Justin's answer is awesome and this response goes into more depth.

The repartition algorithm does a full shuffle and creates new partitions with data that's distributed evenly.  Let's create a DataFrame with the numbers from 1 to 12.

val x = (1 to 12).toList
val numbersDf = x.toDF("number")


numbersDf contains 4 partitions on my machine.

numbersDf.rdd.partitions.size // => 4


Here is how the data is divided on the partitions:

Partition 00000: 1, 2, 3
Partition 00001: 4, 5, 6
Partition 00002: 7, 8, 9
Partition 00003: 10, 11, 12


Let's do a full-shuffle with the repartition method and get this data on two nodes.

val numbersDfR = numbersDf.repartition(2)


Here is how the numbersDfR data is partitioned on my machine:

Partition A: 1, 3, 4, 6, 7, 9, 10, 12
Partition B: 2, 5, 8, 11


The repartition method makes new partitions and evenly distributes the data in the new partitions (the data distribution is more even for larger data sets).

Difference between coalesce and repartition

coalesce uses existing partitions to minimize the amount of data that's shuffled.  repartition creates new partitions and does a full shuffle.  coalesce results in partitions with different amounts of data (sometimes partitions that have much different sizes) and repartition results in roughly equal sized partitions.

Is coalesce or repartition faster?

coalesce may run faster than repartition, but unequal sized partitions are generally slower to work with than equal sized partitions.  You'll usually need to repartition datasets after filtering a large data set.  I've found repartition to be faster overall because Spark is built to work with equal sized partitions.

N.B. I've curiously observed that repartition can increase the size of data on disk.  Make sure to run tests when you're using repartition / coalesce on large datasets.

Read this blog post if you'd like even more details.

When you'll use coalesce & repartition in practice


See this question on how to use coalesce & repartition to write out a DataFrame to a single file
It's critical to repartition after running filtering queries.  The number of partitions does not change after filtering, so if you don't repartition, you'll have way too many memory partitions (the more the filter reduces the dataset size, the bigger the problem).  Watch out for the empty partition problem.
partitionBy is used to write out data in partitions on disk.  You'll need to use repartition / coalesce to partition your data in memory properly before using partitionBy.  

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  猫巷女王i        
                
              
                            
                2020-11-22 17:39
              
            
            
                                                                       
Repartition: Shuffle the data into a NEW number of partitions.

Eg. Initial data frame is partitioned in 200 partitions.

df.repartition(500): Data will be shuffled from 200 partitions to new 500 partitions.

Coalesce: Shuffle the data into existing number of partitions.

df.coalesce(5): Data will be shuffled from remaining 195 partitions to 5 existing partitions.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
3
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复