Spark - repartition() vs coalesce()

前端未结


According to Learning Spark

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of


                      
14 answers

        
                         				            
            
           
            
                              
                
              
              
                
南笙, 2020-11-22 17:42
              
            
            
                                                                       
repartition - recommended when increasing the number of partitions, since it performs a full shuffle of the data.

coalesce - recommended when reducing the number of partitions. For example, if you have 3 partitions and want to reduce them to 2, coalesce moves the data of the 3rd partition into the remaining partitions, while the data already in partitions 1 and 2 stays in the same container.
repartition, on the other hand, shuffles the data in all partitions, so network usage between the executors is high and performance suffers.

coalesce performs better than repartition when reducing the number of partitions.
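
To make the difference in data movement concrete, here is a small pure-Python toy model. It is not Spark code and not how Spark is implemented: the functions `toy_coalesce` and `toy_repartition` are hypothetical names, each list stands in for one partition, and each partition index is treated as if it were its own container. In this sketch a dropped partition is merged whole into one kept partition.

```python
def toy_coalesce(partitions, n):
    """Toy model: keep the first n partitions where they are and merge
    each remaining partition whole into one of them (no full shuffle)."""
    kept = [list(p) for p in partitions[:n]]
    moved = 0
    for i, extra in enumerate(partitions[n:]):
        kept[i % n].extend(extra)
        moved += len(extra)  # only records of dropped partitions move
    return kept, moved


def toy_repartition(partitions, n):
    """Toy model: deal every record round-robin over n new partitions
    and count the records that end up under a different index."""
    records = [(src, r) for src, part in enumerate(partitions) for r in part]
    new_parts = [[] for _ in range(n)]
    moved = 0
    for i, (src, r) in enumerate(records):
        dst = i % n
        new_parts[dst].append(r)
        if dst != src:
            moved += 1
    return new_parts, moved


parts = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
coalesced, coalesce_moved = toy_coalesce(parts, 2)
shuffled, shuffle_moved = toy_repartition(parts, 2)

print([len(p) for p in coalesced], coalesce_moved)
print([len(p) for p in shuffled], shuffle_moved)
```

In this toy run the coalesce variant relocates fewer records but leaves partitions of unequal size, while the repartition variant relocates more records and produces a more even split. In PySpark the corresponding calls on a DataFrame are `df.coalesce(2)` and `df.repartition(2)`.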
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
隐瞒了意图╮, 2020-11-22 17:42
              
            
            
                                                                       
Also, if you are dealing with huge data, make sure the nodes that receive the coalesced data are well provisioned. All of the data will be loaded onto those nodes, which may lead to out-of-memory errors.
Though repartition is costly, I prefer to use it, since it shuffles the data and distributes it evenly.

Choose wisely between coalesce and repartition.
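
The trade-off above can be sketched as a rule of thumb in plain Python. Everything here is an assumption made for illustration: `choose_strategy`, the `limit` parameter (a per-partition size budget), and the merging model are hypothetical and are not part of any Spark API.

```python
def merged_sizes(sizes, n):
    """Partition sizes after merging whole partitions into the first n
    (a toy stand-in for coalesce)."""
    out = list(sizes[:n])
    for i, size in enumerate(sizes[n:]):
        out[i % n] += size
    return out


def even_sizes(sizes, n):
    """Partition sizes after spreading all records evenly over n
    partitions (a toy stand-in for repartition)."""
    base, rem = divmod(sum(sizes), n)
    return [base + (1 if i < rem else 0) for i in range(n)]


def choose_strategy(sizes, n, limit):
    """Hypothetical heuristic: prefer the cheaper merge unless it would
    create a partition larger than the assumed budget `limit`."""
    if n > len(sizes):
        return "repartition"  # a merge alone cannot add partitions
    if max(merged_sizes(sizes, n)) <= limit:
        return "coalesce"
    return "repartition"


skewed = [100, 100, 100, 900]
balanced = [100, 100, 100, 100]

print(merged_sizes(skewed, 2), even_sizes(skewed, 2))
print(choose_strategy(skewed, 2, limit=700))
print(choose_strategy(balanced, 2, limit=700))
```

With the skewed input, merging produces one oversized partition, so the sketch falls back to the even but more expensive redistribution. With the balanced input the cheaper merge stays within the budget.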
                                                                        
                                                        
            
            
              
                