How to (equally) partition array data in a Spark DataFrame

隐瞒了意图╮ 2020-12-17 04:08

I have a dataframe of the following form:

import scala.util.Random
val localData = (1 to 100).map(i => (i, Seq.fill(Math.abs(Random.nextGaussian() * 100).toInt)(Random.nextDouble)))
// build the DataFrame; column names are assumed
val df = spark.createDataFrame(localData).toDF("id", "data")

1 Answer
  • 2020-12-17 04:14

    As you said, you can increase the number of partitions. I usually use a multiple of the number of cores: the Spark context's default parallelism times 2-3.
    In your case, you could use a bigger multiplier, as in the sketch below.
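
    A minimal sketch of this first option, assuming the df from the question; the multiplier of 4 is an arbitrary value you would tune:

    // target a multiple of the cluster's default parallelism
    val target = spark.sparkContext.defaultParallelism * 4  // multiplier is an assumption
    val repartitioned = df.repartition(target)              // triggers a shuffle
    println(repartitioned.rdd.getNumPartitions)             // verify the new partition count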

    Another solution would be to split your df with a filter, in this way:

    • a df with only the bigger arrays
    • a df with the rest

    You could then repartition each of them, perform the computation, and union them back (see the sketch after this paragraph).
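
    A rough sketch of that split, assuming the df above; the 150-element threshold and the partition counts are assumptions to tune:

    import org.apache.spark.sql.functions.{col, size}

    val threshold = 150  // assumed cut-off between "big" and "small" arrays
    val big = df.filter(size(col("data")) >= threshold).repartition(32)
    val rest = df.filter(size(col("data")) < threshold).repartition(8)

    // ... run the computation on each half separately ...

    val result = big.union(rest)

    Since both halves keep the same schema, union simply concatenates them back into one df.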

    Beware that repartitioning may be expensive, since you have large rows to shuffle around.

    You could have a look at these slides (from slide 27 onwards): https://www.slideshare.net/SparkSummit/custom-applications-with-sparks-rdd-spark-summit-east-talk-by-tejas-patil

    They were experiencing very bad data skew and had to handle it in an interesting way.
