I would like to divide an RDD into a number of partitions corresponding to the number of different keys I found (3 in this case):
RDD: [(1,a), (1,b), (1,c), (2,d), (3,e), (3,f), (3,g), (3,h), (3,i)]
Assigning your partitions like this
p1 [(1,a), (1,b), (1,c)]
p2 [(2,d), (3,e), (3,f)]
p3 [(3,g), (3,h), (3,i)]
would mean assigning the same key to different partitions (key 3 goes to either p2 or p3, depending on the record). A partitioner is like a mathematical function: it cannot return different values for the same argument (what would the value depend on then?).
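For illustration, here is a minimal sketch of a custom partitioner (the name KeyPartitioner and the modulo scheme are my own assumptions, not from your question): getPartition is a pure function of the key, so every record with key 3 necessarily lands in a single partition.

import org.apache.spark.Partitioner

// One key -> exactly one partition; a partitioner cannot send
// key 3 to "p2 or p3" any more than f(3) can have two values.
class KeyPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case k: Int => ((k % numPartitions) + numPartitions) % numPartitions
    case _      => 0
  }
}

// Usage: rdd.partitionBy(new KeyPartitioner(3))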
What you could do instead is add something to your partition key, which would result in more buckets (effectively splitting one set into smaller sets). But you have virtually no control over how Spark places your partitions onto the nodes, so data that you wanted on the same node can end up spanning multiple nodes.
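A rough sketch of that idea, often called salting (the salt range of 2 and the HashPartitioner are illustrative assumptions; rdd is the pair RDD from your question):

import org.apache.spark.HashPartitioner
import scala.util.Random

// Widen the key: (3, 0) and (3, 1) are distinct composite keys,
// so key 3 now has two buckets it can hash into.
val salted = rdd.map { case (k, v) => ((k, Random.nextInt(2)), v) }
val repartitioned = salted.partitionBy(new HashPartitioner(6))
// Note: which nodes those six partitions live on is still Spark's call.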
It really boils down to what job you want to perform. I would recommend considering the outcome you want to get and seeing whether you can come up with a smart partition key with a reasonable tradeoff (if it is really necessary). Maybe you could key the values by the letter and then use operations like reduceByKey rather than groupByKey to get your final results, as in the sketch below?
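A hedged sketch of that suggestion, assuming sc is an available SparkContext and that the desired result is, say, the concatenated letters per key:

val rdd = sc.parallelize(Seq(
  (1, "a"), (1, "b"), (1, "c"), (2, "d"), (3, "e"),
  (3, "f"), (3, "g"), (3, "h"), (3, "i")))

// reduceByKey merges values map-side before the shuffle...
val concatenated = rdd.reduceByKey(_ + _)
// e.g. (1,"abc"), (2,"d"), (3,"efghi"); within-key order is not guaranteed

// ...whereas the groupByKey equivalent ships every record across
// the network first:
// rdd.groupByKey().mapValues(_.mkString)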