My quantized data, 100m in size:
(1424411938, [3885, 7898])
(3333333333, [3885, 7898])
Desired result:
(3885, [1424411938, 3333333333])
(7898, [1424411938, 3333333333])
You can achieve this with a few basic PySpark transformations.
>>> rdd = sc.parallelize([(1424411938, [3885, 7898]),(3333333333, [3885, 7898])])
>>> r = rdd.flatMap(lambda x: ((a,x[0]) for a in x[1]))
We used flatMap to get a (key, value) pair for every item in x[1], changing each record to the format (a, x[0]), where a is each item in x[1]. To understand flatMap better, you can look at the documentation.
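For illustration, collecting r at this stage shows one pair per list item; assuming the element order of the sample input above is preserved, the output would be:
>>> r.collect()
[(3885, 1424411938), (7898, 1424411938), (3885, 3333333333), (7898, 3333333333)]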
>>> r2 = r.groupByKey().map(lambda x: (x[0],tuple(x[1])))
We then grouped all (key, value) pairs by key and used the tuple function to convert each resulting iterable to a tuple.
>>> r2.collect()
[(3885, (1424411938, 3333333333)), (7898, (1424411938, 3333333333))]
As you said, you can use [:150] to keep the first 150 elements; this would be the proper usage:
>>> r2 = r.groupByKey().map(lambda x: (x[0], tuple(x[1])[:150]))
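One caveat for data at your scale: groupByKey materializes every value for a key before [:150] slices the result. As a sketch of an alternative (assuming a cap of 150 values per key is what you want; untested against your data), aggregateByKey can enforce the cap while aggregating:
>>> # zero value: an empty list per key; within a partition, append values
>>> # until the list holds 150; across partitions, concatenate and re-cap.
>>> r2 = r.aggregateByKey(
...     [],
...     lambda acc, v: acc + [v] if len(acc) < 150 else acc,
...     lambda a, b: (a + b)[:150]
... ).map(lambda x: (x[0], tuple(x[1])))
This keeps the per-key state bounded during aggregation instead of collecting the whole group first and slicing afterwards.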
I tried to be as explanatory as possible. I hope this helps.