I'm sure this is something very simple but I didn't find anything related to this.
My code is simple:
...
stream = stream.map(mapper)
stream = stream.reduceByKey(reducer)
Alternatively, stream.groupByKey().mapValues(lambda x: list(x)).collect() gives:
key1 [value1]
key2 [value2, value3]
key3 [value4, value5, value6]
The problem here is your reduce function. For each key, reduceByKey calls your reduce function with pairs of values and expects it to produce combined values of the same type.

For example, say that I wanted to perform a word count operation. First, I can map each word to a (word, 1) pair, then I can reduceByKey(lambda x, y: x + y) to sum up the counts for each word. At the end, I'm left with an RDD of (word, count) pairs.
Here's an example from the PySpark API Documentation:
>>> from operator import add
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.reduceByKey(add).collect())
[('a', 2), ('b', 1)]
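If it helps to see that same pattern end to end starting from raw text, here's a minimal sketch (the sample lines are made up; skip the SparkContext setup if you're in the pyspark shell, where sc already exists):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount-example")  # not needed inside the pyspark shell

# a couple of made-up input lines, just for illustration
lines = sc.parallelize(["the cat sat", "the cat ran"])

counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # map each word to a (word, 1) pair
               .reduceByKey(lambda x, y: x + y))     # sum the counts for each word

print(sorted(counts.collect()))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]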
To understand why your example didn't work, you can imagine the reduce function being applied something like this:
reduce(reduce(reduce(firstValue, secondValue), thirdValue), fourthValue) ...
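Your reducer isn't shown, but if it was building lists with something like lambda x, y: [x, y] (just a guess for illustration), that nested application produces nested lists rather than one flat list per key. You can see the same effect with plain Python's reduce:

from functools import reduce

reducer = lambda x, y: [x, y]  # hypothetical reducer, assumed only for illustration
values = ["value4", "value5", "value6"]

print(reduce(reducer, values))
# [['value4', 'value5'], 'value6']  -- nested, not the flat list you wanted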
Based on your reduce function, it sounds like you might be trying to implement the built-in groupByKey operation, which groups each key with a list of its values.
Also, take a look at combineByKey, a generalization of reduceByKey() that allows the reduce function's input and output types to differ (reduceByKey is implemented in terms of combineByKey).
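For instance, collecting the values for each key into a list with combineByKey might look like the sketch below (for this particular case groupByKey is simpler, but it shows how the input value type, a string, can differ from the combined type, a list):

pairs = sc.parallelize([("key1", "value1"), ("key2", "value2"), ("key2", "value3")])  # assumes an existing SparkContext sc, as above

grouped = pairs.combineByKey(
    lambda v: [v],             # createCombiner: start a new list for a key
    lambda acc, v: acc + [v],  # mergeValue: append a value within a partition
    lambda a, b: a + b)        # mergeCombiners: concatenate lists across partitions

print(sorted(grouped.collect()))
# [('key1', ['value1']), ('key2', ['value2', 'value3'])]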