Spark: Get top N by key

星月不相逢 2020-12-30 08:32

Say I have a PairRDD like this (obviously much more data in real life; assume millions of records):

val scores = sc.parallelize(Array(
      ("a", 1),
      ...

What is the best way to get the top N (say, the top 2) scores for each key?
4 Answers
  • 2020-12-30 08:58

    I think this should be quite efficient:

    Edited according to the OP's comments:

    // Keep only the two largest values per key, carried along as a (max, second max) pair.
    scores.mapValues(p => (p, p)).reduceByKey((u, v) => {
      // Merge two candidate pairs: take the two largest distinct values among the four.
      val values = List(u._1, u._2, v._1, v._2).sorted(Ordering[Int].reverse).distinct
      if (values.size > 1) (values(0), values(1))
      else (values(0), values(0))
    }).collect().foreach(println)
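
    If you need the top N for some N other than 2, the same reduce-side idea generalizes. A minimal sketch (not from the original answer), assuming N = 3 and using aggregateByKey to carry a bounded list per key:

    // Sketch only: keep at most the n largest values per key while reducing,
    // so no full per-key collection is ever materialized.
    val n = 3
    scores.aggregateByKey(List.empty[Int])(
      (acc, v) => (v :: acc).sorted(Ordering[Int].reverse).take(n),  // fold one value into a partial top-n list
      (a, b) => (a ++ b).sorted(Ordering[Int].reverse).take(n)       // merge two partial top-n lists
    ).collect().foreach(println)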
    
  • 2020-12-30 08:58
     // Sum the scores per key, then print the 2 keys with the highest totals.
     scores.reduceByKey(_ + _).map(x => x._2 -> x._1).sortByKey(false).map(x => x._2 -> x._1).take(2).foreach(println)
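
     For what it's worth, on the sample data used in the last answer this prints the 2 keys with the highest total score, not the top 2 values per key:

     // assuming the sample data from the last answer:
     // a -> 1 + 2 + 3 + 4 + 6 = 16, b -> 3 + 1 + 4 + 2 + 8 = 18
     // (b,18)
     // (a,16)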
    
  • 2020-12-30 09:02

    Since Spark 1.4, there has been a built-in way to do this using MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala

    import org.apache.spark.mllib.rdd.MLPairRDDFunctions.fromPairRDD
    scores.topByKey(2)
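
    Continuing from the import above: topByKey(n) returns an RDD[(K, Array[V])] with each array sorted in descending order. Assuming the sample data used in the last answer, the result would look roughly like this:

    // print each key with its top-2 values
    scores.topByKey(2).collect().foreach { case (k, top) => println(k -> top.toList) }
    // (a,List(6, 4))
    // (b,List(8, 4))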
    
  • 2020-12-30 09:02

    I slightly modified your input data:

    val scores = sc.parallelize(Array(
          ("a", 1),  
          ("a", 2), 
          ("a", 3), 
          ("b", 3), 
          ("b", 1), 
          ("a", 4),  
          ("b", 4), 
          ("b", 2),
          ("a", 6),
          ("b", 8)
        ))
    

    I'll explain how to do it step by step:

    1. Group by key to create an array of values for each key

    scores.groupByKey().foreach(println)  
    

    Result:

    (b,CompactBuffer(3, 1, 4, 2, 8))
    (a,CompactBuffer(1, 2, 3, 4, 6))
    

    As you can see, each value is itself an array of numbers (CompactBuffer is just an optimised array).

    2. For each key, sort the list of numbers it contains in descending order

    scores.groupByKey().map({ case (k, numbers) => k -> numbers.toList.sorted(Ordering[Int].reverse)} ).foreach(println)
    

    Result:

    (b,List(8, 4, 3, 2, 1))
    (a,List(6, 4, 3, 2, 1))
    

    3. Keep only the first 2 elements from step 2; they will be the top 2 scores in the list

    scores.groupByKey().map({ case (k, numbers) => k -> numbers.toList.sorted(Ordering[Int].reverse).take(2)} ).foreach(println)
    

    Result:

    (a,List(6, 4))
    (b,List(8, 4))
    

    4. Flat-map to create a new pair RDD with one record per key and top score

    scores.groupByKey().map({ case (k, numbers) => k -> numbers.toList.sorted(Ordering[Int].reverse).take(2)} ).flatMap({case (k, numbers) => numbers.map(k -> _)}).foreach(println)
    

    Result:

    (b,8)
    (b,4)
    (a,6)
    (a,4)
    

    5. Optional step: sort by key (ascending) if you want

    scores.groupByKey().map({ case (k, numbers) => k -> numbers.toList.sorted(Ordering[Int].reverse).take(2)} ).flatMap({case (k, numbers) => numbers.map(k -> _)}).sortByKey().foreach(println)
    

    Result:

    (a,6)
    (a,4)
    (b,8)
    (b,4)
    

    Hope this explanation helps you understand the logic.
