take top N after groupBy and treat them as RDD


I'd like to get the top N items after groupByKey on an RDD and convert the type of topNPerGroup (below) to RDD[(String, Int)].
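
For reference, here is a minimal sketch of a setup consistent with the output shown in the answers below; the exact code from the question is not reproduced here, so the sample values and the use of spark-shell's sc are assumptions:

    import org.apache.spark.rdd.RDD

    // small sample data set (assumed); sc is the SparkContext, e.g. from spark-shell
    val data: RDD[(String, Int)] =
      sc.parallelize(Seq("foo" -> 1, "foo" -> 2, "foo" -> 3,
                         "bar" -> 4, "bar" -> 5, "bar" -> 6))

    // groupByKey, then keep the two largest values per key
    val topNPerGroup: RDD[(String, Seq[Int])] =
      data.groupByKey.mapValues(_.toSeq.sortBy(-_).take(2))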

4 Answers
  • 2020-12-10 08:17

    Spark 1.4.0 solves this problem.

    Take a look at https://github.com/apache/spark/commit/5e6ad24ff645a9b0f63d9c0f17193550963aa0a7

    It adds a topByKey method that uses a BoundedPriorityQueue with aggregateByKey:

    def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
      self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
        seqOp = (queue, item) => {
          queue += item
        },
        combOp = (queue1, queue2) => {
          queue1 ++= queue2
        }
      ).mapValues(_.toArray.sorted(ord.reverse))  // This is a min-heap, so we reverse the order.
    }
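
    A possible usage sketch, assuming Spark 1.4+ with MLlib on the classpath and the data RDD[(String, Int)] from the question:

    import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

    // RDD[(String, Array[Int])] holding the two largest values per key
    val top2PerKey = data.topByKey(2)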
    
  • 2020-12-10 08:23

    Your question is a little confusing, but I think this does what you're looking for:

    val flattenedTopNPerGroup = 
        topNPerGroup.flatMap({case (key, numbers) => numbers.map(key -> _)})
    

    and in the REPL it prints out what you want:

    flattenedTopNPerGroup.collect.foreach(println)
    (foo,3)
    (foo,2)
    (bar,6)
    (bar,5)
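
    An equivalent formulation, assuming topNPerGroup has the same shape, is flatMapValues, which keeps each key fixed and flattens only its values:

    val flattenedTopNPerGroup = topNPerGroup.flatMapValues(identity)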
    
  • 2020-12-10 08:23

    I've been struggling with this same issue recently, but my need was a little different: I needed the top K values per key for a data set like (key: Int, (domain: String, count: Long)). While your dataset is simpler, there is still a scaling/performance issue with groupByKey, as the documentation notes:

    When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.

    In my case I ran into problems very quickly because the Iterable in (K, Iterable<V>) was very large (over a million elements), so sorting and taking the top N became very expensive and created potential memory issues.

    After some digging (see the references below), here is a full example that uses combineByKey to accomplish the same task in a way that performs and scales.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    
    object TopNForKey {
    
      val SampleDataset = List(
        (1, ("apple.com", 3L)),
        (1, ("google.com", 4L)),
        (1, ("stackoverflow.com", 10L)),
        (1, ("reddit.com", 15L)),
        (2, ("slashdot.org", 11L)),
        (2, ("samsung.com", 1L)),
        (2, ("apple.com", 9L)),
        (3, ("microsoft.com", 5L)),
        (3, ("yahoo.com", 3L)),
        (3, ("google.com", 4L)))
    
      // sort and trim a traversable of (String, Long) tuples, keeping the top n by the count (_2) value
      def topNs(xs: TraversableOnce[(String, Long)], n: Int) = {
        var ss = List[(String, Long)]()
        var min = Long.MaxValue
        var len = 0
        xs foreach { e =>
          if (len < n || e._2 > min) {
            ss = (e :: ss).sortBy((f) => f._2)
            min = ss.head._2
            len += 1
          }
          if (len > n) {
            ss = ss.tail
            min = ss.head._2
            len -= 1
          }
        }
        ss
      }
    
      def main(args: Array[String]): Unit = {
    
        val topN = 2
        val sc = new SparkContext("local", "TopN For Key")
        val rdd = sc.parallelize(SampleDataset).map((t) => (t._1, t._2))
    
        //use combineByKey to allow spark to partition the sorting and "trimming" across the cluster
        val topNForKey = rdd.combineByKey(
          //seed a list for each key to hold your top N's with your first record
          (v) => List[(String, Long)](v),
          //add the incoming value to the accumulating top N list for the key
          (acc: List[(String, Long)], v) => topNs(acc ++ List((v._1, v._2)), topN).toList,
          //merge top N lists returned from each partition into a new combined top N list
          (acc: List[(String, Long)], acc2: List[(String, Long)]) => topNs(acc ++ acc2, topN).toList)
    
        //print results sorting for pretty
        topNForKey.sortByKey(true).foreach((t) => {
          println(s"key: ${t._1}")
          t._2.foreach((v) => {
            println(s"----- $v")
          })
    
        })
    
      }
    
    }
    

    And what I get in the resulting RDD:

    (1, List(("stackoverflow.com", 10L),
             ("reddit.com", 15L)))
    (2, List(("apple.com", 9L),
             ("slashdot.org", 11L)))
    (3, List(("google.com", 4L),
             ("microsoft.com", 5L)))
    

    References

    https://www.mail-archive.com/user@spark.apache.org/msg16827.html

    https://stackoverflow.com/a/8275562/807318

    http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

  • 2020-12-10 08:36

    Just use topByKey:

    import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
    import org.apache.spark.rdd.RDD
    
    val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)
    
    topTwo.collect.foreach(println)
    
    (foo,3)
    (foo,2)
    (bar,6)
    (bar,5)
    

    It is also possible to provide an alternative Ordering (not required here). For example, if you wanted the n smallest values:

    data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))
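
    Equivalently, assuming the default Int ordering, the reversed ordering from the standard library should also work:

    data.topByKey(2)(Ordering[Int].reverse)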
    