How to find max value in pair RDD?

攒了一身酷 2020-12-01 14:30

I have a spark pair RDD (key, count) as below

Array[(String, Int)] = Array((a,1), (b,2), (c,1), (d,3))

How to find the key with the highest count?

4 Answers
  • 2020-12-01 15:02

    Use takeOrdered(1)(Ordering[Int].reverse.on(_._2)):

    val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
    val rdd = sc.parallelize(a)
    val maxKey = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2))
    // maxKey: Array[(String, Int)] = Array((d,3))
    

    Quoting the note from RDD.takeOrdered:

    This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
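    The same Ordering can be sanity-checked locally on a plain Scala collection, without a cluster; sorted + take mirrors what takeOrdered(1) does (names here are illustrative):

    ```scala
    // Local Array standing in for the RDD's contents (no SparkContext needed).
    val pairs = Array(("a", 1), ("b", 2), ("c", 1), ("d", 3))

    // The same Ordering passed to takeOrdered: descending by the count (_._2).
    val byCountDesc = Ordering[Int].reverse.on[(String, Int)](_._2)

    // sorted + take(1) mirrors takeOrdered(1)(...) on a plain collection.
    val top = pairs.sorted(byCountDesc).take(1)
    // top: Array[(String, Int)] = Array((d,3))
    ```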

  • 2020-12-01 15:07

    Use Array.maxBy method:

    val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
    val maxKey = a.maxBy(_._2)
    // maxKey: (String, Int) = (d,3)
    

    or RDD.max (rdd here is the RDD built with sc.parallelize above):

    val maxKey2 = rdd.max()(new Ordering[(String, Int)]() {
      override def compare(x: (String, Int), y: (String, Int)): Int =
        Ordering[Int].compare(x._2, y._2)
    })
    // maxKey2: (String, Int) = (d,3)
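    The anonymous Ordering can also be written more concisely with Ordering.by, which builds an Ordering from a key-extractor function; a minimal local sketch (on a cluster this ordering would be passed as rdd.max()(byCount)):

    ```scala
    // Ordering.by builds an Ordering[(String, Int)] that compares on the count.
    val byCount: Ordering[(String, Int)] = Ordering.by(_._2)

    // Checked on a plain collection so it runs without Spark.
    val pairs = Array(("a", 1), ("b", 2), ("c", 1), ("d", 3))
    val maxPair = pairs.max(byCount)
    // maxPair: (String, Int) = (d,3)
    ```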
    
  • 2020-12-01 15:10

    For Pyspark:

    Let a be a pair RDD with String keys and integer values. Then

    a.max(key=lambda x: x[1])
    

    returns the key-value pair with the maximum value; max orders elements by the return value of the lambda function.

    Here a is a pair RDD with elements of the form ('key', int), and x[1] refers to the integer part of each element.

    Note that max without a key function orders by the full tuple, so pairs are compared by key first.

    Documentation is available at https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD.max

  • 2020-12-01 15:23

    Spark RDDs are more efficient when kept as RDDs rather than collected into arrays, since reduce runs in parallel on the executors:

    stringIntTupleRDD.reduce((x, y) => if (x._2 > y._2) x else y)
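    The same pairwise function can be checked on a local Scala collection (RDD.reduce applies it across partitions, so it must be commutative and associative, which a max-by-count comparison is, with ties broken arbitrarily):

    ```scala
    // Local stand-in for the RDD's contents; no SparkContext needed.
    val pairs = Array(("a", 1), ("b", 2), ("c", 1), ("d", 3))

    // Keep whichever pair has the larger count at each step.
    val maxPair = pairs.reduce((x, y) => if (x._2 > y._2) x else y)
    // maxPair: (String, Int) = (d,3)
    ```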
    