How to find max value in pair RDD?

攒了一身酷 2020-12-01 14:30

I have a spark pair RDD (key, count) as below

Array[(String, Int)] = Array((a,1), (b,2), (c,1), (d,3))

How to find the key with the highest count?

4 Answers
  • 2020-12-01 15:02

    Use takeOrdered(1)(Ordering[Int].reverse.on(_._2)):

    val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
    val rdd = sc.parallelize(a)
    val maxKey = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2))
    // maxKey: Array[(String, Int)] = Array((d,3))
    

    Quoting the note from RDD.takeOrdered:

    This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
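    The same Ordering can be sanity-checked locally on a plain Scala collection, without a cluster; sorted + take mirrors what takeOrdered(1) does (names here are illustrative):

    ```scala
    // Local Array standing in for the RDD's contents (no SparkContext needed).
    val pairs = Array(("a", 1), ("b", 2), ("c", 1), ("d", 3))

    // The same Ordering passed to takeOrdered: descending by the count (_._2).
    val byCountDesc = Ordering[Int].reverse.on[(String, Int)](_._2)

    // sorted + take(1) mirrors takeOrdered(1)(...) on a plain collection.
    val top = pairs.sorted(byCountDesc).take(1)
    // top: Array[(String, Int)] = Array((d,3))
    ```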

  • 2020-12-01 15:07

    Use Array.maxBy method:

    val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
    val maxKey = a.maxBy(_._2)
    // maxKey: (String, Int) = (d,3)
    

    or RDD.max (rdd here is the RDD built with sc.parallelize above):

    val maxKey2 = rdd.max()(new Ordering[(String, Int)]() {
      override def compare(x: (String, Int), y: (String, Int)): Int =
        Ordering[Int].compare(x._2, y._2)
    })
    // maxKey2: (String, Int) = (d,3)
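    The anonymous Ordering can also be written more concisely with Ordering.by, which builds an Ordering from a key-extractor function; a minimal local sketch (on a cluster this ordering would be passed as rdd.max()(byCount)):

    ```scala
    // Ordering.by builds an Ordering[(String, Int)] that compares on the count.
    val byCount: Ordering[(String, Int)] = Ordering.by(_._2)

    // Checked on a plain collection so it runs without Spark.
    val pairs = Array(("a", 1), ("b", 2), ("c", 1), ("d", 3))
    val maxPair = pairs.max(byCount)
    // maxPair: (String, Int) = (d,3)
    ```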
    
  • 2020-12-01 15:10

    For Pyspark:

    Let a be a pair RDD with String keys and integer values. Then

    a.max(key=lambda x: x[1])
    

    returns the key-value pair with the maximum value; max orders elements by the return value of the lambda function.

    Here a is a pair RDD with elements of the form ('key', int), and x[1] refers to the integer part of each element.

    Note that max without a key function orders by the full tuple, so pairs are compared by key first.

    Documentation is available at https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD.max

  • 2020-12-01 15:23

    Spark RDDs are more efficient when kept as RDDs rather than collected into arrays, since reduce runs in parallel on the executors:

    stringIntTupleRDD.reduce((x, y) => if (x._2 > y._2) x else y)
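    The same pairwise function can be checked on a local Scala collection (RDD.reduce applies it across partitions, so it must be commutative and associative, which a max-by-count comparison is, with ties broken arbitrarily):

    ```scala
    // Local stand-in for the RDD's contents; no SparkContext needed.
    val pairs = Array(("a", 1), ("b", 2), ("c", 1), ("d", 3))

    // Keep whichever pair has the larger count at each step.
    val maxPair = pairs.reduce((x, y) => if (x._2 > y._2) x else y)
    // maxPair: (String, Int) = (d,3)
    ```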
    