RDD split and do aggregation on new RDDs


情书的邮戳

I have an RDD of (String,String,Int).

I want to reduce it based on the first two strings,
and then, based on the first String, I want to group t


1 Answer

时光取名叫无心

2021-01-23 03:52
There are at least a few problems with the way you group your data. The first problem is introduced by
```
 mapValues(x => ArrayBuffer(x))
```
It creates a large number of mutable objects which provide no additional value, since you cannot leverage their mutability in the subsequent reduceByKey
```
reduceByKey((x, y) => x ++ y) 
```
where each ++ creates a new collection and neither argument can be safely mutated. Since reduceByKey applies map-side aggregation, the situation is even worse and pretty much creates GC hell.
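To make the allocation point concrete, here is a minimal plain-Scala sketch (no Spark involved). The object name `ConcatSketch` and the sample values are made up for illustration; `merge` simply mirrors the `(x, y) => x ++ y` reduce function.

```scala
import scala.collection.mutable.ArrayBuffer

object ConcatSketch {
  // Same shape as the reduce function above: (x, y) => x ++ y
  def merge(x: ArrayBuffer[Int], y: ArrayBuffer[Int]): ArrayBuffer[Int] = x ++ y

  def main(args: Array[String]): Unit = {
    val x = ArrayBuffer(1, 2)
    val y = ArrayBuffer(3)
    val z = merge(x, y)

    // z is a third, freshly allocated buffer; x and y are left untouched,
    // so their mutability is never actually used.
    println(s"z is x: ${z eq x}, z is y: ${z eq y}")
    println(s"x = $x, y = $y, z = $z")
  }
}
```

Every merge therefore produces one more short-lived buffer on top of the one-element buffers created by mapValues.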

Is there a way to achieve this more efficiently?

Unless you have some deeper knowledge about the data distribution which can be used to define a smarter partitioner, the simplest improvement is to replace mapValues + reduceByKey with a simple groupByKey:
```
val r3 = r2.groupByKey
```
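As a rough check that the replacement keeps the same groups, the sketch below runs both variants on a small made-up data set in local mode. It assumes Spark is on the classpath; the object name `GroupByKeySketch`, the sample records, and the `local[2]` master are illustrative assumptions, not part of the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

object GroupByKeySketch {
  // Returns the grouped values per key for both variants, as sets for easy comparison.
  def compare(sc: SparkContext): (Map[String, Set[(String, Int)]], Map[String, Set[(String, Int)]]) = {
    // Hypothetical stand-in for r2: (first string, (second string, sum))
    val r2 = sc.parallelize(Seq(("a", ("x", 3)), ("a", ("y", 1)), ("b", ("x", 2))))

    // Variant from the question: wrap every value, then concatenate buffers.
    val viaReduce = r2.mapValues(x => ArrayBuffer(x)).reduceByKey((x, y) => x ++ y)
    // Suggested replacement: let Spark do the grouping directly.
    val viaGroup = r2.groupByKey

    (viaReduce.mapValues(_.toSet).collect().toMap,
     viaGroup.mapValues(_.toSet).collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("groupByKey-sketch"))
    try {
      val (left, right) = compare(sc)
      println(s"same groups: ${left == right}")
    } finally sc.stop()
  }
}
```

The values are compared as sets because neither variant guarantees the order of elements within a group.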
It should also be possible to use a custom partitioner for the reduceByKey call, and mapPartitions with preservesPartitioning instead of map, so that the subsequent groupByKey can reuse the existing partitioning.
```
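import org.apache.spark.Partitioner

// Partitions composite (first, second) keys by the first element only.
class FirstElementPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = {
    // ## (hash code) may be negative, so normalise the result into [0, numPartitions).
    val raw = key.asInstanceOf[(Any, Any)]._1.## % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }
}

val r2 = r1
  .reduceByKey(new FirstElementPartitioner(8), (x, y) => x + y)
  // Re-key by the first element while keeping the partitioner.
  .mapPartitions(
    iter => iter.map(x => (x._1._1, (x._1._2, x._2))),
    preservesPartitioning = true)

// No shuffle required here.
val r3 = r2.groupByKey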
```
This requires only a single shuffle, and groupByKey is simply a local operation:
```
r3.toDebugString
// (8) MapPartitionsRDD[41] at groupByKey at <console>:37 []
//  |  MapPartitionsRDD[40] at mapPartitions at <console>:35 []
//  |  ShuffledRDD[39] at reduceByKey at <console>:34 []
//  +-(8) MapPartitionsRDD[1] at map at <console>:28 []
//     |  ParallelCollectionRDD[0] at parallelize at <console>:26 []
```
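Besides reading the lineage, the same property can be checked programmatically. The sketch below is one way to do it, assuming Spark is on the classpath and running in local mode; `ByFirstElement` is a hypothetical stand-in for the partitioner above (with a non-negative modulo), and the sample records are made up.

```scala
import org.apache.spark.{Partitioner, ShuffleDependency, SparkConf, SparkContext}

// Hypothetical stand-in for the answer's partitioner: partitions (first, second) keys by the first element.
class ByFirstElement(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = {
    val raw = key.asInstanceOf[(Any, Any)]._1.## % numPartitions
    if (raw < 0) raw + numPartitions else raw // keep the result in [0, numPartitions)
  }
}

object ShuffleCheckSketch {
  // Returns (does r2 still carry a partitioner?, does groupByKey add a shuffle dependency?)
  def check(sc: SparkContext): (Boolean, Boolean) = {
    val r1 = sc.parallelize(Seq((("a", "x"), 1), (("a", "x"), 2), (("a", "y"), 1), (("b", "x"), 5)))
    val r2 = r1
      .reduceByKey(new ByFirstElement(8), (x, y) => x + y)
      .mapPartitions(
        iter => iter.map(x => (x._1._1, (x._1._2, x._2))),
        preservesPartitioning = true)
    val r3 = r2.groupByKey

    // A direct ShuffleDependency on r3 would mean groupByKey triggered a second shuffle.
    val shuffles = r3.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])
    (r2.partitioner.isDefined, shuffles)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("shuffle-check-sketch"))
    try println(check(sc)) finally sc.stop()
  }
}
```

Note that the partitioner is only ever applied to the original tuple keys here. Since groupByKey reuses the existing partitioning, getPartition is not called on the new single-element keys, where the cast to a pair would fail.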