RDD split and do aggregation on new RDDs

情书的邮戳 2021-01-23 03:35

I have an RDD of (String,String,Int).

  1. I want to reduce it based on the first two strings
  2. And then, based on the first String, I want to group t
1 answer
  •  时光取名叫无心
    2021-01-23 03:52

    There are at least a few problems with the way you group your data. The first problem is introduced by

     mapValues(x => ArrayBuffer(x))
    

    It creates a large number of mutable objects that provide no additional value, since you cannot leverage their mutability in the subsequent reduceByKey

    reduceByKey((x, y) => x ++ y) 
    

    where each ++ creates a new collection and neither argument can be safely mutated. Because reduceByKey applies map-side aggregation, the situation is even worse and pretty much creates GC hell.
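
The allocation behaviour described above can be checked without Spark. The following is a minimal sketch in plain Scala; the object name `ConcatDemo` and the helper `merge` are hypothetical and only mirror the shape of the function passed to reduceByKey. It shows that `++` returns a fresh buffer and leaves its left operand untouched, so every merge step allocates.

```scala
import scala.collection.mutable.ArrayBuffer

object ConcatDemo {
  // Same shape as the reducer in the text: (x, y) => x ++ y
  def merge(x: ArrayBuffer[Int], y: ArrayBuffer[Int]): ArrayBuffer[Int] = x ++ y

  def main(args: Array[String]): Unit = {
    val x = ArrayBuffer(1)
    val y = ArrayBuffer(2)
    val merged = merge(x, y)

    println(merged)      // ArrayBuffer(1, 2)
    println(merged eq x) // false: the result is a new object
    println(x)           // ArrayBuffer(1): the left operand was not mutated
  }
}
```

Since neither input is reused, wrapping each value in an ArrayBuffer first buys nothing over letting groupByKey build the groups itself.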

    Is there a way to achieve this more efficiently?

    Unless you have some deeper knowledge about the data distribution that can be used to define a smarter partitioner, the simplest improvement is to replace mapValues + reduceByKey with a simple groupByKey:

    val r3 = r2.groupByKey
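
To make the intended result of the two steps concrete, here is a small model using ordinary Scala collections instead of RDDs. It is only a sketch of the semantics (sum per pair of strings, then group by the first string), not of how Spark executes it. The sample rows and the names `TwoStepAggregation`, `reduceByPair` and `groupByFirst` are made up for illustration.

```scala
object TwoStepAggregation {
  // Hypothetical sample rows of (String, String, Int).
  val rows: Seq[(String, String, Int)] = Seq(
    ("a", "x", 1), ("a", "x", 2), ("a", "y", 3), ("b", "x", 4)
  )

  // Step 1: sum the Int per (first, second) pair, analogous to reduceByKey(_ + _).
  def reduceByPair(in: Seq[(String, String, Int)]): Map[(String, String), Int] =
    in.groupBy(t => (t._1, t._2)).map { case (k, ts) => k -> ts.map(_._3).sum }

  // Step 2: re-key by the first string and collect (second, sum), analogous to groupByKey.
  def groupByFirst(reduced: Map[(String, String), Int]): Map[String, Set[(String, Int)]] =
    reduced.toSeq
      .groupBy { case ((first, _), _) => first }
      .map { case (first, entries) =>
        first -> entries.map { case ((_, second), sum) => (second, sum) }.toSet
      }

  def main(args: Array[String]): Unit =
    println(groupByFirst(reduceByPair(rows)))
}
```

For the sample rows, key "a" ends up with the pairs ("x", 3) and ("y", 3), and key "b" with ("x", 4).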
    

    It should also be possible to use a custom partitioner for both reduceByKey calls, and mapPartitions with preservesPartitioning instead of map.

    class FirsElementPartitioner(partitions: Int)
        extends org.apache.spark.Partitioner {
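      def numPartitions: Int = partitions
      // Partition on the first element of the (first, second) key only.
      // floorMod keeps the result non-negative when the hash code is negative.
      def getPartition(key: Any): Int = {
        Math.floorMod(key.asInstanceOf[(Any, Any)]._1.##, numPartitions)
      }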
    }
    
    val r2 = r1
      .reduceByKey(new FirsElementPartitioner(8), (x, y) => x + y)
      .mapPartitions(iter => iter.map(x => ((x._1._1), (x._1._2, x._2))), true)
    
    // No shuffle required here.
    val r3 = r2.groupByKey
    

    This requires only a single shuffle, and groupByKey is simply a local operation:

    r3.toDebugString
    // (8) MapPartitionsRDD[41] at groupByKey at <console>:37 []
    //  |  MapPartitionsRDD[40] at mapPartitions at <console>:35 []
    //  |  ShuffledRDD[39] at reduceByKey at <console>:34 []
    //  +-(8) MapPartitionsRDD[1] at map at <console>:28 []
    //     |  ParallelCollectionRDD[0] at parallelize at <console>:26 []
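
The second step can stay local because the partition index depends only on the first element of the key, so all pairs sharing that element already sit in the same partition after the first shuffle. The sketch below models that in plain Scala, without Spark. `FirstElementPartitioning` and `partitionOf` are hypothetical stand-ins for the partitioner's getPartition. Using `Math.floorMod` rather than plain `%` is an assumption made here so that negative hash codes still yield a valid index.

```scala
object FirstElementPartitioning {
  // Stand-in for getPartition: only the first element of the key is hashed.
  // floorMod keeps the index in [0, numPartitions) even for negative hash codes.
  def partitionOf(key: (Any, Any), numPartitions: Int): Int =
    Math.floorMod(key._1.##, numPartitions)

  def main(args: Array[String]): Unit = {
    val keys = Seq(("a", "x"), ("a", "y"), ("b", "x"), ("b", "z"))
    // Keys with the same first element always print the same partition index.
    keys.foreach(k => println(s"$k -> partition ${partitionOf(k, 8)}"))
  }
}
```

With plain `%`, a first element whose hash code is negative (for example the Int -7) would produce a negative index, which is not a valid partition number.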
    
