Spark: groupBy taking a lot of time

别那么骄傲 2021-01-07 07:28

In my application, when taking performance numbers, groupBy is eating away a lot of time.

My RDD has the below structure:

JavaPairRDD

        
2 Answers
  • 2021-01-07 08:11

    Shuffling is triggered by any change in the key of a [K,V] pair, or by a repartition() call. The partitioning is calculated based on the K (key) value. By default, partitioning is calculated using the hash value of your key, implemented by the hashCode() method. In your case your key contains two Map instance variables. The default implementation of hashCode() has to calculate the hashCode() of those maps as well, causing an iteration over all their elements to, in turn, calculate the hashCode() of each element.

    The solutions are:

    1. Do not include the Map instances in your key. Keying on maps is highly unusual anyway.
    2. Implement and override your own hashCode() that avoids going through the Map instance variables, as sketched in the code after this list.
    3. Possibly you can avoid using the Map objects completely. If they hold something that is shared amongst multiple elements, consider using broadcast variables in Spark. The overhead of serializing your maps during shuffling might also be a big contributing factor.
    4. Avoid any shuffling by tuning your hashing between two consecutive group-bys.
    5. Keep shuffling node-local by choosing a Partitioner that has an affinity for keeping partitions local during consecutive use.
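
    A minimal sketch of option 2 (the key class, its field names, and the field types are all hypothetical, since the question does not show the actual key class): derive hashCode() from a cheap field only, so the shuffle never has to walk the maps.

        import java.io.Serializable;
        import java.util.Map;
        import java.util.Objects;

        // Hypothetical key class: assumes the key holds two Map fields plus a
        // cheap identifier field. Names are illustrative, not from the question.
        public class CustomKey implements Serializable {
            private final String id;                 // cheap, stable field
            private final Map<String, Long> mapA;    // expensive to hash
            private final Map<String, Long> mapB;    // expensive to hash

            public CustomKey(String id, Map<String, Long> mapA, Map<String, Long> mapB) {
                this.id = id;
                this.mapA = mapA;
                this.mapB = mapB;
            }

            // Partitioning only needs hashCode(); derive it from the cheap field
            // so the shuffle never iterates over the maps.
            @Override
            public int hashCode() {
                return Objects.hashCode(id);
            }

            // equals() may still compare everything; equal keys keep equal hash
            // codes because equal keys necessarily have equal ids.
            @Override
            public boolean equals(Object o) {
                if (this == o) return true;
                if (!(o instanceof CustomKey)) return false;
                CustomKey other = (CustomKey) o;
                return Objects.equals(id, other.id)
                        && Objects.equals(mapA, other.mapA)
                        && Objects.equals(mapB, other.mapB);
            }
        }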

    Good reading on hashCode(), including a reference to quotes by Josh Bloch, can be found on Wikipedia.

  • 2021-01-07 08:13

    Spark's documentation encourages you to avoid groupBy operations; instead it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey). These operations aggregate both before and after the shuffle (in the Map phase and in the Reduce phase, to use Hadoop terminology), so your execution times will improve (I don't know whether it will be 10 times better, but it should be better).
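
    To illustrate the difference, here is a minimal sketch assuming a JavaPairRDD<String, Long> (the asker's actual value type is not shown in the question): reduceByKey pre-aggregates on the map side before the shuffle, while groupByKey ships every single value across the network.

        import org.apache.spark.api.java.JavaPairRDD;

        public class GroupByVsReduceByKey {

            // groupByKey shuffles every value, then sums on the reduce side.
            static JavaPairRDD<String, Long> slowSum(JavaPairRDD<String, Long> pairs) {
                return pairs.groupByKey().mapValues(values -> {
                    long sum = 0L;
                    for (Long v : values) sum += v;
                    return sum;
                });
            }

            // reduceByKey pre-aggregates within each partition before the
            // shuffle, so far less data crosses the network.
            static JavaPairRDD<String, Long> fastSum(JavaPairRDD<String, Long> pairs) {
                return pairs.reduceByKey(Long::sum);
            }
        }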

    If I understand your processing correctly, I think you can use a single combineByKey operation. The following explanation is written for Scala, but you can translate it to Java without too much effort.

    combineByKey takes three arguments: combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]

    • createCombiner: In this operation you create a new class in which to combine your data, so you could aggregate your CustomTuple data into a new class, CustomTupleCombiner (I don't know whether you only want to compute a sum or apply some other processing to this data, but either can be done in this operation).

    • mergeValue: In this operation you describe how a CustomTuple is added to a CustomTupleCombiner (again, I am presupposing a simple summing operation). For example, if you want to sum the data by key, your CustomTupleCombiner class will hold a Map, and the operation would be something like CustomTupleCombiner.sum(CustomTuple), which does CustomTupleCombiner.Map(CustomTuple.key) -> CustomTuple.Map(CustomTuple.key) + CustomTupleCombiner.value.

    • mergeCombiners: In this operation you define how to merge two combiner instances, CustomTupleCombiner in my example. So this will be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), which does something like CustomTupleCombiner1.Map.keys.foreach( k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k) ).

    The pasted code is untested (it might not even compile, because I wrote it in vim), but I think it could work for your scenario.
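
    Since the original snippet does not appear above, here is a rough Java sketch of the same three steps (the value type Map<String, Long> and the SumCombiner class are assumptions made for illustration, not the asker's actual classes):

        import java.io.Serializable;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.spark.api.java.JavaPairRDD;
        import org.apache.spark.api.java.function.Function;
        import org.apache.spark.api.java.function.Function2;

        public class CombineByKeyExample {

            // Hypothetical combiner: keeps a running Map of sums per inner key.
            public static class SumCombiner implements Serializable {
                final Map<String, Long> sums = new HashMap<>();

                // Fold one value's map into the running sums (mergeValue step).
                SumCombiner add(Map<String, Long> value) {
                    value.forEach((k, v) -> sums.merge(k, v, Long::sum));
                    return this;
                }

                // Merge two partial combiners (mergeCombiners step).
                SumCombiner merge(SumCombiner other) {
                    other.sums.forEach((k, v) -> sums.merge(k, v, Long::sum));
                    return this;
                }
            }

            // 'rdd' stands in for the asker's JavaPairRDD; a Map<String, Long>
            // value type is assumed purely for illustration.
            static JavaPairRDD<String, SumCombiner> aggregate(
                    JavaPairRDD<String, Map<String, Long>> rdd) {
                Function<Map<String, Long>, SumCombiner> createCombiner =
                        value -> new SumCombiner().add(value);
                Function2<SumCombiner, Map<String, Long>, SumCombiner> mergeValue =
                        (combiner, value) -> combiner.add(value);
                Function2<SumCombiner, SumCombiner, SumCombiner> mergeCombiners =
                        (c1, c2) -> c1.merge(c2);
                return rdd.combineByKey(createCombiner, mergeValue, mergeCombiners);
            }
        }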

    I hope this is useful.
