How to (equally) partition array-data in spark dataframe

Asked by 隐瞒了意图╮ on 2020-12-17 04:08

I have a dataframe of the following form:

import scala.util.Random

// array lengths follow |N(0, 100)|, so row sizes are heavily skewed
val localData = (1 to 100).map(i => (i, Seq.fill(Math.abs(Random.nextGaussian()*100).toInt)(Random.nextDouble)))
val df = sc.parallelize(localData).toDF("id", "data")


        
1 Answer
  • 2020-12-17 04:14

    As you said, you can increase the number of partitions. I usually use a multiple of the number of cores: the Spark context's default parallelism * 2-3.
    In your case, you could use a bigger multiplier.
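
    A minimal sketch of that approach (the multiplier of 3 is an arbitrary example, not a value from the answer):

    // target a small multiple of the cluster's default parallelism;
    // more partitions than cores keeps executors busy despite uneven row sizes
    val targetPartitions = spark.sparkContext.defaultParallelism * 3
    val repartitioned = df.repartition(targetPartitions)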

    Another solution would be to split your df in two by filtering:

    • a df with only the bigger arrays
    • a df with the rest

    You could then repartition each of them, perform the computation, and union them back, as sketched below.
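
    A minimal sketch of the split-and-union idea, assuming a threshold of 150 elements and the partition counts shown (both are illustrative values, not from the answer):

    import org.apache.spark.sql.functions.{col, size}

    // split on array length; tune the threshold to your data
    val bigArrays   = df.filter(size(col("data")) >= 150)
    val smallArrays = df.filter(size(col("data")) < 150)

    // spread the heavy rows over more partitions, then stitch the halves back
    val result = bigArrays.repartition(64)
      .union(smallArrays.repartition(16))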

    Beware that repartitioning may be expensive, since you have large rows to shuffle around.

    You could have a look at these slides (slide 27 onward): https://www.slideshare.net/SparkSummit/custom-applications-with-sparks-rdd-spark-summit-east-talk-by-tejas-patil

    They were experiencing very bad data skew and had to handle it in an interesting way.
