Apache Spark Handling Skewed Data

Asked by 轻奢々 on 2020-12-30 09:20 · 1 answer · 365 views

I have two tables I would like to join together. One of them has very badly skewed data. This is causing my Spark job to not run in parallel, as the majority of the work is done on a single partition.

1 Answer
  • 2020-12-30 10:00

    Yes: salt the keys of the larger table (by appending a random value), then replicate the smaller table by cartesian-joining it against the set of salt values so it matches the new salted keys:

    Here are a couple of suggestions:

    Tresata skew join RDD https://github.com/tresata/spark-skewjoin

    Python skew join: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

    The Tresata library looks like this:

    import com.tresata.spark.skewjoin.Dsl._  // for the implicits

    // the skewJoin() method is pulled in by the implicits
    rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2),
      DefaultSkewReplication(1)).sortByKey(true).collect.toList
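
    The salting idea itself can be sketched without Spark. Here is a minimal plain-Python illustration (the function name, `n_salts` parameter, and test data are all hypothetical, not part of the Tresata API): each row of the large side gets a random salt appended to its key, each row of the small side is replicated once per possible salt value, and an ordinary hash join on the salted keys then spreads a hot key over `n_salts` buckets.

    ```python
    import random
    from collections import defaultdict

    def salted_join(big, small, n_salts=3, seed=0):
        """Join two lists of (key, value) pairs using key salting.

        `big` is the skewed side: each row's key gets a random salt in
        [0, n_salts). `small` is replicated once per salt value so every
        salted key on the big side finds its match.
        """
        rng = random.Random(seed)
        # Salt the large side: (key, salt) spreads one hot key over n_salts buckets.
        salted_big = [((k, rng.randrange(n_salts)), v) for k, v in big]
        # Replicate the small side to every possible salt value.
        salted_small = [((k, s), v) for k, v in small for s in range(n_salts)]
        # Ordinary hash join on the salted keys; emit results under the original key.
        index = defaultdict(list)
        for k, v in salted_small:
            index[k].append(v)
        return [(k, (v, w)) for (k, s), v in salted_big for w in index[(k, s)]]
    ```

    In Spark the same trick would distribute the hot key's rows across `n_salts` partitions instead of piling them all onto one; the cost is replicating the small table `n_salts` times, which is why this only pays off when one side is much smaller than the other.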
    