Partitioning a large skewed dataset in S3 with Spark's partitionBy method


I am trying to write out a large partitioned dataset to disk with Spark, and the partitionBy approach is struggling with both of the methods I've tried.

1 Answer
  • 2020-12-14 12:04

    The simplest solution is to add one or more columns to the repartition call and explicitly set the number of partitions.

    val numPartitions = ???
    
    df.repartition(numPartitions, $"some_col", $"some_other_col")
      .write.partitionBy("some_col")
      .parquet("partitioned_lake")
    

    where:

    • numPartitions - should be an upper bound (the actual number can be lower) on the desired number of files written to a single partition directory; a rough way to estimate it is sketched after this list.
    • $"some_other_col" (and optional additional columns) should have high cardinality and be independent of $"some_col" (there should be no functional dependency between the two, and they shouldn't be highly correlated).

      If the data doesn't contain such a column, you can use org.apache.spark.sql.functions.rand.

      import org.apache.spark.sql.functions.rand
      
      df.repartition(numPartitions, $"some_col", rand)
        .write.partitionBy("some_col")
        .parquet("partitioned_lake")
      