I am trying to write out a large partitioned dataset to disk with Spark and the partitionBy
algorithm is struggling with both of the approaches I've tried.
The simplest solution is to add one or more columns to repartition
and explicitly set the number of partitions.
val numPartitions = ???
df.repartition(numPartitions, $"some_col", $"some_other_col")
  .write.partitionBy("some_col")
  .parquet("partitioned_lake")
where:

numPartitions - should be an upper bound (the actual number of files can be lower) on the desired number of files written to a partition directory.

$"some_other_col" (and optional additional columns) - should have high cardinality and be independent of $"some_col" (there should be no functional dependency between the two, and they shouldn't be highly correlated).

If the data doesn't contain such a column, you can use o.a.s.sql.functions.rand:
import org.apache.spark.sql.functions.rand
df.repartition(numPartitions, $"some_col", rand())
  .write.partitionBy("some_col")
  .parquet("partitioned_lake")
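
For completeness, here is a minimal, self-contained sketch of the rand-based approach on a small, artificially skewed dataset. The local master, the /tmp output path, the toy column values, and the choice of numPartitions = 4 are all illustrative assumptions, not part of the original setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object PartitionedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-write-sketch")
      .master("local[*]") // local mode, for the demo only
      .getOrCreate()
    import spark.implicits._

    // Toy skewed dataset: roughly 90% "a", 9% "b", 1% "c".
    val df = (1 to 100000)
      .map(i => (if (i % 100 == 0) "c" else if (i % 10 == 0) "b" else "a", i))
      .toDF("some_col", "value")

    // Upper bound on the number of files written per partition directory.
    val numPartitions = 4

    df.repartition(numPartitions, $"some_col", rand())
      .write
      .mode("overwrite")
      .partitionBy("some_col")
      .parquet("/tmp/partitioned_lake")

    spark.stop()
  }
}

With numPartitions = 4, each of the some_col=a, some_col=b, and some_col=c directories ends up with at most four Parquet files, regardless of how skewed the key distribution is.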