I am using repartition on columns to store the data in parquet. But I see that the no. of parquet partitioned files are not the same as the no. of RDD partitions. Is there no correlation between RDD partitions and parquet partitions?
There are a couple of things you're asking about here - partitioning, bucketing and balancing of data.
Partitioning:
In Spark, this is done by df.write.partitionBy(column*), which groups data by the partitioning columns into the same sub-directory.
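For illustration, here is a minimal sketch (assuming a SparkSession named spark and a small made-up sales DataFrame) of how partitionBy lays the files out:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioning-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical example data
val sales = Seq(
  ("US", "2021-01-01", 100.0),
  ("US", "2021-01-02", 250.0),
  ("DE", "2021-01-01", 80.0)
).toDF("country", "date", "amount")

// Each distinct country value becomes its own sub-directory:
//   /tmp/sales_partitioned/country=US/part-*.parquet
//   /tmp/sales_partitioned/country=DE/part-*.parquet
sales.write
  .mode("overwrite")
  .partitionBy("country")
  .parquet("/tmp/sales_partitioned")
```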
Bucketing:
Comparable to Hive's Distribute By. In Spark, this is done by df.write.bucketBy(n, column*), which groups data by the bucketing columns into the same file; the number of files generated is controlled by n.
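A sketch of bucketing the same hypothetical sales DataFrame from the previous snippet; note that in Spark, bucketed output has to be written with saveAsTable rather than a plain path-based write:

```scala
// Rows are hashed on `country` into 4 buckets (files) per partition;
// the bucket count 4 here is the n from bucketBy(n, column*).
sales.write
  .mode("overwrite")
  .bucketBy(4, "country")
  .sortBy("date")               // optional: keep each bucket sorted
  .format("parquet")
  .saveAsTable("sales_bucketed")
```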
Repartition:
In Spark, this is done by df.repartition(n, column*), which returns a new DataFrame balanced evenly across the given number of internal partitions based on the given partitioning expressions. The resulting DataFrame is hash partitioned. Note that no data is persisted to storage; this is just internal balancing of data, based on constraints similar to bucketBy.
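A sketch of repartition on the same hypothetical DataFrame, showing that it only changes the internal partitioning until something is explicitly written:

```scala
import org.apache.spark.sql.functions.{col, spark_partition_id}

// Hash-partition into 8 internal partitions by country; nothing is written to storage.
val balanced = sales.repartition(8, col("country"))
println(balanced.rdd.getNumPartitions)                  // 8
balanced.groupBy(spark_partition_id()).count().show()   // rows per internal partition
```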
Tl;dr
1) I am using repartition on columns to store the data in parquet. But I see that the no. of parquet partitioned files are not the same as the no. of RDD partitions. Is there no correlation between RDD partitions and parquet partitions?

Repartition correlates with bucketBy, not with partitionBy. The number of partitioned files written is governed by other configs such as spark.sql.shuffle.partitions and spark.default.parallelism.
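As a sketch (paths and values are illustrative), the file count under each partitionBy directory follows the DataFrame's partitioning at write time, which these configs control:

```scala
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.shuffle.partitions", "200")

// repartition on a column (with no explicit n) takes its partition count from
// spark.sql.shuffle.partitions; each resulting partition that holds rows for a country
// writes its own file under that country's directory, so the number of parquet files
// need not equal the RDD partition count you started with.
sales
  .repartition(col("country"))
  .write
  .mode("overwrite")
  .partitionBy("country")
  .parquet("/tmp/sales_by_country")
```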
2) When I write the data to a parquet partition using RDD repartition and then read the data back from the parquet partition, is there any condition where the RDD partition numbers will be the same during read / write?

At read time, the number of partitions is determined by settings such as spark.default.parallelism, not by how the data was repartitioned before the write, so in general the counts will not match.
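A quick way to check this yourself (again a sketch, reusing the hypothetical path from above):

```scala
// The partition count of the read-back DataFrame is driven by input file sizes and
// settings like spark.default.parallelism / spark.sql.files.maxPartitionBytes, not by
// the repartition(...) you applied before writing.
val readBack = spark.read.parquet("/tmp/sales_by_country")
println(readBack.rdd.getNumPartitions)
println(spark.sparkContext.defaultParallelism)
```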
3) How is bucketing a dataframe using a column id different from repartitioning a dataframe via the same column id?

They work similarly, except that bucketing is a write operation and persists the data to files, whereas repartition only balances the data internally and writes nothing to storage (see above).
4) While considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?

repartition is the better choice when both datasets are in memory; if one or both of the datasets are persisted, then look into bucketBy as well.
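A sketch contrasting the two for a join on a key column id (the table names here are hypothetical): repartition pays the shuffle cost once per job while the datasets are in memory, whereas bucketing both sides the same way lets repeated joins on persisted tables skip the shuffle.

```scala
import org.apache.spark.sql.functions.col

// In-memory / one-off join: shuffle both sides on the join key up front.
val orders    = spark.table("orders").repartition(200, col("id"))
val customers = spark.table("customers").repartition(200, col("id"))
val joinedInMemory = orders.join(customers, "id")

// Persisted / repeated joins: bucket both sides identically on the join key once;
// later joins between the bucketed tables can avoid re-shuffling.
spark.table("orders").write
  .bucketBy(200, "id").sortBy("id")
  .format("parquet").saveAsTable("orders_bkt")
spark.table("customers").write
  .bucketBy(200, "id").sortBy("id")
  .format("parquet").saveAsTable("customers_bkt")
val joinedBucketed = spark.table("orders_bkt").join(spark.table("customers_bkt"), "id")
```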