Spark DataFrames with Parquet and Partitioning

Asked by 余生分开走 on 2021-01-05 08:48

I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that is 10 blocks; Spark will naturally create 10 partitions.

2 Answers
  • 2021-01-05 09:12

    Spark DataFrames don't load Parquet files into memory. Spark uses the Hadoop/HDFS API to read them during each operation, so the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).

    Spark 1.5 DataFrames partition a Parquet file as follows:

    • 1 partition per HDFS block
    • If the HDFS block size is smaller than the Parquet block size configured in Spark, one partition is created from multiple HDFS blocks, such that the total size of the partition is no less than the Parquet block size (see the sketch below)
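
    A minimal sketch of this, assuming the Spark 2.x SparkSession API (rather than the Spark 1.5 API discussed above) and a placeholder output path and row count: inspect the two block-size settings in the Hadoop configuration, write a file with an explicit Parquet block size, then check how many input partitions Spark creates when reading it back.

    ```scala
    import org.apache.spark.sql.SparkSession

    object ParquetBlockSizeDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parquet-block-size-demo")
          .getOrCreate()

        // HDFS block size and Parquet row-group ("block") size are separate
        // settings; both live in the Hadoop configuration.
        val hadoopConf = spark.sparkContext.hadoopConfiguration
        println("dfs.blocksize      = " + hadoopConf.get("dfs.blocksize"))
        println("parquet.block.size = " + hadoopConf.get("parquet.block.size"))

        // Ask the Parquet writer for 128 MB row groups (value is in bytes),
        // then write some data. The path is a placeholder.
        hadoopConf.setLong("parquet.block.size", 128L * 1024 * 1024)
        spark.range(0L, 10000000L)
          .write
          .mode("overwrite")
          .parquet("/tmp/demo_parquet")

        // Read it back and see how many partitions Spark derives; on HDFS
        // this follows the HDFS block boundaries, per the rules above.
        val df = spark.read.parquet("/tmp/demo_parquet")
        println(s"Input partitions: ${df.rdd.getNumPartitions}")

        spark.stop()
      }
    }
    ```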
  • 2021-01-05 09:25

    I saw the other answer, but I think I can clarify this further. If you are reading Parquet from a POSIX filesystem, you can increase the number of parallel partition reads simply by adding more workers in Spark.

    But to control how data is balanced across the workers, you can use the hierarchical directory structure of the Parquet files and then point each worker at specific partitions or parts of the Parquet dataset. This gives you control over how much data goes to each worker based on the domain of your dataset, which matters when handing every worker an equal-sized batch of data is not the efficient way to balance the load.
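
    As a rough illustration (the path and the "region" partition column are hypothetical, and this uses the Spark 2.x API), writing with partitionBy produces the hierarchical directory layout mentioned above, and a worker or job can then be pointed at just the partition directories it should handle:

    ```scala
    import org.apache.spark.sql.SparkSession

    object PartitionedParquetSlices {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partitioned-parquet-slices")
          .getOrCreate()
        import spark.implicits._

        // Write a dataset partitioned by "region"; each value becomes a
        // subdirectory such as /tmp/sales/region=EU/ (placeholder path).
        Seq((1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 30.0))
          .toDF("id", "region", "amount")
          .write
          .mode("overwrite")
          .partitionBy("region")
          .parquet("/tmp/sales")

        // Point at a single partition directory: only that slice is read.
        val euOnly = spark.read.parquet("/tmp/sales/region=EU")
        euOnly.show()

        // Or read the whole dataset and rely on partition pruning to skip
        // the directories that the filter rules out.
        val usOnly = spark.read.parquet("/tmp/sales").filter($"region" === "US")
        usOnly.show()

        spark.stop()
      }
    }
    ```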
