Spark DataFrames with Parquet and Partitioning

余生分开走 2021-01-05 08:48

I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that spans 10 blocks; Spark will naturally create 10 partitions.

2 Answers
  •  生来不讨喜
    2021-01-05 09:12

    A Spark DataFrame doesn't load Parquet files into memory. It uses the Hadoop/HDFS API to read them during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
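
    A quick way to see this is to read a Parquet file and check how many partitions the underlying RDD has. The following is only a minimal sketch against the Spark 1.5-era API; the HDFS path is hypothetical.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object InspectParquetPartitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("inspect-parquet-partitions"))
        val sqlContext = new SQLContext(sc)

        // Reading Parquet does not pull the file into memory; every action on
        // the DataFrame re-reads the data through the Hadoop/HDFS API.
        val df = sqlContext.read.parquet("hdfs:///data/events.parquet") // hypothetical path

        // The number of input partitions follows the HDFS block layout,
        // not the Parquet row-group (block) size.
        println(s"partitions = ${df.rdd.partitions.length}")

        sc.stop()
      }
    }
    ```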

    A Spark 1.5 DataFrame partitions a Parquet file as follows (see the sketch after this list):

    • 1 partition per HDFS block
    • If the HDFS block size is smaller than the Parquet block size configured in Spark, a partition is created from multiple HDFS blocks, so that the total size of the partition is no less than the Parquet block size
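
    A rough way to observe this rule is to write a dataset with a Parquet block size larger than the HDFS block size and then count the partitions on read. This is a sketch only, assuming the standard Hadoop/Parquet config keys `dfs.blocksize` and `parquet.block.size` and a hypothetical output path; exact partition counts will vary with data size and cluster defaults.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object BlockSizeExperiment {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("block-size-experiment"))
        val sqlContext = new SQLContext(sc)

        // Write 16 MB HDFS blocks (dfs.blocksize) with 32 MB Parquet row groups
        // (parquet.block.size), so each HDFS block is smaller than a Parquet block.
        sc.hadoopConfiguration.setLong("dfs.blocksize", 16L * 1024 * 1024)
        sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)

        val path = "hdfs:///tmp/blocksize_demo.parquet" // hypothetical path
        sqlContext.range(0, 10000000L).write.mode("overwrite").parquet(path)

        // On read, Spark starts from one partition per HDFS block, but because a
        // 16 MB HDFS block is smaller than the 32 MB Parquet block size, adjacent
        // HDFS blocks are grouped until each partition is at least as large as a
        // Parquet block.
        val readBack = sqlContext.read.parquet(path)
        println(s"partitions = ${readBack.rdd.partitions.length}")

        sc.stop()
      }
    }
    ```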
