Spark DataFrames with Parquet and Partitioning

余生分开走 2021-01-05 08:48

I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that spans 10 blocks; Spark will naturally create 10 partitions.

2 Answers
  •  生来不讨喜
    2021-01-05 09:12

    A Spark DataFrame doesn't load Parquet files into memory. It uses the Hadoop/HDFS API to read them during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
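
    A quick way to see this is to read a Parquet file and check how many partitions the underlying RDD has. The following is only a minimal sketch against the Spark 1.5-era API; the HDFS path is hypothetical.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object InspectParquetPartitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("inspect-parquet-partitions"))
        val sqlContext = new SQLContext(sc)

        // Reading Parquet does not pull the file into memory; every action on
        // the DataFrame re-reads the data through the Hadoop/HDFS API.
        val df = sqlContext.read.parquet("hdfs:///data/events.parquet") // hypothetical path

        // The number of input partitions follows the HDFS block layout,
        // not the Parquet row-group (block) size.
        println(s"partitions = ${df.rdd.partitions.length}")

        sc.stop()
      }
    }
    ```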

    A Spark 1.5 DataFrame partitions a Parquet file as follows (see the sketch after this list):

    • 1 partition per HDFS block
    • If the HDFS block size is smaller than the Parquet block size configured in Spark, a partition is created from multiple HDFS blocks, so that the total size of the partition is no less than the Parquet block size
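
    A rough way to observe this rule is to write a dataset with a Parquet block size larger than the HDFS block size and then count the partitions on read. This is a sketch only, assuming the standard Hadoop/Parquet config keys `dfs.blocksize` and `parquet.block.size` and a hypothetical output path; exact partition counts will vary with data size and cluster defaults.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object BlockSizeExperiment {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("block-size-experiment"))
        val sqlContext = new SQLContext(sc)

        // Write 16 MB HDFS blocks (dfs.blocksize) with 32 MB Parquet row groups
        // (parquet.block.size), so each HDFS block is smaller than a Parquet block.
        sc.hadoopConfiguration.setLong("dfs.blocksize", 16L * 1024 * 1024)
        sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)

        val path = "hdfs:///tmp/blocksize_demo.parquet" // hypothetical path
        sqlContext.range(0, 10000000L).write.mode("overwrite").parquet(path)

        // On read, Spark starts from one partition per HDFS block, but because a
        // 16 MB HDFS block is smaller than the 32 MB Parquet block size, adjacent
        // HDFS blocks are grouped until each partition is at least as large as a
        // Parquet block.
        val readBack = sqlContext.read.parquet(path)
        println(s"partitions = ${readBack.rdd.partitions.length}")

        sc.stop()
      }
    }
    ```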
