Spark Predicate Push Down, Filtering and Partition Pruning for Azure Data Lake

庸人自扰 2021-01-06 14:43

I have been reading about Spark predicate pushdown and partition pruning to understand how much data actually gets read. I have the following questions about them:

Suppos

1 Answer
  •  一生所求
    2021-01-06 15:13

    1) When you filter on the columns you partitioned by, Spark skips the non-matching partition folders completely, so the data files inside them cost you no read IO. This works because the partition values are encoded in the folder names. If you look at your file structure, it's stored as something like:

    parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet
    parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet
    parquet-folder/Year=2019/SchoolName=XYZ/...
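
To make the pruning step concrete, here is a minimal plain-Python sketch of the idea, not Spark's actual implementation. It assumes a hypothetical file listing (the `SchoolName=ABC` and `Year=2020` entries are made up for illustration) and decides which files to read from the paths alone, without opening any file. The helper names `parse_partition_values` and `prune_partitions` are invented for this example.

```python
from typing import Dict, List


def parse_partition_values(path: str) -> Dict[str, str]:
    """Extract key=value folder segments from a Hive-style partitioned path."""
    values: Dict[str, str] = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune_partitions(paths: List[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only the files whose partition values match every requested value."""
    kept = []
    for path in paths:
        values = parse_partition_values(path)
        if all(values.get(key) == value for key, value in wanted.items()):
            kept.append(path)
    return kept


# Hypothetical listing: only the first two paths follow the layout shown above.
files = [
    "parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet",
    "parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet",
    "parquet-folder/Year=2019/SchoolName=ABC/part1.parquet",
    "parquet-folder/Year=2020/SchoolName=XYZ/part1.parquet",
]

# Equivalent in spirit to filtering on Year = 2019 AND SchoolName = 'XYZ'.
to_read = prune_partitions(files, {"Year": "2019", "SchoolName": "XYZ"})
print(to_read)
```

In Spark itself you can check whether pruning happened by looking at the physical plan from `df.explain()`: filters on partition columns show up under `PartitionFilters` on the file scan node.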
    

    2) When you filter on a column that isn't one of your partition columns, Spark has to visit every part file in every folder of that parquet table. With filter pushdown enabled, Spark uses the footer of each part file (where min, max and count statistics are stored, per row group) to determine whether your search value can fall within that range. If it can, Spark reads the data. If not, Spark skips it, so you pay for reading the footer but not for the full read.
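
The min/max check described in point 2 can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not Spark or Parquet code: it pretends each part file has a single set of statistics for one integer column, and the `ChunkStats` class, the file names and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkStats:
    """Footer statistics for one column in one part file (hypothetical model)."""
    file: str
    min_value: int
    max_value: int
    count: int


def might_contain(stats: ChunkStats, value: int) -> bool:
    """Statistics can only rule a file out; inside the range means 'maybe'."""
    return stats.min_value <= value <= stats.max_value


def plan_reads(all_stats: List[ChunkStats], value: int) -> List[str]:
    """Return the files that still have to be read for an equality filter."""
    return [stats.file for stats in all_stats if might_contain(stats, value)]


# Made-up footers for three part files.
footers = [
    ChunkStats("part1.parquet", min_value=1, max_value=50, count=1000),
    ChunkStats("part2.parquet", min_value=51, max_value=100, count=1000),
    ChunkStats("part3.parquet", min_value=40, max_value=60, count=1000),
]

# Equality filter on the value 45: part2 is skipped, part1 and part3 are read.
files_to_read = plan_reads(footers, 45)
print(files_to_read)
```

Note that every footer is still consulted, and files whose ranges overlap the search value cannot be skipped even if they hold no matching rows. In Spark, pushed-down filters appear under `PushedFilters` in the `df.explain()` output, and Parquet pushdown is controlled by the `spark.sql.parquet.filterPushdown` setting.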
