Spark Predicate Push Down, Filtering and Partition Pruning for Azure Data Lake

庸人自扰  2021-01-06 14:43

I have been reading about Spark predicate pushdown and partition pruning to understand how much data actually gets read. I have the following doubts related to this:

Suppos

1 Answer

  •  一生所求  2021-01-06 15:13

    1) When you filter on the columns you partitioned by, Spark skips the non-matching files entirely, so they cost you no IO at all. This works because the partition values are encoded in the directory names rather than inside the files. If you look at your file structure, it is stored as something like the listing below (a short PySpark sketch of the pruning follows it):

    parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet
    parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet
    parquet-folder/Year=2019/SchoolName=XYZ/...
    
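    A minimal PySpark sketch of this, assuming a Parquet table partitioned by Year and SchoolName under the example path from the listing above (path and column names are taken from the listing, not from the truncated question):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

        # Reading the partitioned folder; Year and SchoolName become columns
        # derived from the directory names.
        df = spark.read.parquet("parquet-folder")

        # Filtering on partition columns: Spark prunes at the directory level
        # and lists only Year=2019/SchoolName=XYZ, never touching other folders.
        pruned = df.filter((df.Year == 2019) & (df.SchoolName == "XYZ"))

        # The physical plan shows the predicate under PartitionFilters,
        # confirming directory-level pruning rather than a full scan.
        pruned.explain(True)
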

    2) When you filter on a column that is not part of the partitioning, Spark has to consider every part file in every folder of that Parquet table. Only with pushdown filtering does Spark use the footer of each part file (where min, max and count statistics are stored) to determine whether your search value falls within that file's range. If it does, Spark reads the file fully; if not, Spark skips the whole file, saving you at least that full read.

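    A sketch of checking pushdown on a non-partition column; StudentId is a hypothetical data column used purely for illustration, and the path again reuses the listing above:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

        # Parquet filter pushdown is enabled by default; setting it explicitly
        # so the demonstrated behavior does not depend on cluster defaults.
        spark.conf.set("spark.sql.parquet.filterPushdown", "true")

        df = spark.read.parquet("parquet-folder")

        # StudentId is not a partition column, so Spark must consider every
        # part file, but the pushed predicate lets it skip any file (or row
        # group) whose footer statistics show min > 1000 or max < 1000.
        hits = df.filter(df.StudentId == 1000)

        # The physical plan lists the predicate under PushedFilters.
        hits.explain(True)
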