I had been reading about Spark predicate pushdown and partition pruning to understand how much data gets read. I had the following doubts related to this:
1) When you filter on the columns you partitioned on, Spark will skip the non-matching folders entirely, so they won't cost you any IO. If you look at your file structure, it's stored as something like:
parquet-folder/Year=2019/SchoolName=XYZ/part1.parquet
parquet-folder/Year=2019/SchoolName=XYZ/part2.parquet
parquet-folder/Year=2019/SchoolName=XYZ/...
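
To make 1) concrete, here is a minimal PySpark sketch (the DataFrame, the Score column, and the parquet-folder path are just assumptions for illustration): writing with partitionBy produces exactly that folder layout, and a filter on the partition columns is resolved against the directory names alone, so non-matching folders are never opened.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

# Hypothetical data; any DataFrame with Year and SchoolName columns works.
df = spark.createDataFrame(
    [(2019, "XYZ", 91), (2019, "ABC", 85), (2020, "XYZ", 78)],
    ["Year", "SchoolName", "Score"],
)

# partitionBy produces the Year=.../SchoolName=... folders shown above.
df.write.mode("overwrite").partitionBy("Year", "SchoolName").parquet("parquet-folder")

# A filter on the partition columns is answered from directory names alone,
# so only parquet-folder/Year=2019/SchoolName=XYZ/ is ever read.
pruned = spark.read.parquet("parquet-folder").filter(
    "Year = 2019 AND SchoolName = 'XYZ'"
)
pruned.explain()  # the scan node should list these as PartitionFilters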
2) When you filter on a column that isn't part of your partitioning, Spark will scan every part file in every folder of that parquet table. Only with pushdown filtering will Spark use the footer of every part file (where the min, max, and count statistics are stored) to determine whether your search value falls within that range. If it does, Spark reads the file fully. If not, Spark skips the whole file, saving you at least the full read.
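
A quick way to check 2) in practice is explain(): with parquet pushdown enabled (spark.sql.parquet.filterPushdown, true by default) the scan node lists the predicates handed down to the reader. A hedged sketch, reusing the spark session, the parquet-folder path, and the made-up Score column from above:

```python
# Score is not a partition column, so every folder is visited, but with
# parquet filter pushdown the predicate is handed to the reader, which
# consults each file's footer statistics before reading the data pages.
scores = spark.read.parquet("parquet-folder").filter("Score > 90")
scores.explain()
# The physical plan's scan node shows something like:
#   PushedFilters: [IsNotNull(Score), GreaterThan(Score,90)]
# A file whose footer says max(Score) <= 90 is skipped without a full read.
```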