My dataset is partitioned in this way:
Year=yyyy
|---Month=mm
| |---Day=dd
| | |---
What is the easiest and effic
Edited to add multiple load paths to address comment.
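You can use a regex-style syntax in the load paths (strictly speaking these are Hadoop glob patterns, so {a,b} means alternation and [0-9] is a character class).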
val dataset = spark
  .read
  .format("parquet")
  .option("filterPushdown", "true")
  .option("basePath", "hdfs:///basepath/")
  .load(
    // October: days 06-09 and 10-31
    "hdfs:///basepath/Year=2017/Month=10/Day={0[6-9],[1-3][0-9]}/*/",
    // November: days 01-03
    "hdfs:///basepath/Year=2017/Month=11/Day={0[1-3]}/*/")
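If the date range is only known at run time, hand-writing glob patterns gets awkward. One alternative is to generate one path per day and pass them all to load. Below is a minimal sketch of that idea; partitionPaths is a hypothetical helper (not part of Spark), and it assumes the zero-padded Year=yyyy/Month=mm/Day=dd layout from the question with one more directory level under each day.

```scala
import java.time.LocalDate

// Hypothetical helper, not a Spark API: builds one partition path per day
// in the inclusive range [start, end], using the zero-padded
// Year=yyyy/Month=mm/Day=dd layout from the question.
def partitionPaths(basePath: String, start: LocalDate, end: LocalDate): Seq[String] = {
  val base = basePath.stripSuffix("/")
  Iterator
    .iterate(start)(_.plusDays(1))   // start, start + 1 day, ...
    .takeWhile(d => !d.isAfter(end)) // stop once we pass the end date
    .map { d =>
      f"$base/Year=${d.getYear}/Month=${d.getMonthValue}%02d/Day=${d.getDayOfMonth}%02d/*/"
    }
    .toList
}
```

The result can be passed as varargs, e.g. .load(partitionPaths("hdfs:///basepath/", from, to): _*), with the basePath option still set. Depending on the Spark version, a path that matches nothing may make the load fail, so you may need to drop days that have no data first.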
See also: How to use regex to include/exclude some input files in sc.textFile?
Note: you don't need to write X=* for a partition level; you can just use * if you want all days, months, etc.
You should probably also do some reading about predicate pushdown (i.e., the filterPushdown option set to true above).
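A related approach is to load from the root of the dataset and filter on the partition columns, letting Spark skip the directories that cannot match (this is usually called partition pruning, and it is separate from Parquet filter pushdown on data columns). The following is only a sketch: it assumes Year, Month, and Day are inferred as integer columns, and the app name is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("partition-filter-sketch").getOrCreate()

// Load from the root of the partitioned dataset so that Year, Month, and Day
// are discovered as columns (assumed here to be inferred as integers).
val all = spark.read.parquet("hdfs:///basepath/")

// 2017-10-06 through 2017-11-03, expressed on the partition columns.
val dataset = all.filter(
  col("Year") === 2017 &&
    ((col("Month") === 10 && col("Day") >= 6) ||
     (col("Month") === 11 && col("Day") <= 3))
)
```

Calling dataset.explain() should show whether the conditions were applied as partition filters; if they were not, the glob paths are the safer option.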
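Finally, you will notice the basePath option above. Without it, Spark treats each loaded path as the root for partition discovery, so Year, Month, and Day would not appear as columns in the result. More detail can be found here: Prevent DataFrame.partitionBy() from removing partitioned columns from schema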