Spark SQL queries on partitioned data using Date Ranges

轻奢々 2021-02-04 14:41

My dataset is partitioned in this way:

Year=yyyy
 |---Month=mm
 |   |---Day=dd
 |   |   |---

What is the easiest and most efficient way to query the data for an arbitrary date range?

2 Answers
  •  不知归路
     2021-02-04 15:02

    Edited to add multiple load paths to address comment.

    You can pass multiple paths with glob-style patterns (curly braces and character classes) to load.

    val dataset = spark
      .read
      .format("parquet")
      .option("filterPushdown", "true")
      .option("basePath", "hdfs:///basepath/")
      // Day={0[6-9],[1-3][0-9]} matches days 06-09 and 10-39 (non-existent
      // directories are simply not matched); together with Day={0[1-3]} in
      // November this covers the range Oct 6 - Nov 3.
      .load("hdfs:///basepath/Year=2017/Month=10/Day={0[6-9],[1-3][0-9]}/*/",
        "hdfs:///basepath/Year=2017/Month=11/Day={0[1-3]}/*/")
    

    See also: How to use regex to include/exclude some input files in sc.textFile?

    Note: you don't need the X=* form; a bare * works if you want all days, months, etc.

    You should probably also do some reading about predicate pushdown (i.e., the filterPushdown option set to true above).
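    A minimal self-contained sketch of what pruning on partition columns buys you (the temp directory, the Year/Month/Day column names, and the tiny sample rows are all illustrative, not from the original question's data): filtering on partition columns is resolved against the directory names, so Spark only ever scans the matching Day=... directories.

    ```scala
    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("partition-pruning-sketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: one row per (Year, Month, Day) partition.
    val base = Files.createTempDirectory("basepath").toString
    Seq((2017, 10, 5, "a"), (2017, 10, 6, "b"), (2017, 11, 1, "c"))
      .toDF("Year", "Month", "Day", "value")
      .write
      .partitionBy("Year", "Month", "Day")
      .parquet(base)

    // The filter on Month/Day is applied during partition discovery,
    // so only the Month=10/Day=6 directory is read.
    val pruned = spark.read.parquet(base)
      .where(col("Month") === 10 && col("Day") >= 6)

    val rows = pruned.select("value").as[String].collect().toSeq
    ```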

    Finally, note the basePath option above; the reason for it is explained here: Prevent DataFrame.partitionBy() from removing partitioned columns from schema
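    To see the effect of basePath concretely, here is a small self-contained sketch (again with an illustrative temp directory and made-up data): loading a sub-directory directly drops the partition columns above it from the schema, while setting basePath keeps them.

    ```scala
    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("basepath-sketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    val base = Files.createTempDirectory("basepath").toString
    Seq((2017, 10, 6, "x")).toDF("Year", "Month", "Day", "value")
      .write.partitionBy("Year", "Month", "Day").parquet(base)

    // Loading below the partition root: Year and Month vanish from the schema.
    val without = spark.read.parquet(s"$base/Year=2017/Month=10")

    // With basePath, Spark still discovers Year=/Month=/Day= as partition columns.
    val withBase = spark.read
      .option("basePath", base)
      .parquet(s"$base/Year=2017/Month=10")

    val colsWithout = without.columns.toSet
    val colsWith = withBase.columns.toSet
    ```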
