Spark SQL queries on partitioned data using Date Ranges

轻奢々 2021-02-04 14:41

My dataset is partitioned in this way:

Year=yyyy
 |---Month=mm
 |   |---Day=dd
 |   |   |---

What is the easiest and effic

2 Answers
  •  不知归路
    2021-02-04 15:02

    Edited to add multiple load paths to address comment.

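    You can use a regex-style glob syntax in the load paths: {a,b} matches either alternative, and [0-9] matches one character in that range.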

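    val dataset = spark
      .read
      .format("parquet")
      // Ask for Parquet filter pushdown (see the note on predicate pushdown below)
      .option("filterPushdown", "true")
      // Root of the partitioned dataset, so Year/Month/Day stay in the schema
      .option("basePath", "hdfs:///basepath/")
      // First glob: days 06-09 and 10 onward in October 2017; second glob: days 01-03 in November 2017
      .load("hdfs:///basepath/Year=2017/Month=10/Day={0[6-9],[1-3][0-9]}/*/",
        "hdfs:///basepath/Year=2017/Month=11/Day={0[1-3]}/*/")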
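
For longer ranges, writing the globs by hand gets error-prone. One possible alternative, sketched below, is to enumerate one path per day and pass the list to load. This is only a sketch: PartitionPaths and dailyPaths are hypothetical names (not part of Spark), and it assumes zero-padded two-digit Month and Day directory names as in the globs above. Anything below the Day level is not handled here.

```scala
import java.time.LocalDate

object PartitionPaths {
  // Hypothetical helper (not a Spark API): one directory path per day in [start, end].
  // Assumes the layout Year=yyyy/Month=mm/Day=dd with zero-padded month and day.
  def dailyPaths(basePath: String, start: LocalDate, end: LocalDate): List[String] = {
    require(!end.isBefore(start), "end must not be before start")
    val base = basePath.stripSuffix("/")
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(end))
      .map(d => f"$base/Year=${d.getYear}/Month=${d.getMonthValue}%02d/Day=${d.getDayOfMonth}%02d")
      .toList
  }
}
```

The result could then be passed as varargs, for example .load(paths: _*), together with the same basePath option. As far as I know, Spark rejects an explicit path that does not exist, so days with no data may need to be filtered out first.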
    

    See also: How to use regex to include/exclude some input files in sc.textFile?

    Note: you don't need the X=* form; you can just use * if you want all days, months, etc.

    You should probably also read up on predicate pushdown (i.e. filterPushdown set to true above).
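
A related approach is to load the dataset root and filter on the partition columns, leaving it to Spark to skip directories that cannot match. The sketch below is an illustration under assumptions, not a tested recipe: DateRangeFilter, dateKey and betweenDates are hypothetical names, and it assumes Year, Month and Day are discovered as integer partition columns. Whether the filter actually prunes partitions should be checked in the query plan.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DateRangeFilter {
  // Hypothetical helper: encodes a calendar date as the integer yyyymmdd.
  def dateKey(year: Int, month: Int, day: Int): Int = year * 10000 + month * 100 + day

  // Keeps rows whose (Year, Month, Day) values fall in [from, to], both given as yyyymmdd keys.
  // Assumes the three partition columns were inferred as integers.
  def betweenDates(df: DataFrame, from: Int, to: Int): DataFrame = {
    val key = col("Year") * 10000 + col("Month") * 100 + col("Day")
    df.where(key.between(from, to))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("date-range-filter").getOrCreate()
    // Reading the root directory lets Spark discover Year/Month/Day as columns.
    val all = spark.read.parquet("hdfs:///basepath/")
    val ranged = betweenDates(all, dateKey(2017, 10, 6), dateKey(2017, 11, 3))
    ranged.explain() // look for a PartitionFilters entry in the file scan node
    spark.stop()
  }
}
```

Compared with globbing, this keeps the date logic in one expression, at the cost of relying on the optimizer to avoid listing and reading unrelated partitions.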

    Finally, note the basePath option above. The reason for it is explained here: Prevent DataFrame.partitionBy() from removing partitioned columns from schema
