AWS Glue predicate push down condition has no effect

一整个雨季 2021-01-14 11:34

I have a MySQL source from which I am creating a Glue DynamicFrame with a pushdown predicate, as follows:

datasource = glueContext.create_dynamic_fra         


        
2 Answers
  • 2021-01-14 12:18

    This is great! I was able to use it to obtain the last 30 days of data using my "dt" partition column:

    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database = "my_db",
        table_name = "my_table",
        push_down_predicate = "to_date(dt) >= date_sub(current_date, 30)", 
        transformation_ctx = "datasource0"
    )
    

    I'm using Glue 1.0 - Spark 2.4 - Python 2.
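
A variant of the same idea, in case you want the exact cutoff to show up in the job logs: compute the date in Python and pass it as a literal. This is only a sketch. It assumes `dt` is stored as a `yyyy-MM-dd` string (so string order matches date order), and `cutoff_predicate` is a made-up helper name, not part of Glue.

```python
from datetime import date, timedelta

def cutoff_predicate(column, days, today=None):
    """Build a predicate such as "dt >= '2020-12-15'" covering the last `days` days."""
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    # isoformat() yields yyyy-MM-dd, which sorts the same way the dates do
    return "{} >= '{}'".format(column, cutoff.isoformat())

# Fixed "today" so the result is reproducible
predicate = cutoff_predicate("dt", 30, today=date(2021, 1, 14))
print(predicate)  # dt >= '2020-12-15'
```

The resulting string would then be passed as `push_down_predicate = predicate` in the `from_catalog` call above.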

  • 2021-01-14 12:23

    Pushdown predicates work on partition columns only. In other words, your data files should be placed in hierarchically structured folders. For example, if the data is located in s3://bucket/dataset/ and partitioned by year, month and day, then the structure should be as follows:

    s3://bucket/dataset/year=2018/month=7/day=18/<data-files-here>
    

    In that case, a pushdown predicate works on the columns year, month and day only:

    datasource = glueContext.create_dynamic_frame_from_catalog(
        database = source_catalog_db, 
        table_name = source_catalog_tbl, 
        push_down_predicate = "year = 2017 and month > 6 and day between 3 and 10", 
        transformation_ctx = "datasource")
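
To make the link between the folder layout and the predicate concrete, here is a small sketch that derives both from the same date. The helper names `partition_prefix` and `day_predicate` are invented for illustration, and it assumes unpadded values (`month=7`, not `month=07`) as in the path above.

```python
from datetime import date

def partition_prefix(base, d):
    """Hive-style folder for one day, e.g. .../year=2018/month=7/day=18/"""
    return "{}/year={}/month={}/day={}/".format(base.rstrip("/"), d.year, d.month, d.day)

def day_predicate(d):
    """Predicate that selects exactly that folder through the partition columns."""
    return "year = {} and month = {} and day = {}".format(d.year, d.month, d.day)

d = date(2018, 7, 18)
print(partition_prefix("s3://bucket/dataset/", d))  # s3://bucket/dataset/year=2018/month=7/day=18/
print(day_predicate(d))                             # year = 2018 and month = 7 and day = 18
```

Only the names that appear as `key=value` folders can be used in the predicate; a column stored inside the data files is not visible at that stage.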
    

    Also keep in mind that pushdown predicates work with S3 data sources only, so they have no effect on a JDBC source such as MySQL.
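
For a JDBC source like the MySQL table in the question, one possible workaround is to let the database do the filtering by giving Spark's JDBC reader a subquery as `dbtable`. A rough sketch, not tested against Glue: the URL, table, column and credentials are placeholders, and `jdbc_options` / `read_filtered` are names made up for this example.

```python
def jdbc_options(url, table, where, user, password):
    """Options for Spark's JDBC reader, with the filter wrapped in a subquery."""
    return {
        "url": url,
        # Spark accepts a parenthesized subquery with an alias in place of a table name
        "dbtable": "(SELECT * FROM {} WHERE {}) AS filtered".format(table, where),
        "user": user,
        "password": password,
    }

def read_filtered(spark, options):
    """Load the filtered rows; in a Glue job `spark` could be glueContext.spark_session."""
    return spark.read.format("jdbc").options(**options).load()

opts = jdbc_options(
    "jdbc:mysql://host:3306/my_db",  # placeholder endpoint
    "my_table",                      # placeholder table
    "created_at >= '2020-12-15'",    # hypothetical column and cutoff
    "user",
    "secret",
)
print(opts["dbtable"])  # (SELECT * FROM my_table WHERE created_at >= '2020-12-15') AS filtered
```

`read_filtered` returns a Spark DataFrame; if the rest of the job expects a DynamicFrame, `DynamicFrame.fromDF(df, glueContext, "name")` converts it.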

    Here is a nice blog post written by AWS Glue devs about data partitioning.
