How to choose the latest partition in a BigQuery table?

暗喜 2020-12-09 21:21

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.

I've tried (as far a

7 Answers
  • 2020-12-09 21:43

    I had this answer in a less popular question, so copying it here as it's relevant (and this question is getting more pageviews):

    Mikhail's answer looks like this (working on public data):

    SELECT MAX(views)
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
    WHERE DATE(datehour) = DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)     
    AND wiki='es' 
    # 122.2 MB processed
    

    But it seems the question wants something like this:

    SELECT MAX(views)
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
    WHERE DATE(datehour) = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')     
    AND wiki='es'
    # 50.6 GB processed
    

    ... but for way less than 50.6GB

    What you need now is some sort of scripting, to perform this in 2 steps:

    max_date = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')   
    
    ;
    
    SELECT MAX(views)
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
    WHERE DATE(datehour) = {{max_date}}
    AND wiki='es'
    # 115.2 MB processed
    

    You will have to script this outside BigQuery - or wait for news on https://issuetracker.google.com/issues/36955074.
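
    One way to script the two steps outside BigQuery is sketched below in Python. This is a minimal sketch, not a definitive implementation: `run_query` is a hypothetical callable (not part of any BigQuery library) that takes a SQL string and returns a list of row tuples, and the helper names are made up for illustration. The point is that the second query receives the date as a constant literal rather than as a subquery.

```python
import datetime
from typing import Any, Callable, List, Tuple

Row = Tuple[Any, ...]
# Hypothetical runner: takes a SQL string, returns a list of row tuples.
RunQuery = Callable[[str], List[Row]]

TABLE = "`fh-bigquery.wikipedia_v3.pageviews_2019`"


def max_date_sql() -> str:
    """Step 1: find the most recent date that has rows for wiki='es'."""
    return f"SELECT DATE(MAX(datehour)) FROM {TABLE} WHERE wiki='es'"


def final_sql(max_date: datetime.date) -> str:
    """Step 2: filter on a constant DATE literal instead of a subquery."""
    # date.isoformat() yields YYYY-MM-DD, the body of a DATE literal.
    return (
        f"SELECT MAX(views) FROM {TABLE} "
        f"WHERE DATE(datehour) = DATE '{max_date.isoformat()}' AND wiki='es'"
    )


def max_views_latest_day(run_query: RunQuery) -> Any:
    """Run both steps, feeding the result of step 1 into step 2."""
    max_date = run_query(max_date_sql())[0][0]
    return run_query(final_sql(max_date))[0][0]
```

    In practice `run_query` would wrap whatever client you already use to submit queries; only the second query benefits from the constant filter.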

  • 2020-12-09 21:48

    October 2019 Update

    Support for Scripting and Stored Procedures is now in beta (as of October 2019).

    You can now submit multiple statements separated by semicolons, and BigQuery will run them.

    See the example below:

    DECLARE max_date TIMESTAMP;
    SET max_date = (
      SELECT MAX(_PARTITIONTIME) FROM `project.dataset.partitioned_table`);
    
    SELECT * FROM `project.dataset.partitioned_table`
    WHERE _PARTITIONTIME = max_date;
    

    Update for those who downvote without checking the context:

    I think this answer was accepted because it addressed the OP's main question, "Is there a way I can pull data only from the latest partition in BigQuery?" The comments already noted that the BQ engine obviously still scans ALL rows and only returns results from the most recent partition. As mentioned in a comment on the question, that is easily addressed by scripting the logic: first get the result of the subquery, then use it in the final query.

    Try

    SELECT * FROM [dataset.partitioned_table]
    WHERE _PARTITIONTIME IN (
      SELECT MAX(TIMESTAMP(partition_id))
      FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
    )  
    

    or

    SELECT * FROM [dataset.partitioned_table]
    WHERE _PARTITIONTIME IN (
      SELECT MAX(_PARTITIONTIME) 
      FROM [dataset.partitioned_table]
    )  
    
  • 2020-12-09 21:52

    You can leverage the __TABLES__ list of tables to avoid re-scanning everything or having to hope that the latest partition is no more than ~3 days old. I used split and ordinal to guard against the case where my table prefix appears more than once in the table name for some reason.

    This should work for either _PARTITIONTIME or _TABLE_SUFFIX.

    select * from `project.dataset.tablePrefix*` 
    where _PARTITIONTIME = (
        SELECT split(table_id,'tablePrefix')[ordinal(2)] FROM `project.dataset.__TABLES__` 
        where table_id like 'tablePrefix%'
        order by table_id desc limit 1)
    
  • 2020-12-09 22:03

    I found a workaround for this issue. You can use a WITH clause to select the last few partitions and then filter the result. I think this is a better approach because:

    1. You are not limited to a fixed partition date (like today - 1 day). It will always take the latest partition from the given range.
    2. It will only scan the last few partitions, not the whole table.

    Example scanning the last 3 partitions:

    WITH last_three_partitions as (select *, _PARTITIONTIME as PARTITIONTIME 
        FROM dataset.partitioned_table 
        WHERE  _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY))
    SELECT col1, PARTITIONTIME from last_three_partitions 
    WHERE PARTITIONTIME = (SELECT max(PARTITIONTIME) from last_three_partitions)
    
  • 2020-12-09 22:07

    A compromise that manages to query only a few partitions without resorting to scripting or failing with missing partitions for fixed dates.

    WITH latest_partitions AS (
      SELECT *, _PARTITIONDATE AS date
      FROM `myproject.mydataset.mytable`
      WHERE _PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    )
    SELECT
      *
    FROM
      latest_partitions
    WHERE
      date = (SELECT MAX(date) FROM latest_partitions)
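
    The selection logic of the windowed queries above can be sketched in plain Python to show where the compromise lies. This is only an illustration under stated assumptions: `latest_partition_in_window` is a hypothetical helper, and the list of partition dates is made up. It mirrors the filter `_PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)` followed by `MAX(date)`.

```python
import datetime
from typing import Iterable, Optional


def latest_partition_in_window(
    partition_dates: Iterable[datetime.date],
    today: datetime.date,
    window_days: int = 7,
) -> Optional[datetime.date]:
    """Return the newest partition date inside the window, or None."""
    # Same strict comparison as the SQL filter: date > today - window.
    cutoff = today - datetime.timedelta(days=window_days)
    recent = [d for d in partition_dates if d > cutoff]
    # If no partition falls inside the window, nothing is selected.
    return max(recent) if recent else None
```

    If the newest partition is older than the window, the helper returns None, which corresponds to the query returning no rows. The window size is therefore a trade-off between bytes scanned and tolerance for gaps in the partitions.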
    
  • 2020-12-09 22:08

    Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.

    As far as I can tell from the documentation and running tests, the accepted answer will not prune partitions because a subquery is used to determine the most recent partition:

    Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.

    So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.

    The trick is to compare against a more-or-less constant value instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by selecting yesterday's partition like so:

    SELECT * FROM `dataset.partitioned_table`
        WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    

    Sure, this isn't always the latest data, but in my case it happens to be close enough. Use INTERVAL 0 DAY if you want today's data and don't mind that the query will return 0 results for the part of the day when the partition hasn't been created yet.

    I'm happy to learn if there is a better workaround to get the latest partition!
