I want a “materialized view” of the latest records

前端 未结 1 859
予麋鹿
予麋鹿 2020-12-03 19:13

As I keep appending rows to BigQuery, I\'d like to have a \"materialized view\" of the latest row for each id.

How can I do that?

相关标签:
1条回答
  • 2020-12-03 19:36

    2018-10: BigQuery doesn't support materialized views, but you can use this approach:

    Let's say you want a table with the latest info for each row, and you want to keep it updated - so anyone querying can easily access the latest row without having to scan the whole append-only table.

    For this example I'll use my Wikipedia clustered logs - and I'll create a table with the latest rows of all English pages that start with 'A'. These restrictions make my queries faster and smaller for this demo purposes.

    Let's create the table first:

    CREATE TABLE `wikipedia_vt.just_latest_rows` AS
    SELECT latest_row.* 
    FROM (
      SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
      FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
      WHERE datehour BETWEEN "2018-10-18" AND "2018-10-21" 
      AND wiki='en' AND title LIKE 'A%'
      GROUP BY title
    )
    

    And now I want to update it with all the new rows received since that date:

    MERGE `wikipedia_vt.just_latest_rows` T
    # our "materialized view"
    USING  (
      SELECT latest_row.* 
      FROM (
        SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
        FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
        WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY )
        # change to CURRENT_TIMESTAMP() or let scheduled queries do it
        AND datehour > '2000-01-01' # nag
        AND wiki='en' AND title LIKE 'A%'
        GROUP BY title
      )
    ) S
    ON T.title = S.title
    
    WHEN MATCHED THEN
      # if the row is there, we update the views and time
      UPDATE SET views = S.views, datehour=S.datehour
    
    WHEN NOT MATCHED BY TARGET THEN
      # if the row is not there, we insert it 
      INSERT (datehour, wiki, title, views) VALUES (datehour, wiki, title, views)
    

    Now you should set up a process to run this query periodically. To keep querying costs down, make sure the process changes the starting date for updates.

    A simple way to set up this process is to use the new BigQuery Scheduled Queries, which will replace @run_time with the current timestamp.

    To create a view that combines this approach with a real-time view of the latest records, see:

    • I want "live materialized views", with the latest info for any row

    0 讨论(0)
提交回复
热议问题