Split table into multiple tables based on date using bigquery with a single query for partitioning

Asked by 孤城傲影 on 2021-01-27 01:34 · 2 answers · 1913 views

The original "why" of what I want to do is:

Restore a table maintaining its original partitioning, instead of it all going into today's partition.

What I thoug

2 Answers
  • 2021-01-27 02:15

    Answering myself here. Another approach I've seen done is to write a script that:

    1. Parses the tablebackup.json file and outputs many tablebackuppartitionYYYYMMDD.json files, split on a provided parameter.

    2. Creates a batch script to `bq load` all the files into the appropriate table partitions.

    The script would need to process the file row by row or in chunks to be able to handle massive backups, and it would take some time. The advantage of this method is that it is generic and usable by a sysadmin untrained in BigQuery.
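    A minimal sketch of such a script, assuming a newline-delimited JSON export with a date field — the field name `date`, the table name, and the file-naming pattern below are all hypothetical placeholders:

```python
import json
from collections import defaultdict

def split_backup(lines, date_field="date"):
    """Group newline-delimited JSON rows by their date (YYYYMMDD)."""
    partitions = defaultdict(list)
    for line in lines:
        row = json.loads(line)
        # Normalize e.g. "2016-01-01" to "20160101" for the partition suffix
        day = str(row[date_field]).replace("-", "")
        partitions[day].append(line)
    return partitions

def bq_load_commands(partitions, table="mydataset.mytable"):
    """Emit one `bq load` command per daily file, targeting each
    table partition via the $YYYYMMDD partition decorator."""
    return [
        f"bq load --source_format=NEWLINE_DELIMITED_JSON "
        f"{table}${day} tablebackup_{day}.json"
        for day in sorted(partitions)
    ]
```

    For a massive backup you would stream the file and flush each partition's buffer to disk periodically, rather than holding everything in memory as this sketch does.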

  • 2021-01-27 02:31

    The main problem here is the full table scan for each and every day. The rest is less of a problem and can easily be scripted in any client of your choice.

    So, the question below is: how to avoid a full table scan for each and every day?

    Try the step-by-step below to see the approach.
    It is generic enough to extend/apply to your real case; meantime I am using the same example as in your question, and I am limiting the exercise to just 10 days.

    Step 1 – Create a pivot table
    In this step we a) compress each row's content into a record/array and b) put them all into the respective "daily" column.

    #standardSQL
    SELECT
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160101' THEN r END) AS day20160101,
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160102' THEN r END) AS day20160102,
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160103' THEN r END) AS day20160103,
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160104' THEN r END) AS day20160104,
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160105' THEN r END) AS day20160105,
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160106' THEN r END) AS day20160106,
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160107' THEN r END) AS day20160107,
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160108' THEN r END) AS day20160108,
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160109' THEN r END) AS day20160109,
      ARRAY_CONCAT_AGG(CASE WHEN d = 'day20160110' THEN r END) AS day20160110
    FROM (
      SELECT d, r, ROW_NUMBER() OVER(PARTITION BY d) AS line
      FROM (
        SELECT 
          stn, CONCAT('day', year, mo, da) AS d, ARRAY_AGG(t) AS r
        FROM `bigquery-public-data.noaa_gsod.gsod2016` AS t 
        GROUP BY stn, d
      ) 
    )
    GROUP BY line
    

    Run the above query in the Web UI with pivot_table (you can choose whatever name you want here) as the destination.

    As you can see, here we get a table with 10 columns, one column per day, and the schema of each column is a copy of the schema of the original table.
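    Writing out 365 such columns by hand would be tedious; a small generator (a sketch, not part of the original answer) can build the SELECT list for any date range:

```python
from datetime import date, timedelta

def pivot_columns(start, end):
    """One ARRAY_CONCAT_AGG line per day between start and end
    (inclusive), matching the pivot query's column pattern."""
    cols = []
    d = start
    while d <= end:
        day = d.strftime("day%Y%m%d")  # e.g. "day20160101"
        cols.append(
            f"  ARRAY_CONCAT_AGG(CASE WHEN d = '{day}' THEN r END) AS {day}"
        )
        d += timedelta(days=1)
    return ",\n".join(cols)
```

    Paste the generated lines between SELECT and FROM in the pivot query above.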

    Step 2 – Create the sharded tables one by one, ONLY scanning the respective column (no full table scan)

    #standardSQL
    SELECT r.*
    FROM pivot_table, UNNEST(day20160101) AS r
    

    Run the above query from the Web UI with a destination table named mytable_20160101.

    You can run the same for the next day:

    #standardSQL
    SELECT r.*
    FROM pivot_table, UNNEST(day20160102) AS r
    

    Now you should have a destination table named mytable_20160102, and so on.
    You should be able to automate/script this step with any client of your choice. Note: those final daily tables will have exactly the same schema as the original table!

    There are many variations of how you can use the above approach; it is up to your creativity.

    Note: BigQuery allows up to 10,000 columns per table, so 365 columns for the respective days of one year is definitely not a problem here :o)
