Delete/update table entries by joining 2 tables on Google BigQuery without import/export

盖世英雄少女心 2021-01-14 14:19

We have a use case with hundreds of millions of entries in a table, and we have trouble splitting it up further. 99% of operations are append-only. However, we have oc

2 Answers
  • 2021-01-14 14:38

    There is a relatively simple option that we found efficient in similar scenarios with BigQuery.
    It lets you query any time-based snapshot as well as the current snapshot.

    In short, the idea is to keep one master table plus daily history tables.
    During the day, the current daily table receives all changes (new, update, delete). A daily process then merges the last completed daily table with the master table and writes the result back to the same master table. Before that, of course, a backup is taken by copying the latest master table (a free operation).

    The daily update keeps the master table clean and current as of the previous day.
    At any given moment you can get the most recent data by querying only the (junk-free) master table and today's table.
    At the same time, because you keep all the daily tables, you can query any historical data.
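The daily merge described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's actual job: SQLite (via Python's standard `sqlite3` module, SQLite 3.25+ for window functions) stands in for BigQuery so the example runs locally, and the schema (`master(id, value, ts)`, `daily(id, value, ts, op)`), the `op` values, and the table names `master_next` and `master_backup` are all hypothetical.

```python
import sqlite3

# Hypothetical schema: master(id, value, ts) and daily(id, value, ts, op),
# where op is one of 'new', 'update', 'delete'.
# The query unions both tables, keeps the newest row per id,
# and drops ids whose newest row is a delete.
MERGE_SQL = """
CREATE TABLE master_next AS
SELECT id, value, ts
FROM (
    SELECT id, value, ts, op,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
    FROM (
        SELECT id, value, ts, 'new' AS op FROM master
        UNION ALL
        SELECT id, value, ts, op FROM daily
    )
)
WHERE rn = 1 AND op <> 'delete'
"""


def merge_daily_into_master(conn):
    """Back up master, then rebuild it from master + the completed daily table."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS master_backup")
        conn.execute("CREATE TABLE master_backup AS SELECT * FROM master")
        conn.execute(MERGE_SQL)
        conn.execute("DROP TABLE master")
        conn.execute("ALTER TABLE master_next RENAME TO master")


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE master (id INTEGER, value TEXT, ts INTEGER);
        CREATE TABLE daily  (id INTEGER, value TEXT, ts INTEGER, op TEXT);
        INSERT INTO master VALUES (1, 'a', 100), (2, 'b', 100), (3, 'c', 100);
        INSERT INTO daily VALUES
            (2, 'b2', 200, 'update'),
            (3, NULL, 210, 'delete'),
            (4, 'd',  220, 'new');
    """)
    merge_daily_into_master(conn)
    return conn.execute("SELECT id, value FROM master ORDER BY id").fetchall()


if __name__ == "__main__":
    print(demo())
```

In BigQuery itself, the same SELECT would presumably be run as a query job whose destination is the master table, with the backup done as a table copy beforehand. The drop-and-rename at the end is specific to this SQLite stand-in.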

    Of course, the classic option of adding all data (new, update, delete) to the master table with respective qualifiers still looks good both price- and performance-wise, because your main (99%) data are new entries!

    In your case, I personally would vote for the classic approach with periodic cleaning of historical entries.

    Finally, in my mind, this is less about joining and more about a union, using table wildcards and window functions.

  • 2021-01-14 14:54

    So to add more on my comment:

    Why don't you just accept the updates as a new row in your table, and have queries that read only the last row from the table? That's much easier.

    Create a view like this:

    SELECT * FROM (
      SELECT
        -- rank each user's rows from newest to oldest
        RANK() OVER (PARTITION BY user_id ORDER BY timestamp DESC) AS _rank,
        *
      FROM [db.userupdate]  -- the append-only base table, not the view itself
    ) WHERE _rank = 1
    

    Then update your queries to read from the view instead of the base table, and you are done.

    Some context on how we use this: we have an events table that holds user profile data. On every update we append the complete profile row again in BQ. That means we end up with versioned content, with as many rows for a given user_id as the number of updates that user has made. This is all in the same table, and by looking at the timestamp we know the order of the updates. Let's say the table is [userupdate]. If we do a

    select * from userupdate where user_id=10
    

    it will return all updates this user has made to their profile, in no guaranteed order.

    But we created a view (only once), using the syntax above. And now when we run:

    select * from userupdate_last where user_id=10 #notice the table name changed to view name
    

    it will return only one row: the user's latest row. So whenever we want only the latest row per user from a table holding a bunch of append-only rows, we just swap the table name for the view name in our queries.
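One caveat is worth illustrating: `RANK()` gives tied rows the same rank, so if two updates for a user share the same timestamp, a `_rank = 1` filter returns both. The sketch below compares that with a `ROW_NUMBER()` variant that adds a tie-breaker. It is an assumption-laden illustration rather than the answerer's setup: SQLite (through Python's standard `sqlite3` module, SQLite 3.25+) stands in for BigQuery, and the columns `name` and `ts`, the view name `userupdate_last_rank`, and the helper names are hypothetical.

```python
import sqlite3


def build_demo():
    """Create a small append-only table and two 'latest row' views."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE userupdate (user_id INTEGER, name TEXT, ts INTEGER);
        INSERT INTO userupdate VALUES
            (10, 'Ann',    100),
            (10, 'Ann B.', 200),
            (10, 'Ann C.', 200),  -- same timestamp as the previous row
            (11, 'Bob',    150);

        -- View in the style of the answer: RANK() keeps every tied row.
        CREATE VIEW userupdate_last_rank AS
        SELECT user_id, name, ts FROM (
            SELECT *, RANK() OVER (
                PARTITION BY user_id ORDER BY ts DESC) AS _rank
            FROM userupdate
        ) WHERE _rank = 1;

        -- Variant: ROW_NUMBER() with a tie-breaker keeps exactly one row.
        CREATE VIEW userupdate_last AS
        SELECT user_id, name, ts FROM (
            SELECT *, ROW_NUMBER() OVER (
                PARTITION BY user_id ORDER BY ts DESC, name DESC) AS _rn
            FROM userupdate
        ) WHERE _rn = 1;
    """)
    return conn


def latest(conn, view, user_id):
    """Return the names the given view reports for one user, sorted by name."""
    # The view name is interpolated only because it comes from this demo itself.
    sql = f"SELECT name FROM {view} WHERE user_id = ? ORDER BY name"
    return conn.execute(sql, (user_id,)).fetchall()


if __name__ == "__main__":
    conn = build_demo()
    print(latest(conn, "userupdate_last_rank", 10))  # both tied rows
    print(latest(conn, "userupdate_last", 10))       # exactly one row
```

If update timestamps are guaranteed unique per user, the two views behave identically. Otherwise the tie-breaker column decides which row counts as "last", and that choice is a design decision for the table owner.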
