Is there any other approach for updating a row in BigQuery apart from overwriting the table?

Posted by 蹲街弑〆低调 on 2020-01-12 05:44:06

Question


I have package data with some of its fields as follows:

packageid--->string
status--->string
status_type--->string
scans--->record(repeated)
     scanid--->string
     status--->string
     scannedby--->string

Per day, I have data for 100,000 packages. The total package data size is about 100 MB per day, which comes to about 3 GB per month. For each package, 3-4 updates can come in. So do I have to overwrite the package table every time a package update (e.g. just a change in the status field) comes?

Suppose I have data for 3 packages in the table and an update for the 2nd package comes in. Do I have to overwrite the whole table? Deleting and re-adding the whole data takes 2 transactions per package update, so for 100,000 packages the total transactions would be on the order of 10^5 * 10^5 * 2 / 2.

Is there any other approach for atomic updates without overwriting the table? (If the table contains 1 million entries and a package update comes in, overwriting the whole table is a lot of overhead.)


Answer 1:


Currently there is no way to update individual rows. We do see this use case somewhat often, and we recommend something similar to what Mikhail suggested. Basically, if you have some unique ID for a logical row, and a timestamp for when the row data was updated, you can simply add every update as a new row, and apply a view over the table to give you the desired rows.

Your view would look something like this:

SELECT *
FROM (
  SELECT
      *,
      MAX(<timestamp_column>)
          OVER (PARTITION BY <id_column>)
          AS max_timestamp
  FROM <table>
)
WHERE <timestamp_column> = max_timestamp

(cribbed from "Return only the newest rows from a BigQuery table with duplicate items")
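
The view above is written in BigQuery's legacy SQL dialect. If you are on today's Standard SQL, an equivalent view can be written with ROW_NUMBER(). A minimal sketch, assuming a hypothetical dataset mydataset and using packageid plus a hypothetical updated_at timestamp column as the <id_column> / <timestamp_column> placeholders:

-- Standard SQL: keep only the newest row per package.
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY packageid ORDER BY updated_at DESC) AS rn
  FROM `mydataset.TABLE_RAW`
)
WHERE rn = 1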

If your table is partitioned into daily tables (or becomes static after some period), you can then replace the view with the result of the view query after the table stabilizes, and improve your query efficiency.

e.g.

  • Add data to TABLE_RAW.
  • Create a view TABLE that performs the above query over TABLE_RAW.
  • At some point after TABLE_RAW is stable, materialize the view: run the view query with a destination table of TABLE_RAW (a view itself cannot be a destination table) and write disposition WRITE_TRUNCATE, as sketched after this list.
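
In today's Standard SQL, that materialization step can be written as a single DDL statement. A minimal sketch, again assuming the TABLE / TABLE_RAW names above live in a hypothetical dataset mydataset:

-- Rewrite TABLE_RAW in place with the deduplicated rows produced by the view.
-- The view TABLE keeps working unchanged on top of the now-smaller table.
CREATE OR REPLACE TABLE `mydataset.TABLE_RAW` AS
SELECT * FROM `mydataset.TABLE`;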

Unfortunately, this does add a bit of overhead. That said, for your use case you may be able to just leave the view in place indefinitely, which would simplify things a bit.




Answer 2:


You cannot update a row in a BigQuery table; you can only add one. Overwriting the table on each and every transaction kind of doesn't make sense at all from any perspective.

I would suggest just adding each and every transaction as a new row.

Meantime, if for any reason (storage cost, query cost, query performance, etc.) you want to dedup, you can do a batch dedup periodically, say daily. In this case, having the original data partitioned into daily tables is beneficial: at any moment you only need the latest deduped table and the most recent daily table to query the latest transactions, and previous days' daily tables can be deleted if you are worried about storage cost. A sketch of this daily dedup follows.
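
A minimal sketch of that daily batch dedup in Standard SQL, assuming hypothetical names: packages_deduped for the latest deduped table, packages_20160101 for the most recent daily table, and the packageid / updated_at columns from above:

-- Rebuild the deduped table from the previous dedup plus the newest daily table;
-- per package, only the row with the latest update survives.
CREATE OR REPLACE TABLE `mydataset.packages_deduped` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY packageid ORDER BY updated_at DESC) AS rn
  FROM (
    SELECT * FROM `mydataset.packages_deduped`
    UNION ALL
    SELECT * FROM `mydataset.packages_20160101`
  )
)
WHERE rn = 1;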



Source: https://stackoverflow.com/questions/34992649/is-there-any-other-approach-for-updating-a-row-in-big-query-apart-from-overwriti
