Bigtable vs BigQuery use case for time-series data

日久生厌 2021-02-10 10:09

I am looking to decide between Bigtable and BigQuery for my use case of time-series data.

I have gone through https://cloud.google.com/bigtable/docs/schema-design-time-series

1 Answer
  • 2021-02-10 10:47

    (I'm an engineer on the Cloud Bigtable Team)

    As you've discovered from our docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. Using visitorKey + cookie as a prefix before the timestamp sounds to me like it would avoid hotspotting issues, as there are almost certainly many more visitors to your site than there would be nodes in your cluster. Bigtable serves these sorts of time-series use cases all the time!
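
    As a minimal sketch of that key design, assuming the Python google-cloud-bigtable client: the project, instance, table, and column-family names below, as well as the visitor/cookie values, are placeholders for illustration only.

```python
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("my-instance").table("visits")

visitor_key = "visitor123"   # hypothetical identifiers
cookie = "cookieABC"
event_time = datetime.datetime.now(datetime.timezone.utc)

# Prefixing with visitorKey + cookie spreads writes across the key space, so
# monotonically increasing timestamps don't all land on one node (no hotspotting).
row_key = f"{visitor_key}#{cookie}#{event_time.isoformat()}".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("events", "page", b"/checkout", timestamp=event_time)
row.commit()
```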

    However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:

    • Are you planning to perform lots of ad hoc queries like "SELECT A FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery. Bigtable can't support this query without performing a full table scan. And in general Bigtable is geared more towards streaming back a simple subset of the data quickly, say, to a Dataflow job, rather than embedding complex processing in the queries themselves (a sketch of this difference follows the list).
    • Will you require multi-row OLTP transactions? Again, use BigQuery, as Bigtable only supports transactions within a single row.
    • Are you streaming in new events at high QPS? Bigtable is much better for these sorts of high-volume updates. Remember that Bigtable's original purpose was as a random access sink for web crawler updates in Google's search index!
    • Do you want to perform any sort of large-scale complex transformations on the data? Again, Bigtable is likely better here, as you can stream data out and back in faster and let custom business logic in a Dataflow job do whatever you want to it.
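
    To make the first bullet concrete, here is a small sketch, again assuming the Python Bigtable client and placeholder names, of the access pattern Bigtable handles well (a contiguous scan over a key prefix) versus filtering on a non-key column, which has no index to fall back on.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("visits")

# Efficient: every event for one visitor+cookie sits in one contiguous key
# range, so Bigtable streams back just that slice of the table. The timestamp
# portion of the key is ASCII, so b"\xff" is a safe upper bound for the prefix.
prefix = b"visitor123#cookieABC#"
for row in table.read_rows(start_key=prefix, end_key=prefix + b"\xff"):
    print(row.row_key)

# Not efficient: the equivalent of "SELECT A FROM t WHERE B = x" on a non-key
# column has no index to use, so Bigtable would have to scan the whole table
# (read_rows() with no key range) and filter client-side.
```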

    You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time, but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in some regions: https://cloud.google.com/bigquery/external-data-bigtable
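
    As a rough sketch of that hybrid pattern, the snippet below scans rows out of Bigtable and loads a post-processed copy into BigQuery. The answer suggests Dataflow for the real job; the plain client libraries are used here only to keep the example short, and every project, table, and column name is a placeholder (the BigQuery table is assumed to already exist with a matching schema).

```python
from google.cloud import bigquery, bigtable

bt_table = bigtable.Client(project="my-project").instance("my-instance").table("visits")
bq_client = bigquery.Client(project="my-project")

rows_to_load = []
for row in bt_table.read_rows():  # in practice, restrict to a recent key range
    cells = row.cells.get("events", {}).get(b"page")
    if not cells:
        continue
    latest = cells[0]  # cells in a column come back newest-first
    rows_to_load.append({
        "row_key": row.row_key.decode("utf-8"),
        "page": latest.value.decode("utf-8"),
        "event_time": latest.timestamp.isoformat(),
    })

# Stream the post-processed events into a BigQuery table for ad hoc SQL.
if rows_to_load:
    errors = bq_client.insert_rows_json("my-project.analytics.visit_events", rows_to_load)
    if errors:
        raise RuntimeError(errors)
```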
