Cloud SQL to BigQuery incrementally

Submitted by 偶尔善良 on 2020-02-22 22:40:27

Question


I need some suggestions for one of the use cases I am working on.

Use Case:

We have around 5-10 tables of data in Cloud SQL; some are treated as lookup tables and others as transactional. We need to get this data into BigQuery and shape it into 3-4 tables (flattened, nested, or denormalized) that will be used for reporting in Data Studio, Looker, etc.

Data should be processed incrementally, and changes in Cloud SQL can happen every 5 minutes, which means the data should be available for BigQuery reporting in near real time. If possible, we should also maintain a history of data changes.

Solution:

We keep three layers in BigQuery: data from Cloud SQL lands in the first layer; after flattening, it is saved to the second layer (which maintains history); and it is then processed and inserted into the third layer (reporting). The problem with this approach is that it was built for daily processing and cannot handle data in Cloud SQL changing every 5 minutes.
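For illustration, a minimal sketch of what the second-layer (history) load looks like in spirit; the table and column names (`layer1.orders_flat`, `layer2.orders_history`, and the columns) are hypothetical, not our actual schema:

-- Append a timestamped snapshot of the flattened data to the history layer.
-- Table and column names are hypothetical.
INSERT INTO `layer2.orders_history` (order_id, status, snapshot_ts)
SELECT
  order_id,
  status,
  CURRENT_TIMESTAMP() AS snapshot_ts
FROM `layer1.orders_flat`;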

Any suggestion is welcome.

Note: we would need to use Dataflow, a VM, or some other means for part of the logic processing; we cannot rely only on scheduled queries, because not all of the logic can be expressed in a query, although a mix of views and code can be used.


Answer 1:


This is more straightforward than you might think: BigQuery can read data straight out of your Cloud SQL database.

Docs:

  • https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries

More reading:

  • https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229

So all you need to do is write a MySQL query that finds all the changes from the last 5 minutes. You can run that query from within BigQuery and materialize the results into BigQuery with a single statement like:

INSERT INTO `dataset.table`
SELECT *
FROM EXTERNAL_QUERY(
  'your.CloudSql.database',
  'SELECT * FROM users');
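To make the pull incremental, the inner MySQL query can filter on a change timestamp and the BigQuery side can use MERGE instead of INSERT so that re-pulled rows are updated in place. A minimal sketch, assuming a hypothetical `users` table with `id`, `name`, and `updated_at` columns (adjust to your schema and connection name):

-- Upsert only rows changed in the last 5 minutes; column names are assumptions.
MERGE `dataset.table` AS t
USING (
  SELECT *
  FROM EXTERNAL_QUERY(
    'your.CloudSql.database',
    'SELECT id, name, updated_at FROM users WHERE updated_at >= NOW() - INTERVAL 5 MINUTE')
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET name = s.name, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, name, updated_at) VALUES (s.id, s.name, s.updated_at);

Run something like this every few minutes (for example from Cloud Scheduler or your own orchestration) and the target table stays close to real time.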



Answer 2:


That's an interesting use case! If BQ external queries are not your jam, we recently built a solution to 'sync' tables from MySQL with tables from BigQuery.

The solution works by using two components:

  1. A connector developed with Debezium to publish changes from MySQL to PubSub
  2. A Dataflow pipeline that consumes those changes from PubSub, and pushes them to BigQuery

Refer to this README to figure out how to deploy it: https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/master/v2/cdc-parent/
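Conceptually, the pipeline lands change records from Pub/Sub in BigQuery and applies them to a replica table; the effect is roughly a MERGE that keeps only the latest change per primary key. The sketch below shows that idea only, not the template's actual SQL or schema (`changelog.users`, `replica.users`, `change_ts`, and the columns are all hypothetical):

-- Apply the latest change event per primary key to the replica table.
-- All table and column names here are illustrative.
MERGE `replica.users` AS t
USING (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT c.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_ts DESC) AS rn
    FROM `changelog.users` AS c
  )
  WHERE rn = 1
) AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET name = s.name, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, name, updated_at) VALUES (s.id, s.name, s.updated_at);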

Let me know if it's helpful!



Source: https://stackoverflow.com/questions/59941821/cloud-sql-to-bigquery-incrementally
