Google Dataflow: insert + update in BigQuery in a streaming pipeline

问题

The main object

A python streaming pipeline in which I read the input from pub/sub.

After the input is analyzed, two option are available:

If x=1 -> insert
If x=2 -> update

Testing

This can not be done using apache beam function, so you need to develop it using the 0.25 API of BigQuery (currently this is the version supported in Google Dataflow)

The problem

The inserted record are still in the BigQuery buffer, so the update statement fail:

     UPDATE or DELETE statement over table table would affect rows in the streaming buffer, which is not supported

The code

Insert

def insertCanonicalBQ(input):
    from google.cloud import bigquery
    client = bigquery.Client(project='project')
    dataset = client.dataset('dataset')
    table = dataset.table('table' )
    table.reload()
    table.insert_data(
        rows=[[values]])

Update

def UpdateBQ(input):
    from google.cloud import bigquery
    import uuid
    import time
    client = bigquery.Client()
    STD= "#standardSQL"
    QUERY= STD + "\n" + """UPDATE table SET field1 = 'XXX' WHERE field2=  'YYY'"""
    client.use_legacy_sql = False    
    query_job = client.run_async_query(query=QUERY, job_name='temp-query-job_{}'.format(uuid.uuid4()))  # API request
    query_job.begin()
    while True:
         query_job.reload()  # Refreshes the state via a GET request.
         if query_job.state == 'DONE':
             if query_job.error_result:
                 raise RuntimeError(query_job.errors)
             print "done"
             return input
             time.sleep(1)

回答1:

Even if the row wasn't in the streaming buffer, this still wouldn't be the way to approach this problem in BigQuery. BigQuery storage is better suited for bulk mutations rather than mutating individual entities like this via UPDATE. Your pattern is aligned with something I'd expect from an transactional rather than analytical use case.

Consider an append-based pattern for this. Each time you process an entity message write it to BigQuery via streaming insert. Then, when needed you can get the latest version of all entities via a query.

As an example, let's assume an arbitrary schema: idfield is your unique entity key/identifier, and message_time represents the time the message was emitted. Your entities may have many other fields. To get the latest version of the entities, we could run the following query (and possibly write this to another table):

#standardSQL
SELECT
  idfield,
  ARRAY_AGG(
    t ORDER BY message_time DESC LIMIT 1
  )[OFFSET(0)].* EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
GROUP BY idfield

An additional advantage of this approach is that it also allows you to perform analysis at arbitrary points of time. To perform an analysis of the entities as of their state an hour ago would simply involve adding a WHERE clause: WHERE message_time <= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)

来源：https://stackoverflow.com/questions/53535840/google-dataflow-insert-update-in-bigquery-in-a-streaming-pipeline

标签

python

google-bigquery

google-cloud-dataflow

apache-beam