Question
The main objective
A Python streaming pipeline that reads its input from Pub/Sub.
After the input is analyzed, two options are available:
- If x=1 -> insert
- If x=2 -> update
Testing
- This cannot be done with the built-in Apache Beam functions, so it has to be implemented with version 0.25 of the BigQuery client API (currently the version supported in Google Dataflow).
The problem
The inserted records are still in the BigQuery streaming buffer, so the UPDATE statement fails:
UPDATE or DELETE statement over table table would affect rows in the streaming buffer, which is not supported
The code
Insert
def insertCanonicalBQ(input):
    from google.cloud import bigquery
    client = bigquery.Client(project='project')
    dataset = client.dataset('dataset')
    table = dataset.table('table')
    table.reload()
    table.insert_data(
        rows=[[values]])
Update
def UpdateBQ(input):
    from google.cloud import bigquery
    import uuid
    import time
    client = bigquery.Client()
    STD = "#standardSQL"
    QUERY = STD + "\n" + """UPDATE table SET field1 = 'XXX' WHERE field2= 'YYY'"""
    client.use_legacy_sql = False
    query_job = client.run_async_query(query=QUERY, job_name='temp-query-job_{}'.format(uuid.uuid4()))  # API request
    query_job.begin()
    while True:
        query_job.reload()  # Refreshes the state via a GET request.
        if query_job.state == 'DONE':
            if query_job.error_result:
                raise RuntimeError(query_job.errors)
            print "done"
            return input
        time.sleep(1)
Answer 1:
Even if the row weren't in the streaming buffer, this still wouldn't be the way to approach this problem in BigQuery. BigQuery storage is better suited to bulk mutations than to mutating individual entities like this via UPDATE. Your pattern is aligned with something I'd expect from a transactional rather than an analytical use case.
Consider an append-based pattern for this. Each time you process an entity message, write it to BigQuery via streaming insert. Then, when needed, you can get the latest version of all entities via a query.
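A minimal sketch of the append step, assuming each Pub/Sub message has already been decoded into a dict. The helper name `to_append_row`, the flag key `x`, and the column names are illustrative assumptions, not part of the original pipeline. Both the x=1 and x=2 cases become plain appends; the resulting row can then be streamed with whichever streaming-insert call your client library version offers (the question uses `table.insert_data` from the 0.25 API).

```python
from datetime import datetime, timezone


def to_append_row(message, emitted_at=None):
    """Turn a decoded message into a row for an append-only table.

    Hypothetical helper: `message` is assumed to be a dict containing
    an `idfield` key plus arbitrary entity fields.
    """
    if emitted_at is None:
        emitted_at = datetime.now(timezone.utc)
    row = dict(message)   # copy so the caller's dict is left untouched
    row.pop("x", None)    # the insert/update flag is no longer needed
    row["message_time"] = emitted_at.isoformat()
    return row


# Both an "insert" (x=1) and an "update" (x=2) produce an appended row.
first = to_append_row(
    {"idfield": "a", "x": 1, "field1": "old"},
    datetime(2018, 11, 29, 10, 0, tzinfo=timezone.utc),
)
second = to_append_row(
    {"idfield": "a", "x": 2, "field1": "XXX"},
    datetime(2018, 11, 29, 11, 0, tzinfo=timezone.utc),
)
```

Because nothing is ever updated in place, the streaming-buffer restriction on UPDATE and DELETE no longer applies to the pipeline.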
As an example, let's assume an arbitrary schema: idfield is your unique entity key/identifier, and message_time represents the time the message was emitted. Your entities may have many other fields. To get the latest version of the entities, we could run the following query (and possibly write this to another table):
#standardSQL
SELECT
  idfield,
  ARRAY_AGG(
    t ORDER BY message_time DESC LIMIT 1
  )[OFFSET(0)].* EXCEPT (idfield)
FROM `myproject.mydata.mytable` AS t
GROUP BY idfield
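To illustrate what the query computes, here is a rough in-memory equivalent in plain Python. It is only a sketch of the semantics (one newest row per idfield), not a replacement for running the query in BigQuery; the function name `latest_per_entity` and the sample rows are made up for illustration.

```python
def latest_per_entity(rows):
    """Return the newest row for each idfield, keyed by idfield."""
    latest = {}
    for row in rows:
        key = row["idfield"]
        current = latest.get(key)
        # Keep the row with the greatest message_time seen so far.
        if current is None or row["message_time"] > current["message_time"]:
            latest[key] = row
    return latest


# Hypothetical appended rows: entity "a" was written twice.
rows = [
    {"idfield": "a", "message_time": 1, "field1": "old"},
    {"idfield": "a", "message_time": 3, "field1": "XXX"},
    {"idfield": "b", "message_time": 2, "field1": "only"},
]
result = latest_per_entity(rows)
```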
An additional advantage of this approach is that it also allows you to perform analysis at arbitrary points in time. Analyzing the entities as of their state an hour ago would simply involve adding a WHERE clause: WHERE message_time <= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
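As a sketch of where that clause goes, the helper below assembles the full point-in-time query as a string, with the WHERE placed between FROM and GROUP BY. The function name `latest_as_of_query` is hypothetical, and the table name is assumed to come from trusted configuration because it is interpolated directly into the SQL.

```python
def latest_as_of_query(table, hours_ago):
    """Build the latest-version query limited to messages at least N hours old."""
    hours = int(hours_ago)  # coerce to int so only a number is interpolated
    return (
        "#standardSQL\n"
        "SELECT\n"
        "  idfield,\n"
        "  ARRAY_AGG(\n"
        "    t ORDER BY message_time DESC LIMIT 1\n"
        "  )[OFFSET(0)].* EXCEPT (idfield)\n"
        "FROM `{table}` AS t\n"
        "WHERE message_time <= "
        "TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours} HOUR)\n"
        "GROUP BY idfield"
    ).format(table=table, hours=hours)


# State of every entity as of one hour ago.
query = latest_as_of_query("myproject.mydata.mytable", 1)
```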
Source: https://stackoverflow.com/questions/53535840/google-dataflow-insert-update-in-bigquery-in-a-streaming-pipeline