Question
I have a use case where I need to ingest data from a Google Cloud Storage bucket via Dataflow as soon as it becomes available as a new file in the bucket.
How do I trigger the execution of the Dataflow job as soon as a new file is added to the storage bucket?
Answer 1:
If your pipelines are written in Java, then you can use Cloud Functions and Dataflow templating.
I'm going to assume you're using the 1.x SDK (it's also possible with 2.x).
- Write your pipeline and specify "TemplatingDataflowPipelineRunner" as the runner, so that running it stages a reusable template instead of launching a job.
- Write a Cloud Function that is set up to listen for and react to new objects (in this case CSV files) that arrive in your bucket.
- The Cloud Function kicks off the Dataflow pipeline and passes the name of the new file to it as a parameter.
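To make the last two steps more concrete, here is a minimal sketch in Python of what the Cloud Function has to assemble when a new object arrives. It only builds the request for the Dataflow `projects.templates.launch` REST method and does not send it. The project ID, template path, the `inputFile` parameter name, and the `build_launch_request` helper are all assumptions for illustration; the parameter name in particular must match whatever option your own template declares.

```python
import re
import time

# Hypothetical values for illustration; replace with your own.
PROJECT_ID = "my-project"
TEMPLATE_PATH = "gs://my-bucket/templates/my-template"


def build_launch_request(event, project_id=PROJECT_ID,
                         template_path=TEMPLATE_PATH, now=None):
    """Build the URL, query parameters and JSON body for templates.launch.

    `event` is the Cloud Storage event payload, which carries the bucket
    and object name of the file that was just written.
    """
    bucket = event["bucket"]
    name = event["name"]
    input_file = f"gs://{bucket}/{name}"

    # Conservative choice: keep the job name short and limited to
    # lowercase letters, digits and hyphens.
    stem = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")[:20].strip("-")
    timestamp = int(time.time() if now is None else now)
    job_name = f"ingest-{stem or 'file'}-{timestamp}"

    url = (
        "https://dataflow.googleapis.com/v1b3/"
        f"projects/{project_id}/templates:launch"
    )
    params = {"gcsPath": template_path}  # where the staged template lives
    body = {
        "jobName": job_name,
        # "inputFile" is an assumed parameter name defined by the template.
        "parameters": {"inputFile": input_file},
    }
    return url, params, body
```

Inside the function you would send `body` as an authenticated HTTP POST to `url` with `params` as the query string, for example through a Google API client library. Appending a timestamp to the job name is one simple way to avoid collisions when the same file name is uploaded more than once.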
See here for a walkthrough on how to build this pipeline. Full disclosure: I work for Shine.
Source: https://stackoverflow.com/questions/43786052/how-to-ingest-data-from-a-gcs-bucket-via-dataflow-as-soon-as-a-new-file-is-put-i