Question
I am trying to understand Dataflow staging and execution design. It seems like a primary use case is not being supported, but perhaps I am lacking a general understanding of the intended design.
My Goal: I want to execute my Dataflow pipeline at a regular interval as a bounded/batch job. I have an optional time range argument that allows me to run the same pipeline either for a specific historical backfill or on an hourly basis. This argument is supposed to update the BigQuery SQL query in the pipeline.
Result: In the pipeline code, I am attempting to generate a range argument based on the system time read when the scheduled job runs. Unfortunately, I have discovered that the time argument used for my BigQuery query is not updating. It seems to be stuck at the time when the job was first staged. The code in my pipeline run configuration only runs once, at staging time, and does not update on subsequent runs or when option values change.
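The behavior described above can be reproduced without Dataflow at all. The following is a minimal plain-Java analogy, not Beam code: a string built eagerly captures the clock once (like pipeline construction code that runs at staging time), while a `Supplier` reads the clock only when it is asked for a value. The class name, method names, and the `my_dataset.events` table are hypothetical. In Beam's classic templates the deferred role is played by the `ValueProvider` interface, which this sketch only imitates.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Plain-Java analogy: a value frozen at staging time vs. one resolved at run time. */
final class StagedVersusDeferred {

    /** Eager: the timestamp is baked into the string when this method is called. */
    static String eagerQuery(Instant now) {
        return "SELECT * FROM my_dataset.events WHERE ts >= '" + now + "'";
    }

    /** Deferred: the clock is read only when get() is called on the result. */
    static Supplier<String> deferredQuery(Supplier<Instant> clock) {
        return () -> "SELECT * FROM my_dataset.events WHERE ts >= '" + clock.get() + "'";
    }

    public static void main(String[] args) {
        // Stand-in for the system clock so the example is deterministic.
        AtomicReference<Instant> wallClock =
                new AtomicReference<>(Instant.parse("2020-04-30T10:00:00Z"));

        // "Staging": both queries are set up at 10:00.
        String staged = eagerQuery(wallClock.get());
        Supplier<String> deferred = deferredQuery(wallClock::get);

        // "Scheduled run" three hours later.
        wallClock.set(Instant.parse("2020-04-30T13:00:00Z"));

        System.out.println(staged);         // still filters on 10:00
        System.out.println(deferred.get()); // filters on 13:00
    }
}
```

The eager string corresponds to what the staged template holds; only values routed through a deferred mechanism can differ between runs.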
Solution Alternatives: There does not appear to be a way to modify the query each time the job is run. Or is there? I understand that there is the unbounded/windowed approach, but I can't find a documented way to achieve this with the BigQuery IO connector. In any case, that would only address time arguments. What if I wanted to run the same pipeline multiple times but with different query filters? Do I need to stage a separate template for each one?
For example, suppose I want to run a transform on all the records timestamped within each hour. I am currently using Cloud Scheduler to make an HTTP request directly to the Dataflow API every hour. Perhaps I need to create my own service endpoint that stages/updates the template each time?
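One way to picture the "own service endpoint" idea without re-staging anything is a small piece of code that computes the most recently completed hour and builds the JSON body for the Dataflow `templates.launch` request, which accepts a `jobName` and a `parameters` map. The sketch below is a hedged illustration using only the JDK. The parameter names `startTime` and `endTime`, the job name prefix, and the class itself are assumptions; the pipeline would have to declare matching runtime options for them to have any effect, and the HTTP call and authentication are left out.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Builds a per-run launch body for the most recently completed hour. */
final class HourlyLaunchRequest {

    /** Start of the last completed hour, e.g. 10:37 maps to 09:00. */
    static Instant windowStart(Instant now) {
        return now.truncatedTo(ChronoUnit.HOURS).minus(1, ChronoUnit.HOURS);
    }

    /** End (exclusive) of the last completed hour, e.g. 10:37 maps to 10:00. */
    static Instant windowEnd(Instant now) {
        return now.truncatedTo(ChronoUnit.HOURS);
    }

    /**
     * JSON body with a jobName and a parameters map. "startTime" and
     * "endTime" are hypothetical names for runtime options of the template.
     */
    static String launchBody(String jobPrefix, Instant now) {
        Instant start = windowStart(now);
        Instant end = windowEnd(now);
        // Suffix the job name with the window start so each run is distinguishable.
        String jobName = jobPrefix + "-" + start.getEpochSecond();
        return "{\"jobName\":\"" + jobName + "\","
                + "\"parameters\":{"
                + "\"startTime\":\"" + start + "\","
                + "\"endTime\":\"" + end + "\"}}";
    }

    public static void main(String[] args) {
        // Print the body that a scheduler-triggered service would POST.
        System.out.println(launchBody("hourly-transform", Instant.now()));
    }
}
```

The same helper could serve backfills by passing a historical `Instant` instead of the current time, so one staged template would cover both cases, provided the template reads these values at run time rather than at staging time.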
But what if I wanted to create multiple scheduled jobs that run the same staged template? What is the point of exposing options if I can't change them on each scheduled run, given that I can always change the configuration myself when I stage the template (using Maven, for example)? It seems unnecessarily static, right?
Source: https://stackoverflow.com/questions/61527719/dataflow-job-state-scheduling-and-options