Dataflow job state/Scheduling and Options

你离开我真会死。 submitted on 2020-05-17 07:04:28

Question


I am trying to understand Dataflow staging and execution design. It seems like a primary use case is not being supported, but perhaps I am lacking a general understanding of the intended design.

My Goal: I want to execute my Dataflow pipeline at a regular interval as a bounded/batch job. I have an optional time range argument that lets me run the same pipeline either for a specific historical backfill or on an hourly basis. This argument is supposed to update the BigQuery SQL query in the pipeline.

Result: In the pipeline code, I generate the range argument from the system time, expecting it to be read when each scheduled job runs. Unfortunately, the time argument used in my BigQuery query does not update; it appears to be stuck at the time the job was first staged. The code in my pipeline run configuration runs only once, at staging time, and does not update on subsequent runs or when option values change.
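To make the symptom concrete, here is a minimal plain-Java analogy (no Beam dependency). It assumes the query string is assembled while the pipeline is being constructed, which is when staging happens. `StagedQuery`, `buildQuery`, and the table name `my_dataset.events` are hypothetical names used only for illustration. In the Beam Java SDK, the deferred role played by the `Supplier` here is roughly what `ValueProvider` is meant for, but this sketch does not use it.

```java
import java.time.Instant;
import java.util.function.Supplier;

class StagingTimeDemo {

    /** Hypothetical stand-in for a staged template. */
    static final class StagedQuery {
        private final String eagerQuery;        // built once, when the object is constructed
        private final Supplier<Instant> clock;  // consulted again on every call to deferred()

        StagedQuery(Supplier<Instant> clock) {
            this.clock = clock;
            // Evaluated at "staging" time: the timestamp is frozen into the string.
            this.eagerQuery = buildQuery(clock.get());
        }

        String eager() {
            return eagerQuery;
        }

        String deferred() {
            // Evaluated at "run" time: the clock is read again.
            return buildQuery(clock.get());
        }
    }

    /** Hypothetical query; the table and column names are placeholders. */
    static String buildQuery(Instant end) {
        return "SELECT * FROM `my_dataset.events` WHERE ts < TIMESTAMP('" + end + "')";
    }

    public static void main(String[] args) {
        Instant[] now = {Instant.parse("2020-05-01T10:00:00Z")};
        StagedQuery staged = new StagedQuery(() -> now[0]); // "staged" at 10:00

        now[0] = Instant.parse("2020-05-01T11:00:00Z");     // a later scheduled run

        System.out.println(staged.eager());    // still filters on the 10:00 timestamp
        System.out.println(staged.deferred()); // filters on the 11:00 timestamp
    }
}
```

The eager string never changes after construction, which matches the "stuck on staging time" behaviour described above. Only a value that is resolved when the job actually runs can differ between runs.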

Solution Alternatives: There does not appear to be a way to modify the query each time the job is run. Or is there? I understand that there is the unbounded/windowed approach, but I can't find a documented way to achieve this with the BigQuery IO connector. In any case, that only addresses time arguments. What if I wanted to run the same pipeline multiple times with different query filters? Would I need to stage a separate template for each one?

For example, suppose I want to apply a transform to all the records timestamped within each hour. I am currently using Cloud Scheduler to make an HTTP request directly to the Dataflow API every hour. Perhaps I need to create my own service endpoint that stages/updates the template each time?
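As an illustration of the hourly case, the caller could compute the window itself and send it along with each launch request instead of relying on code inside the staged template. The sketch below is a hedged example, not a confirmed solution. It assumes the template accepts runtime parameters, and the parameter names `startTime` and `endTime`, the class `HourlyWindow`, and the job name prefix are all hypothetical. The `jobName` and `parameters` fields follow the request body shape of the Dataflow `templates:launch` REST method.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class HourlyWindow {
    final Instant start; // inclusive
    final Instant end;   // exclusive

    private HourlyWindow(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    /** Returns the most recent complete hour before {@code now}, as [start, end). */
    static HourlyWindow previousHour(Instant now) {
        Instant end = now.truncatedTo(ChronoUnit.HOURS);
        return new HourlyWindow(end.minus(Duration.ofHours(1)), end);
    }

    /**
     * Builds a JSON body for a template launch request.
     * "startTime" and "endTime" are hypothetical parameter names; the staged
     * template would have to declare matching runtime options.
     */
    String toLaunchBody(String jobNamePrefix) {
        return String.format(
            "{\"jobName\":\"%s-%d\",\"parameters\":{\"startTime\":\"%s\",\"endTime\":\"%s\"}}",
            jobNamePrefix, start.getEpochSecond(), start, end);
    }

    public static void main(String[] args) {
        // Computed when the scheduled request is sent, not when the template is staged.
        HourlyWindow window = previousHour(Instant.now());
        System.out.println(window.toLaunchBody("hourly-transform"));
    }
}
```

A backfill would use the same body with an explicit historical `start` and `end`, so one staged template could in principle serve both the hourly and the backfill case. Cloud Scheduler sends a fixed body, so something like the custom service endpoint mentioned above would still be needed to compute the window per run.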

But what if I wanted to create multiple scheduled jobs that run the same staged template? What is the point of exposing options if I can't change them on each scheduled run? I can always change the configuration myself at the time I stage the template (using Maven, for example). This seems unnecessarily static, right?

Source: https://stackoverflow.com/questions/61527719/dataflow-job-state-scheduling-and-options
