schedule_interval and other gotchas with SubDagOperator

Submitted by 为君一笑 on 2019-12-11 06:04:21

Question


The Airflow documentation clearly states:

SubDAGs must have a schedule and be enabled. If the SubDAG’s schedule is set to None or @once, the SubDAG will succeed without having done anything

Although we must stick to the documentation, I've found that SubDAGs work without a hiccup even with schedule_interval set to None or @once. Here's my working example.


My current understanding (I heard about Airflow only 2 weeks back) of SubDagOperators (or subdags) is

  • Airflow treats a subdag as just another task
  • They can cause deadlock but easy workarounds exist

My questions are

  • Why does my example work when it shouldn't?
  • Why shouldn't my example work (as per the docs) in the first place?
  • Are there any subtle differences between the behaviour of SubDagOperator and other operators?
  • When solutions to the known problems exist, why is there so much uproar against SubDagOperators?

I'm using puckel/docker-airflow with

  • Airflow 1.9.0-4
  • Python 3.6-slim
  • CeleryExecutor with redis:3.2.7

Answer 1:


If you are just running your DAG once, then you probably won't have any issues with SubDags (as in your example), especially if you have a bunch of worker slots available. Try letting a few DagRuns of your example accumulate, then delete and re-run some of them and see whether everything still runs smoothly.
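To see why worker slots matter, here is a rough toy model in plain Python. It is not Airflow code and does not reproduce the real scheduler. It only assumes what is commonly described for Airflow 1.x: a SubDagOperator task holds a worker slot for as long as its child tasks are unfinished, and each child task needs a slot of its own. The function name `all_runs_finish` and its parameters are made up for illustration, and the "parents grab slots first" ordering is a deliberate worst case.

```python
def all_runs_finish(worker_slots, dag_runs, children_per_subdag):
    """Toy model: return True if every DagRun completes, False on deadlock.

    Assumptions (hypothetical, for illustration only):
    - each DagRun consists of one SubDagOperator task;
    - that task occupies one worker slot until all its children are done;
    - each child task occupies one free slot for one tick.
    """
    free = worker_slots
    queued_parents = dag_runs   # SubDagOperator tasks not started yet
    active = []                 # children still pending per running parent
    finished = 0

    while finished < dag_runs:
        progressed = False

        # Worst-case ordering: queued SubDagOperator tasks take slots first.
        while queued_parents > 0 and free > 0:
            queued_parents -= 1
            free -= 1
            active.append(children_per_subdag)
            progressed = True

        # Every remaining free slot can run one child task during this tick.
        budget = free
        for i in range(len(active)):
            while budget > 0 and active[i] > 0:
                active[i] -= 1
                budget -= 1
                progressed = True

        # Parents whose children are all done give their slot back.
        still_active = [left for left in active if left > 0]
        released = len(active) - len(still_active)
        if released:
            progressed = True
        free += released
        finished += released
        active = still_active

        if not progressed:
            return False  # all slots held by waiting parents: deadlock
    return True


if __name__ == "__main__":
    print(all_runs_finish(worker_slots=4, dag_runs=1, children_per_subdag=3))
    print(all_runs_finish(worker_slots=4, dag_runs=4, children_per_subdag=3))
```

In this model a single run on four slots finishes, while four runs starting together on the same four slots stall, because every slot is held by a parent waiting on children that can never be scheduled. That is the kind of situation that only shows up once several DagRuns pile up.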

The community has advised moving away from SubDags because unexpected behavior starts happening when you need to re-run old DagRuns or run bigger backfills.

It is not so much that the DAG won't work; rather, unexpected behavior can occur and affect your workflows, and that risk isn't worth taking when all you get in return is a nicer-looking DAG.

Even though known solutions exist, implementing them may not be worth the effort.



Source: https://stackoverflow.com/questions/51301763/schedule-interval-and-other-gotchas-with-subdagoperator
