Question
I have a DAG that has been running for a few months, and for the last week it has been behaving abnormally. I am running a BashOperator that executes a shell script, and the shell script runs a Hive query. The number of retries is set to 4, as below.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 4,
    'retry_delay': timedelta(minutes=5),
}
I can see in the log that the task triggers the Hive query, then loses heartbeats after some time (around 5 to 6 minutes) and goes for a retry. YARN shows that the query has not finished yet, but Airflow has already triggered the next attempt, so now two queries are running in YARN for the same task (one from the first attempt and one from the retry). In the same way the DAG ends up triggering 5 queries for the same task (since retries is 4) and finally shows a failed status. The interesting point is that this same DAG had been running fine for a long time, and the issue now affects all the Hive-related DAGs in production. Today I upgraded to the latest version of Airflow, v1.10.9. I am using the LocalExecutor in this case.
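For reference, the task is wired up roughly like this (DAG id, schedule, start date, and script path are placeholders, simplified from the real DAG):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='hive_load_example',          # placeholder
    default_args=default_args,           # as defined above
    start_date=datetime(2020, 1, 1),     # placeholder
    schedule_interval='@daily',          # placeholder
)

run_hive_query = BashOperator(
    task_id='run_hive_query',
    # trailing space keeps Airflow from treating the .sh path as a Jinja template file
    bash_command='/path/to/run_hive_query.sh ',
    dag=dag,
)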
Has anyone faced a similar issue?
Answer 1:
The Airflow UI doesn't initiate retries on its own, irrespective of whether it's connected to the backend DB or not. It seems like your task executors are going Zombie; in that case the Scheduler's Zombie detection kicks in and calls the task instance's (TI's) handle_failure method. So in a nutshell, you can override that method in your DAG and add some logging to see what's happening; in fact, you should be able to query the Hadoop ResourceManager, check the status of your job, and make a decision accordingly, including cancelling the retries.
For example, see this code, which I wrote to handle Zombie failures only.
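As a rough illustration of the same idea (using the standard on_failure_callback / on_retry_callback task arguments rather than overriding handle_failure directly), a sketch might look like the following; the YARN lookup is a hypothetical helper against the ResourceManager REST API, and the RM host is a placeholder:

import logging
import requests

log = logging.getLogger(__name__)

YARN_RM = 'http://your-resourcemanager:8088'   # placeholder ResourceManager address

def yarn_app_still_running(name_fragment):
    # Hypothetical helper: return True if a RUNNING YARN application's name
    # contains the given fragment (e.g. the task_id or a query tag).
    resp = requests.get(YARN_RM + '/ws/v1/cluster/apps', params={'states': 'RUNNING'})
    apps = (resp.json().get('apps') or {}).get('app') or []
    return any(name_fragment in (app.get('name') or '') for app in apps)

def log_possible_zombie(context):
    # Failure/retry callback: log enough detail to tell a real query failure
    # apart from a lost-heartbeat (Zombie) failure.
    ti = context['task_instance']
    log.error('%s.%s marked failed on try %s', ti.dag_id, ti.task_id, ti.try_number)
    if yarn_app_still_running(ti.task_id):
        log.warning('A YARN application matching this task is still RUNNING; '
                    'the failure was likely a lost heartbeat, not the query itself.')

default_args = {
    # ... existing args from the question ...
    'on_retry_callback': log_possible_zombie,     # fires each time the task is set up_for_retry
    'on_failure_callback': log_possible_zombie,   # fires on the final failure
}

Whether you then kill the duplicate YARN application or stop the retries is up to you; the point is to surface what the Scheduler's Zombie detection is doing before another attempt is launched.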
Source: https://stackoverflow.com/questions/61124982/airflow-dag-is-running-for-all-the-retries