Question
I have a single-container Airflow setup running on Marathon, using the LocalExecutor. I have a health check that pings the /health endpoint on the Airflow webserver. The container currently has 5 CPUs allocated to it, and the webserver is running 4 Gunicorn workers. Last night I had about 25 tasks running concurrently. This caused the health check to fail without a helpful error message; the container just received a SIGTERM. Can anyone suggest a likely culprit for the health check failure? Was it CPU contention? Did I not create enough Gunicorn workers to keep responding to the health check request? I have a few ideas, but I'm not certain of the cause.
Here's the health check configuration in Marathon:
[
  {
    "gracePeriodSeconds": 300,
    "intervalSeconds": 60,
    "timeoutSeconds": 20,
    "maxConsecutiveFailures": 3,
    "portIndex": 0,
    "path": "/admin/",
    "protocol": "HTTP",
    "ignoreHttp1xx": false
  }
]
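Note that the prose above mentions /health, while this configuration probes /admin/, which renders the full admin UI and queries the metadata database on every check, so it is a much heavier target than the lightweight /health status endpoint (available since Airflow 1.10.2). Assuming the installed version exposes /health, a sketch of a cheaper check could look like the following; the timeout value is an illustrative assumption, not a recommendation (JSON does not allow comments, so all caveats live in this paragraph):

[
  {
    "gracePeriodSeconds": 300,
    "intervalSeconds": 60,
    "timeoutSeconds": 30,
    "maxConsecutiveFailures": 3,
    "portIndex": 0,
    "path": "/health",
    "protocol": "HTTP",
    "ignoreHttp1xx": false
  }
]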
Answer 1:
Yep, I have seen similar issues before. Would it be possible to migrate away from the LocalExecutor and a single-node Airflow setup?
If not, it's a case of vertically scaling your instance so it can handle web requests during periods of heavy compute demand from the tasks and the scheduler.
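If staying on a single node, the usual knobs are more Gunicorn workers for the webserver and a lower cap on concurrent task slots, so task processes cannot starve the web workers of CPU. Since the container runs on Marathon, these can be set through Airflow's AIRFLOW__SECTION__KEY environment-variable convention in the app definition's env block; the values below are illustrative assumptions for this setup, not tested recommendations:

"env": {
  "AIRFLOW__WEBSERVER__WORKERS": "8",
  "AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT": "120",
  "AIRFLOW__CORE__PARALLELISM": "16"
}

Capping [core] parallelism trades throughput for headroom: with 5 CPUs shared by the scheduler, roughly 25 task processes, and 4 Gunicorn workers, a single health-check request can plausibly miss a 20-second timeout under load.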
Source: https://stackoverflow.com/questions/62684593/airflow-health-checks-fails-when-too-many-tasks-are-running