I have a production setup for running celery workers for making a POST / GET request to remote service and storing result, It is handling load around 20k tasks per 15 min.
If your celery worker get stuck sometimes, you can use strace & lsof
to find out at which system call it get stuck.
For example:
$ strace -p 10268 -s 10000
Process 10268 attached - interrupt to quit
recvfrom(5,
10268 is the pid of celery worker, recvfrom(5
means the worker stops at receiving data from file descriptor.
Then you can use lsof
to check out what is 5
in this worker process.
lsof -p 10268
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
......
celery 10268 root 5u IPv4 828871825 0t0 TCP 172.16.201.40:36162->10.13.244.205:wap-wsp (ESTABLISHED)
......
It indicates that the worker get stuck at a tcp connection(you can see 5u
in FD
column).
Some python packages like requests
is blocking to wait data from peer, this may cause celery worker hangs, if you are using requests
, please make sure to set timeout
argument.
Have you seen this page:
https://www.caktusgroup.com/blog/2013/10/30/using-strace-debug-stuck-celery-tasks/