Understanding Celery Queue status

On one of our servers the queue appears to have stalled and needed a restart. Hitting /admin/system/check_services produces this:

```
FAILURE (Took   0.02s) celery         : analytics_queue is delayed for 1:00:28.244477 (max allowed is 0:30:00)
async_restore_queue has been blocked for 0:14:46.044439 (max allowed is 0:01:00)
background_queue has been blocked for 0:14:46.054863 (max allowed is 0:10:00)
case_import_queue has been blocked for 0:14:46.085465 (max allowed is 0:01:00)
case_rule_queue is delayed for 1:00:28.205859 (max allowed is 1:00:00)
celery has been blocked for 0:14:46.066732 (max allowed is 0:01:00)
celery_periodic has been blocked for 0:14:53.330107 (max allowed is 0:10:00)
email_queue has been blocked for 0:14:53.321611 (max allowed is 0:00:30)
export_download_queue has been blocked for 0:14:53.328948 (max allowed is 0:00:30)
repeat_record_queue is delayed for 1:00:28.188084 (max allowed is 1:00:00)
ucr_queue is delayed for 14:33:51.099824 (max allowed is 1:00:00)
```

I checked it less than 10 minutes ago and it reported OK for celery.
What does that message actually mean?

/serverup.txt, /serverup.txt?heartbeat, and /serverup.txt?only=celery all report success, and `sudo supervisorctl status` reports that all workers are running.

What can I assume from check_services under the circumstances? Should I be restarting Celery?
Also, what is the difference between /serverup.txt, /serverup.txt?heartbeat and /serverup.txt?only=celery?

So the reason for the blocked queues appears to be the corehq.apps.data_analytics.tasks.build_last_month_MALT task.
Once that task was revoked, the system appears to be back to normal.

Is there a log I can look at to see what could have been the cause of the holdup?

Also, what is the difference between /serverup.txt, /serverup.txt?heartbeat and /serverup.txt?only=celery?

You can find an explanation here.

The celery check is defined here. Most of it is built on the "heartbeat" system, where we periodically queue a tiny monitoring task on each celery queue. If it's been a while since any of these tasks have succeeded, the queue is considered "blocked". If there's a long gap between when the most recent task was queued and when it was executed, the queue is considered "delayed".
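The blocked-vs-delayed distinction can be sketched roughly like this (a minimal illustration, not CommCare HQ's actual implementation; the function name and threshold values here are made up for the example):

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; the real values are configured per queue.
MAX_BLOCKED = timedelta(minutes=10)
MAX_DELAYED = timedelta(hours=1)

def check_queue(now, last_success, last_queued, last_executed):
    """Classify a queue from its heartbeat-task timestamps.

    - "blocked": no heartbeat task on this queue has succeeded recently.
    - "delayed": the latest heartbeat ran, but long after it was queued.
    """
    if now - last_success > MAX_BLOCKED:
        return f"blocked for {now - last_success} (max allowed is {MAX_BLOCKED})"
    if last_executed - last_queued > MAX_DELAYED:
        return f"delayed for {last_executed - last_queued} (max allowed is {MAX_DELAYED})"
    return "OK"
```

So in the output above, a "blocked" queue hasn't completed a heartbeat task at all within its window, while a "delayed" queue is still processing tasks, just far behind when they were enqueued.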

I'm not certain what the cause of the blockage would be, but if any logs were recorded during the execution of that task, they'd be in the celery logs. Run this to see their location:
cchq <env> service celery logs
They would be in the logs for the periodic queue, or possibly the malt_generation_queue, if you separate queues out in your environment.


The corehq.apps.data_analytics.tasks.build_last_month_MALT task you mentioned does intentionally block until the subtasks that it spawns have all completed. The subtasks are spawned on the malt_generation_queue, which doesn't appear to be blocked in the services check, so it's possible the subtasks were still running, causing the parent task (build_last_month_MALT) to keep blocking. This is not a best-practice use of celery, so we created this PR to remove the need for this task to block in the future.
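To illustrate why a blocking parent task is a problem, here's a plain-Python analogy using `concurrent.futures` (this is not actual Celery code; in Celery terms, the anti-pattern is a parent task waiting on its subtasks' results, and the fix corresponds to a chord, where a callback task runs once the group of subtasks finishes and the parent returns immediately):

```python
from concurrent.futures import ThreadPoolExecutor

def subtask(n):
    # Stand-in for one unit of work, e.g. building MALT rows for one domain.
    return n * n

def blocking_parent(executor, items):
    # Anti-pattern: the parent occupies its "worker" until every
    # subtask finishes, which can look like a stalled queue.
    futures = [executor.submit(subtask, n) for n in items]
    return sum(f.result() for f in futures)  # blocks here

def non_blocking_parent(executor, items, callback):
    # Chord-like pattern: schedule the subtasks plus a completion
    # callback, then return right away without waiting.
    futures = [executor.submit(subtask, n) for n in items]
    def finish():
        callback(sum(f.result() for f in futures))
    executor.submit(finish)
```

In real Celery the second pattern doesn't tie up a worker process at all, since the broker invokes the callback after the group completes; that's the shape of the fix described above.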


Thanks Ethan and Greg, that's very helpful.