Celery monitoring heartbeat tasks flood - 0.06 s apart

erobinson · March 15, 2022, 5:30pm

I've had odd issues with Celery crashing on two servers when running newer builds. I got one server up by reverting to an older production build, however on the other server, using the latest build, celery was crashing. It usually crashes within 10-30 minutes. I tailed the log and noticed that consecutive celery heartbeat tasks are being received 0.06 seconds apart. e.g. below - note this is an edited log with 4 different tasks showing the first of each received task followed by the next of the same type to indicate the time between repeats of the same task):

[2022-03-15 17:01:41,922: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__saved_exports_queue[a6d31388-66e8-413e-ba5f-24de52e9ef55]
[2022-03-15 17:01:41,982: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__saved_exports_queue[46ffe0ee-a7e8-4e90-82d3-c4b3a41b1e92]

[2022-03-15 17:01:41,926: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__analytics_queue[1112cc2f-4773-41a8-8303-05789edfb02d]
[2022-03-15 17:01:41,987: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__analytics_queue[72d70fbe-94ec-4acf-b527-5ca223d5daaa]

[2022-03-15 17:01:41,931: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__case_rule_queue[b5bb20a8-5978-4821-bf9f-7d423587d878]
[2022-03-15 17:01:41,992: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__case_rule_queue[d9a2318a-28c3-44a4-b727-6b2da5fd9012]

[2022-03-15 17:01:41,936: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__background_queue[b4430e0f-ca8e-4619-88af-25171592131a]
[2022-03-15 17:01:41,997: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__background_queue[bf5f538e-cb97-4b31-9f80-a51bb927570a]

Is this expected behavior?

Here's the older build - celery appears stable on this one - each consecutive heartbeat task is around 10 seconds apart:

[2022-03-15 17:20:10,662: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__ucr_queue[cb5c1f23-1248-4a59-8bf6-58e44b9668aa]  
[2022-03-15 17:20:20,723: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__ucr_queue[99929f85-bf7a-41c1-9930-3284368b27d6]  

[2022-03-15 17:20:10,699: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__celery_periodic[8be0e76b-bd31-406f-b474-334e6c324675]  
[2022-03-15 17:20:20,708: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__celery_periodic[cb72e7ca-e606-4fb5-bc4a-755d01aca399]  

[2022-03-15 17:20:10,677: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__saved_exports_queue[de5a520d-2c28-49d6-82b3-80c821b1d9f5]  
[2022-03-15 17:20:20,719: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__saved_exports_queue[bc8fab72-834c-47bd-a407-2edd7871ef93]  

[2022-03-15 17:20:10,681: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__repeat_record_queue[1c03eb23-0d4b-41ce-baee-3306e7cff2ee]  
[2022-03-15 17:20:20,718: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__repeat_record_queue[2d4a6a0f-4de8-45d7-a39b-ebbace5d9fe6]

Is this configurable somewhere? I suspect may be partially responsible for Celery crashing. Not to mention log files growing pretty quickly.
EDIT perhaps this is normal behavior if celery has been down for an extended period and it's playing catch-up?

Thanks!

Simon_Kelly · March 16, 2022, 8:09am

New heartbeat tasks should still be created 10s apart so this is likely a backlog in the RabbitMQ queue.

If the issue persists you could try revoking the heartbeat tasks:

python manage.py revoke_celery_tasks \
corehq.celery_monitoring.heartbeat.heartbeat__repeat_record_queue \
corehq.celery_monitoring.heartbeat.heartbeat__saved_exports_queue \
corehq.celery_monitoring.heartbeat.heartbeat__celery_periodic \
corehq.celery_monitoring.heartbeat.heartbeat__ucr_queue

This command will continue to revoke the tasks as long as it is running. You can terminate it with Ctrl-C.

erobinson · March 16, 2022, 8:26am

Thanks Simon, I think you're on the money. Initially the check-services said OK for celery, but a re-check shows it is 2 days behind. This is useful for situations where a machine has been sitting idle with the celery queue stalled. I do still need to get to the bottom of the queue falling over on at least one system here but that's for a separate post. Thanks for the help on this!

*PS is there somewhere I can check or update RAM available to Celery, or is this not likely RAM related?

Simon_Kelly · March 16, 2022, 9:05am

It could be a RAM issue. There is no way to give celery more RAM other than increasing the RAM available on the machine. If it is a RAM issue then I would expect to see some logs indicating that in syslog

erobinson · March 16, 2022, 9:26am

Simon_Kelly:

python manage.py revoke_celery_tasks \
corehq.celery_monitoring.heartbeat.heartbeat__repeat_record_queue \
corehq.celery_monitoring.heartbeat.heartbeat__saved_exports_queue \
corehq.celery_monitoring.heartbeat.heartbeat__celery_periodic \
corehq.celery_monitoring.heartbeat.heartbeat__ucr_queue

I'll check syslog at around the time it falls over. I noticed it fall over literally a few minutes back just as I did a check-services which I found odd. Thanks @Simon_Kelly

EDIT just to update this, it seems it was indeed the backlog causing this, thanks for the tip on revoking heartbeat tasks @Simon_Kelly
This is resolved