After a server reboot, check_services reports that celery has been down for over 18 hours:
(cchq) ccc@monolith:~/commcare-cloud$ cchq monolith django-manage check_services
ssh ccc@10.3.0.4 -t -o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts 'sudo -iu cchq bash -c '"'"'cd /home/cchq/www/monolith/current; python_env/bin/python manage.py check_services'"'"''
Ubuntu 22.04.3 LTS
SUCCESS (Took 0.11s) kafka : Kafka seems to be in order
SUCCESS (Took 0.00s) redis : Redis is up and using 4.83M memory
SUCCESS (Took 0.02s) postgres : default:commcarehq:OK p1:commcarehq_p1:OK p2:commcarehq_p2:OK proxy:commcarehq_proxy:OK synclogs:commcarehq_synclogs:OK ucr:commcarehq_ucr:OK Successfully got a user from postgres
SUCCESS (Took 0.01s) couch : Successfully queried an arbitrary couch view
FAILURE (Took 0.00s) celery : analytics_queue has been blocked for 18:49:59.521897 (max allowed is 0:30:00)
async_restore_queue has been blocked for 18:50:08.388927 (max allowed is 0:01:00)
background_queue has been blocked for 18:49:59.264292 (max allowed is 0:10:00)
case_import_queue has been blocked for 18:49:58.454671 (max allowed is 0:01:00)
case_rule_queue has been blocked for 18:50:08.378536 (max allowed is 1:00:00)
celery has been blocked for 18:50:08.366363 (max allowed is 0:01:00)
celery_periodic has been blocked for 18:49:59.451630 (max allowed is 0:10:00)
email_queue has been blocked for 18:49:58.556194 (max allowed is 0:00:30)
export_download_queue has been blocked for 18:50:00.925046 (max allowed is 0:00:30)
repeat_record_queue has been blocked for 18:49:59.579119 (max allowed is 1:00:00)
saved_exports_queue has been blocked for 18:49:59.464947 (max allowed is 6:00:00)
ucr_queue has been blocked for 18:50:08.353426 (max allowed is 1:00:00)
SUCCESS (Took 0.03s) elasticsearch : Successfully sent a doc to ES and read it back
SUCCESS (Took 0.05s) blobdb : Successfully saved a file to the blobdb
SUCCESS (Took 0.02s) formplayer : Formplayer returned a 200 status code: https://inddex24.org/formplayer/serverup
SUCCESS (Took 0.00s) rabbitmq : RabbitMQ OK
Connection to 10.3.0.4 closed.
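For reference, the worker can also be queried directly from a Django shell using Celery's inspect API. A minimal sketch (run inside python manage.py shell on the monolith; it assumes current_app gets bound to the HQ Celery app once the Django settings are loaded):

# Which workers answer, and which queues are they actually consuming from?
from celery import current_app as app

inspect = app.control.inspect(timeout=5)
print(inspect.ping())           # do any workers respond at all?
print(inspect.active_queues())  # queues each responding worker is consuming from
print(inspect.reserved())       # tasks prefetched but not yet started
print(inspect.scheduled())      # tasks waiting on an ETA/countdown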
There is no change after a
cchq monolith service celery restart
and
sudo supervisorctl status
reports that everything is running:
commcare-hq-monolith-celery_flower RUNNING pid 35442, uptime 0:23:51
commcare-hq-monolith-celery_repeat_record_queue,celery,case_import_queue,background_queue,export_download_queue,saved_exports_queue,analytics_queue,ucr_queue,async_restore_queue,email_queue,case_rule_queue,celery_periodic_0 RUNNING pid 35589, uptime 0:23:41
commcare-hq-monolith-celerybeat RUNNING pid 51920, uptime 0:00:10
commcare-hq-monolith-django RUNNING pid 25165, uptime 0:37:08
commcare-hq-monolith-formsplayer-spring RUNNING pid 33958, uptime 0:25:24
commcare-hq-monolith-pillowtop-AppDbChangeFeedPillow-0 RUNNING pid 29954, uptime 0:28:36
commcare-hq-monolith-pillowtop-ApplicationToElasticsearchPillow-0 RUNNING pid 29955, uptime 0:28:36
commcare-hq-monolith-pillowtop-CacheInvalidatePillow-0 RUNNING pid 29956, uptime 0:28:36
commcare-hq-monolith-pillowtop-DefaultChangeFeedPillow-0 RUNNING pid 29957, uptime 0:28:36
commcare-hq-monolith-pillowtop-DomainDbKafkaPillow-0 RUNNING pid 29958, uptime 0:28:36
commcare-hq-monolith-pillowtop-KafkaDomainPillow-0 RUNNING pid 29959, uptime 0:28:36
commcare-hq-monolith-pillowtop-LedgerToElasticsearchPillow-0 RUNNING pid 29960, uptime 0:28:36
commcare-hq-monolith-pillowtop-SqlSMSPillow-0 RUNNING pid 29961, uptime 0:28:36
commcare-hq-monolith-pillowtop-UpdateUserSyncHistoryPillow-0 RUNNING pid 29962, uptime 0:28:36
commcare-hq-monolith-pillowtop-UserCacheInvalidatePillow-0 RUNNING pid 29963, uptime 0:28:36
commcare-hq-monolith-pillowtop-UserGroupsDbKafkaPillow-0 RUNNING pid 29964, uptime 0:28:36
commcare-hq-monolith-pillowtop-case-pillow-0 RUNNING pid 29965, uptime 0:28:36
commcare-hq-monolith-pillowtop-group-pillow-0 RUNNING pid 29966, uptime 0:28:36
commcare-hq-monolith-pillowtop-user-pillow-0 RUNNING pid 29967, uptime 0:28:36
commcare-hq-monolith-pillowtop-xform-pillow-0 RUNNING pid 29968, uptime 0:28:36
If I open a shell on the server and run:
python manage.py show_celery_tasks celery@10.3.0.4_repeat_record_queue,celery,case_import_queue,background_queue,export_download_queue,saved_exports_queue,analytics_queue,ucr_queue,async_restore_queue,email_queue,case_rule_queue,celery_periodic_0.1701687937_timestamp active
I get:
(none)
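The broker side can be cross-checked from the same shell as well, to tell apart "messages are piling up unconsumed" from "nothing is being published". A rough sketch, assuming the RabbitMQ/AMQP transport and using a few of the queue names from the check output above:

from celery import current_app as app

QUEUES = ["celery", "email_queue", "repeat_record_queue", "background_queue"]

with app.connection_or_acquire() as conn:
    channel = conn.default_channel
    for name in QUEUES:
        # passive=True only inspects the queue; it raises if the queue does not exist
        ok = channel.queue_declare(queue=name, passive=True)
        print(f"{name}: {ok.message_count} messages, {ok.consumer_count} consumers")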
The celery log looks like this:
2023-12-04 11:05:51,012 INFO [celery.apps.worker] celery@10.3.0.4_repeat_record_queue,celery,case_import_queue,background_queue,export_download_queue,saved_exports_queue,analytics_queue,ucr_queue,async_restore_queue,email_queue,case_rule_queue,celery_periodic_0.1701687937_timestamp ready.
2023-12-04 11:10:03,390 INFO [celery.worker.strategy] Task corehq.apps.analytics.tasks._track_workflow_task[982c4a75-1817-4ba5-9773-45b8d07b1aaf] received
2023-12-04 11:10:03,391 INFO [celery.worker.autoscale] Scaling up 1 processes.
2023-12-04 11:10:03,700 INFO [celery.app.trace] Task corehq.apps.analytics.tasks._track_workflow_task[982c4a75-1817-4ba5-9773-45b8d07b1aaf] succeeded in 0.004233837999890966s: None
2023-12-04 11:10:51,022 INFO [celery.worker.autoscale] Scaling down 1 processes.
2023-12-04 11:15:03,192 INFO [celery.worker.strategy] Task corehq.apps.analytics.tasks._track_workflow_task[864bed90-daa3-4236-a867-512310268bdd] received
2023-12-04 11:15:03,193 INFO [celery.worker.autoscale] Scaling up 1 processes.
2023-12-04 11:15:03,512 INFO [celery.app.trace] Task corehq.apps.analytics.tasks._track_workflow_task[864bed90-daa3-4236-a867-512310268bdd] succeeded in 0.004273338999837506s: None
2023-12-04 11:15:51,033 INFO [celery.worker.autoscale] Scaling down 1 processes.
2023-12-04 11:20:03,141 INFO [celery.worker.strategy] Task corehq.apps.analytics.tasks._track_workflow_task[be2e0719-869d-483d-b05a-adb9785e8153] received
2023-12-04 11:20:03,141 INFO [celery.worker.autoscale] Scaling up 1 processes.
2023-12-04 11:20:03,477 INFO [celery.app.trace] Task corehq.apps.analytics.tasks._track_workflow_task[be2e0719-869d-483d-b05a-adb9785e8153] succeeded in 0.005137547000003906s: None
2023-12-04 11:20:51,046 INFO [celery.worker.autoscale] Scaling down 1 processes.
2023-12-04 11:23:02,150 INFO [celery.worker.strategy] Task corehq.apps.users.tasks.update_domain_date[fe2d1a47-5fb8-4653-9f80-6949b3ee3df8] received
2023-12-04 11:23:02,151 INFO [celery.worker.autoscale] Scaling up 1 processes.
2023-12-04 11:23:02,772 INFO [celery.app.trace] Task corehq.apps.users.tasks.update_domain_date[fe2d1a47-5fb8-4653-9f80-6949b3ee3df8] succeeded in 0.2613754040000913s: None
2023-12-04 11:23:51,051 INFO [celery.worker.autoscale] Scaling down 1 processes.
2023-12-04 11:25:03,113 INFO [celery.worker.strategy] Task corehq.apps.analytics.tasks._track_workflow_task[9b1beb95-4899-4d58-b6a1-fca3643dcdc2] received
2023-12-04 11:25:03,113 INFO [celery.worker.autoscale] Scaling up 1 processes.
2023-12-04 11:25:03,485 INFO [celery.app.trace] Task corehq.apps.analytics.tasks._track_workflow_task[9b1beb95-4899-4d58-b6a1-fca3643dcdc2] succeeded in 0.004864945000008447s: None
2023-12-04 11:25:51,055 INFO [celery.worker.autoscale] Scaling down 1 processes.
2023-12-04 11:30:03,539 INFO [celery.worker.strategy] Task corehq.apps.analytics.tasks._track_workflow_task[938a32a4-035f-4d18-a670-a0abc8c927a0] received
2023-12-04 11:30:03,539 INFO [celery.worker.autoscale] Scaling up 1 processes.
2023-12-04 11:30:03,948 INFO [celery.app.trace] Task corehq.apps.analytics.tasks._track_workflow_task[938a32a4-035f-4d18-a670-a0abc8c927a0] succeeded in 0.005046146000040608s: None
2023-12-04 11:30:51,068 INFO [celery.worker.autoscale] Scaling down 1 processes.
2023-12-04 11:35:03,112 INFO [celery.worker.strategy] Task corehq.apps.analytics.tasks._track_workflow_task[809026c2-af06-4460-b077-1751e0509efa] received
2023-12-04 11:35:03,112 INFO [celery.worker.autoscale] Scaling up 1 processes.
2023-12-04 11:35:03,469 INFO [celery.app.trace] Task corehq.apps.analytics.tasks._track_workflow_task[809026c2-af06-4460-b077-1751e0509efa] succeeded in 0.005102047000036691s: None
2023-12-04 11:35:51,081 INFO [celery.worker.autoscale] Scaling down 1 processes.
It's now 11:35 UTC, so those are fairly recent entries, and every now and then something else shows up in the log, but I'm not seeing the regular flurry of heartbeat entries I'm used to.
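As far as I can tell, check_services derives the "blocked for ... (max allowed is ...)" lines from those heartbeats: a tiny task routed to each queue periodically stamps a "last seen" time, and the check compares the time since that stamp against a per-queue threshold. Roughly this pattern (illustrative only; the function and cache key names below are made up, not HQ's actual module):

from datetime import datetime, timedelta
from django.core.cache import cache

def record_heartbeat(queue_name):
    # a trivial task routed to each queue stamps "last seen" into the cache
    cache.set(f"heartbeat:{queue_name}", datetime.utcnow(), None)

def time_blocked(queue_name):
    # how long since this queue last ran its heartbeat task?
    last_seen = cache.get(f"heartbeat:{queue_name}")
    if last_seen is None:
        return None  # never seen since startup
    return datetime.utcnow() - last_seen

MAX_ALLOWED = {"email_queue": timedelta(seconds=30), "celery": timedelta(minutes=1)}
for queue, limit in MAX_ALLOWED.items():
    blocked = time_blocked(queue)
    if blocked is not None and blocked > limit:
        print(f"{queue} has been blocked for {blocked} (max allowed is {limit})")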
Any ideas would be very welcome!
Thanks