Celery failure on monolith

Celery appears to be having trouble starting on our server. I've tried restarting the service and rebooting the server. We updated CommCare HQ on 28 February.

A tail of the log shows the following:

(cchq) ccc@monolith:~/commcare-cloud$ tail  /home/cchq/www/monolith/log/celery_*.log
==> /home/cchq/www/monolith/log/celery_flower.log <==
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settings.py", line 1601, in <module>
    SHARED_DRIVE_CONF = helper.SharedDriveConfiguration(
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 27, in __init__
    self.blob_dir = self._init_dir(blob_dir)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 35, in _init_dir
    os.mkdir(path)
FileExistsError: [Errno 17] File exists: '/opt/shared_monolith/blobdb'

==> /home/cchq/www/monolith/log/celery_repeat_record_queue,celery,case_import_queue,background_queue,export_download_queue,saved_exports_queue,analytics_queue,ucr_queue,async_restore_queue,email_queue,case_rule_queue,celery_periodic_0.log <==
[2022-03-05 13:52:50,033: INFO/MainProcess] Received task: pillowtop.tasks.pillow_datadog_metrics[b057a50f-49c8-4875-b358-faadfdbdd6c3]  
[2022-03-05 13:52:50,048: INFO/ForkPoolWorker-1] Task corehq.preindex.tasks.run_continuous_indexing_task[6675cdf9-9e20-4d8d-b613-dc8c2affdc62] succeeded in 0.014429542003199458s: None
[2022-03-05 13:52:50,051: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__celery[58524cb0-2cca-4c2a-8d78-2cdd4da9f4d0]  
[2022-03-05 13:52:50,053: INFO/ForkPoolWorker-1] Task corehq.celery_monitoring.heartbeat.heartbeat__celery[856254e5-2527-41b9-a954-b934d334f778] succeeded in 0.0020082060000277124s: None
[2022-03-05 13:52:50,056: INFO/MainProcess] Received task: corehq.apps.registration.tasks.activation_24hr_reminder_email[4d7a694d-2557-4549-9816-28c7d15431f5]  
[2022-03-05 13:52:50,059: INFO/ForkPoolWorker-1] corehq.apps.callcenter.tasks.calculate_indicators[524c4775-c7a4-4d1a-8ec0-b51137c928e0]: Calculating callcenter indicators for domains:
[]
[2022-03-05 13:52:50,059: INFO/ForkPoolWorker-1] Task corehq.apps.callcenter.tasks.calculate_indicators[524c4775-c7a4-4d1a-8ec0-b51137c928e0] succeeded in 0.0032459089998155832s: None
[2022-03-05 13:52:50,062: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__celery_periodic[ad43f579-bcc4-401a-87fb-585b2fc595e3]  
[2022-03-05 13:52:50,766: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__celery[9bdbe954-a1ea-4f42-80bb-32a8651ba67b]

The blobdb symlink mentioned in the log appears as follows:

ls -l /opt/shared_monolith/blobdb
lrwxrwxrwx 1 root root 16 Jun 17  2021 /opt/shared_monolith/blobdb -> /opt/data/blobdb

check_services shows:

FAILURE (Took   0.03s) celery         : background_queue is delayed for 1 day, 1:38:56.223524 (max allowed is 0:10:00)
celery is delayed for 1 day, 0:04:33.729252 (max allowed is 0:01:00)
celery_periodic is delayed for 1:30:25.994082 (max allowed is 0:10:00)

Any assistance is appreciated!

What do you get when you execute:

ls -l /opt/data/blobdb

I suspect it doesn't exist (this is the path that your /opt/shared_monolith/blobdb symbolic link points to). The failure is in settingshelper.py where lines 34 and 35 look like:

if not os.path.exists(path):
    os.mkdir(path)

I think the only way the error you're getting can happen is if your symbolic link is broken. However, I'm not sure that "fixing" the broken symlink is the correct thing to do; it might be that something has changed and is now trying to use the blobdb path when it shouldn't (as in, it wasn't used in the past on your monolith). Do you have the exact versions (commit hashes) of both commcare-hq and commcare-cloud that your Feb 28th update was deployed with?
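For a bit of context on why a broken symlink produces exactly this error: os.path.exists() (like test -e in the shell) follows symlinks, so it reports "missing" when the link's target is gone, while mkdir then fails because the name itself is already taken by the link. A quick sketch you can reproduce in a scratch directory (the paths here are made up):

# create a deliberately broken symlink in a throwaway directory
mkdir -p /tmp/symlink-demo && cd /tmp/symlink-demo
ln -s /tmp/symlink-demo/missing-target blobdb
# -e follows the link, so the target "doesn't exist"...
test -e blobdb && echo "target exists" || echo "target missing"
# ...but the link itself is still there
test -L blobdb && echo "link exists"
# so mkdir fails the same way settingshelper.py does
mkdir blobdb   # mkdir: cannot create directory 'blobdb': File exists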

After digging into the source a bit, I can't see anything related to the blobdb or shared drive configuration that's changed lately, so I'm now thinking that your blobdb directory may actually have gone missing somehow. On your server, is the /opt/data directory a mountpoint (as in, is there an /etc/fstab entry for it)? If not, perhaps the directories inside it are (e.g. /opt/data/blobdb)? If so, I'd first check to make sure all your mounts are mounted and working as expected (see Checking Mounts below).
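A couple of quick ways to check that (just a sketch; findmnt ships with util-linux, so it should already be on the box):

# is /opt/data (or the blobdb directory itself) currently a mountpoint?
findmnt /opt/data
findmnt /opt/data/blobdb
# and does either appear in fstab?
grep -E 'opt/data' /etc/fstab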

If none of these data directories are mountpoints, I suppose one thing you could try is creating the blobdb directory so the link isn't broken anymore. In this case, I'd be a bit curious about why it vanished in the first place, but the following commands will "treat the symptoms":

sudo mkdir /opt/data/blobdb
# You also want to "own" this directory as the user/group that HQ runs as.
# Check whatever user and group your stack uses,
# assuming that's user/group: cchq/cchq
sudo chown cchq:cchq /opt/data/blobdb

Checking Mounts

One way to check this is to grab the output of your current mounts on the monolith, then attempt to mount all volumes, and see whether you get any errors or whether anything new gets mounted. For example:

# grab output of current mounts
mount > mounts-1st.txt
# attempt to mount all volumes
sudo mount -a
# ^ observe for any errors
# grab output of current mounts again
mount > mounts-2nd.txt
# compare for mount changes
diff -u mounts-1st.txt mounts-2nd.txt

Hey Joel, thanks for the responses.
The blobdb directory exists. /opt/data is indeed mounted at boot and is in fstab, so all appears well there.

(cchq) ccc@monolith:~/commcare-cloud$ ls /opt/data -l
total 1067424
-rwxrwx---  1 root          root                  55 Jun 17  2021 README
drwx------  5 root          root                4096 Jun 18  2021 backups
drwxrwxr-x  3 cchq          cchq            19247104 Mar  5 13:59 blobdb
drwx------  7 couchdb       couchdb             4096 Jun 18  2021 couchdb2
drwxr-xr-x  4 elasticsearch elasticsearch       4096 Jun 18  2021 elasticsearch-2.4.6
-rw-r--r--  1 root          root          1073741824 Jun 17  2021 emerg_delete.dummy
drwxr-xr-x 10 cchq          cchq                4096 Feb 14 09:24 formplayer
drwxr-xr-x  3 kafka         kafka               4096 Jun 18  2021 kafka
drwx------  3 postgres      postgres            4096 Jun 17  2021 postgresql
drwxr-xr-x  2 redis         root                4096 Mar  6 08:36 redis

Just to add, the /opt/data mount(s) are as follows:

/dev/sdc1 on /opt/data type ext4 (rw,relatime)
/opt/data on /opt/data type ecryptfs (rw,nosuid,nodev,noexec,relatime,ecryptfs_fnek_sig=5fbdd79f0a602953,ecryptfs_sig=5fbdd79f0a602953,ecryptfs_cipher=aes,ecryptfs_key_bytes=32,ecryptfs_unlink_sigs,user)

Update - good news, but a little confusing. Celery appears to have come right on its own after another restart of the service. I have no idea why, but I'm going to leave things as is. On the next server update I'm going to pay special attention to that mount and the symlink.
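For reference, one way I can keep an eye on both in the meantime (a rough sketch):

# refresh every 5 seconds: the symlink, its target, and the mount itself
watch -n 5 'ls -ld /opt/shared_monolith/blobdb /opt/data/blobdb; findmnt /opt/data'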

Thanks for your input Joel!

It looks like I spoke too soon.
Celery ran for about 10-15 minutes, it seems, processed some items, and then fell over again. Celery logs:

==> /home/cchq/www/monolith/log/celery_flower.log <==
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settings.py", line 1601, in <module>
    SHARED_DRIVE_CONF = helper.SharedDriveConfiguration(
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 27, in __init__
    self.blob_dir = self._init_dir(blob_dir)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 35, in _init_dir
    os.mkdir(path)
FileExistsError: [Errno 17] File exists: '/opt/shared_monolith/blobdb'

==> /home/cchq/www/monolith/log/celery_repeat_record_queue,celery,case_import_queue,background_queue,export_download_queue,saved_exports_queue,analytics_queue,ucr_queue,async_restore_queue,email_queue,case_rule_queue,celery_periodic_0.log <==
[2022-03-06 19:24:45,363: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__celery_periodic[6ffeed95-d364-4044-b741-0fdbb83422c2]  
[2022-03-06 19:24:45,368: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__async_restore_queue[7f97657e-137c-4a50-b4a2-989029d664f2]  
[2022-03-06 19:24:45,373: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__case_rule_queue[4046d904-bd27-49c5-bc04-b4d14f811b60]  
[2022-03-06 19:24:45,378: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__saved_exports_queue[04e4a6eb-fb9c-4092-907d-59fbd8df50ea]  
[2022-03-06 19:24:45,383: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__case_import_queue[94fe26e4-ceee-439d-be67-d365a3844e33]  
[2022-03-06 19:24:45,388: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__background_queue[2130d522-a8c1-4091-ad44-034f945ebef9]  
[2022-03-06 19:24:45,392: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__export_download_queue[be7c77e8-4066-458b-9c61-77c403a324bc]  
[2022-03-06 19:24:45,397: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__analytics_queue[19f08ab1-98f3-4baa-8995-768c58f955c4]  
[2022-03-06 23:15:26,357: INFO/MainProcess] Tasks flagged as revoked: 248b9b9a-e1ea-4418-9561-095a89404e03
[2022-03-07 05:14:55,145: INFO/MainProcess] Tasks flagged as revoked: f53f5773-76e1-401e-ab6e-818d53375a35

The /opt/shared_monolith/blobdb symlink indeed exists:

lrwxrwxrwx 1 root root 16 Jun 17 2021 /opt/shared_monolith/blobdb -> /opt/data/blobdb

and /opt/data/blobdb exists:

drwxrwxr-x 3 cchq cchq 19374080 Mar 7 08:18 blobdb

What I am seeing, though, is that the ownership of /opt/shared_monolith/blobdb appears to be root:root - I'm not sure if that is correct.

Hi Ed

This certainly could be a permissions issue. We've had variations on this issue in the past and they've always been permissions-related.

What is the ownership of the /opt/shared_monolith directory?

Also, could you copy in the output from running id cchq?
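For clarity, that's the output of something like:

ls -ld /opt/shared_monolith
id cchq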

Hey Simon, /opt/shared_monolith:

drwxrwxr-x  7 nobody        nfs               4096 Mar  5 13:24 shared_monolith

Contents of /opt/shared_monolith:

lrwxrwxrwx 1 root   root   16 Jun 17  2021 blobdb -> /opt/data/blobdb
drwxr-xr-x 3 cchq   cchq 4096 Mar  5 13:23 blobdb.old
drwxrwxr-x 2 nobody nfs  4096 Jun 18  2021 restore_payloads
drwxrwxr-x 2 nobody nfs  4096 Jun 18  2021 temp
drwxrwxr-x 2 nobody nfs  4096 Mar  7 10:45 transfer_payloads
drwxr-xr-x 2 cchq   cchq 4096 Jun 18  2021 tzmigration-planning

id cchq:

uid=1002(cchq) gid=1003(cchq) groups=1003(cchq),1002(dimagidev),1500(nfs)

Thanks for looking into this!

I think that blobdb directory should be owned either by cchq:cchq or else by nobody:nfs.

Something odd is happening @Simon_Kelly
I've just checked again, and these are the current contents of /opt/shared_monolith:

drwxr-xr-x 2 cchq   cchq 4096 Mar  7 11:46 blobdb
drwxrwxr-x 2 nobody nfs  4096 Jun 18  2021 restore_payloads
drwxrwxr-x 2 nobody nfs  4096 Jun 18  2021 temp
drwxrwxr-x 2 nobody nfs  4096 Mar  7 10:45 transfer_payloads
drwxr-xr-x 2 cchq   cchq 4096 Jun 18  2021 tzmigration-planning

Something must be recreating that symlink?
Celery is still stalled, however, and a restart still seems to produce this:

==> /home/cchq/www/monolith/log/celery_flower.log <==
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settings.py", line 1601, in <module>
    SHARED_DRIVE_CONF = helper.SharedDriveConfiguration(
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 27, in __init__
    self.blob_dir = self._init_dir(blob_dir)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 35, in _init_dir
    os.mkdir(path)
FileExistsError: [Errno 17] File exists: '/opt/shared_monolith/blobdb'

==> /home/cchq/www/monolith/log/celery_repeat_record_queue,celery,case_import_queue,background_queue,export_download_queue,saved_exports_queue,analytics_queue,ucr_queue,async_restore_queue,email_queue,case_rule_queue,celery_periodic_0.log <==
[2022-03-07 11:47:47,454: INFO/ForkPoolWorker-1] Task corehq.celery_monitoring.heartbeat.heartbeat__export_download_queue[e3f7c7b7-57bc-4b35-a20a-06cc6a24b550] succeeded in 0.0020390059798955917s: None
[2022-03-07 11:47:47,457: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__async_restore_queue[5b339856-3a63-4dd1-b718-a893b1e993b0]  
[2022-03-07 11:47:47,459: INFO/ForkPoolWorker-1] Task corehq.celery_monitoring.heartbeat.heartbeat__email_queue[0be65eaa-99b4-4f98-ade4-dfe9acbe6d6c] succeeded in 0.0019169049919582903s: None
[2022-03-07 11:47:47,462: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__celery[c9b8840b-da30-4e96-9640-44ec5f8533fd]  
[2022-03-07 11:47:47,464: INFO/ForkPoolWorker-1] Task corehq.celery_monitoring.heartbeat.heartbeat__saved_exports_queue[61816e3d-a96f-4f0e-adf7-deaa212e40c6] succeeded in 0.0025712079950608313s: None
[2022-03-07 11:47:47,467: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__ucr_queue[1c539b16-9b2e-4c56-8223-7581d99b6342]  
[2022-03-07 11:47:47,469: INFO/ForkPoolWorker-1] Task corehq.celery_monitoring.heartbeat.heartbeat__repeat_record_queue[c57c9e6e-15f5-423b-90af-14b90dfe3721] succeeded in 0.0019537050102371722s: None
[2022-03-07 11:47:47,471: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__background_queue[e8d8213d-d4ca-43fd-9d04-7def9fafa7ee]  
[2022-03-07 11:47:47,474: INFO/ForkPoolWorker-1] Task corehq.celery_monitoring.heartbeat.heartbeat__case_import_queue[f3f422dc-b3a2-4e8b-90ac-6ce95366d401] succeeded in 0.002115705981850624s: None
[2022-03-07 11:47:47,476: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__export_download_queue[9b0baa94-958d-4a03-b6e5-81a79f71dac5]

Another interesting observation... after restarting celery, it appears to be processing stuff. I opened another terminal session to keep an eye on the logs and ran the restart.
I got this when I did a check_services:

FAILURE (Took   0.03s) celery         : analytics_queue is delayed for 12:37:39.609377 (max allowed is 0:30:00)
async_restore_queue is delayed for 12:37:39.631663 (max allowed is 0:01:00)
background_queue is delayed for 14:10:18.299161 (max allowed is 0:10:00)
case_import_queue is delayed for 12:37:19.545380 (max allowed is 0:01:00)
case_rule_queue is delayed for 12:37:09.535245 (max allowed is 1:00:00)
celery is delayed for 12:46:02.102218 (max allowed is 0:01:00)
celery_periodic is delayed for 13:28:19.590890 (max allowed is 0:10:00)
email_queue is delayed for 12:37:22.721896 (max allowed is 0:00:30)
export_download_queue is delayed for 12:37:25.653010 (max allowed is 0:00:30)
repeat_record_queue is delayed for 12:37:09.489445 (max allowed is 1:00:00)
saved_exports_queue is delayed for 12:37:19.604560 (max allowed is 6:00:00)
ucr_queue is delayed for 12:36:59.482295 (max allowed is 1:00:00)

The logs indicated it was processing some data, and the delay values above went down from 15:xx:xx to 12:xx:xx over a couple of minutes. It's at 7:10:xx now, so I assume celery is running fine for now... I'm expecting it to crash again at some point though.

EDIT: the queues seemed to have caught up and appeared stable for a short while:

(cchq) ccc@monolith:~/commcare-cloud$ cchq monolith django-manage check_services
ssh ccc@10.1.0.4 -t -o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts 'sudo -iu cchq bash -c '"'"'cd /home/cchq/www/monolith/current; python_env/bin/python manage.py check_services'"'"''
Ubuntu 18.04.5 LTS
SUCCESS (Took   0.61s) kafka          : Kafka seems to be in order
SUCCESS (Took   0.00s) redis          : Redis is up and using 215.40M memory
SUCCESS (Took   0.01s) postgres       : default:commcarehq:OK p1:commcarehq_p1:OK p2:commcarehq_p2:OK proxy:commcarehq_proxy:OK synclogs:commcarehq_synclogs:OK ucr:commcarehq_ucr:OK Successfully got a user from postgres
SUCCESS (Took   0.01s) couch          : Successfully queried an arbitrary couch view
SUCCESS (Took   0.04s) celery         : OK
SUCCESS (Took   0.02s) elasticsearch  : Successfully sent a doc to ES and read it back
SUCCESS (Took   0.06s) blobdb         : Successfully saved a file to the blobdb
SUCCESS (Took   0.03s) formplayer     : Formplayer returned a 200 status code: https://inddex24.org/formplayer/serverup
SUCCESS (Took   0.00s) rabbitmq       : RabbitMQ OK
Connection to 10.1.0.4 closed

...however a few minutes later there was an issue:

FAILURE (Took   0.04s) celery         : async_restore_queue has been blocked for 0:06:18.871648 (max allowed is 0:01:00)
background_queue is delayed for 17:25:42.964762 (max allowed is 0:10:00)
case_import_queue has been blocked for 0:06:27.008177 (max allowed is 0:01:00)
case_rule_queue is delayed for 17:25:02.901198 (max allowed is 1:00:00)
celery has been blocked for 0:06:18.885239 (max allowed is 0:01:00)
email_queue has been blocked for 0:06:27.001671 (max allowed is 0:00:30)
export_download_queue has been blocked for 0:06:18.898530 (max allowed is 0:00:30)
ucr_queue is delayed for 17:24:52.913855 (max allowed is 1:00:00)

...another log tail showed the same error as before:

==> /home/cchq/www/monolith/log/celery_flower.log <==
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settings.py", line 1601, in <module>
    SHARED_DRIVE_CONF = helper.SharedDriveConfiguration(
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 27, in __init__
    self.blob_dir = self._init_dir(blob_dir)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 35, in _init_dir
    os.mkdir(path)
FileExistsError: [Errno 17] File exists: '/opt/shared_monolith/blobdb'

==> /home/cchq/www/monolith/log/celery_repeat_record_queue,celery,case_import_queue,background_queue,export_download_queue,saved_exports_queue,analytics_queue,ucr_queue,async_restore_queue,email_queue,case_rule_queue,celery_periodic_0.log <==
[2022-03-07 12:51:48,805: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__export_download_queue[f2d8b3d3-72b3-47e3-bcb1-c91a3387cc30]  
[2022-03-07 12:51:50,512: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__async_restore_queue[a7bcd486-51b3-4ba8-92fc-c736b776997d]  
[2022-03-07 12:51:50,537: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__celery_periodic[93a8d729-446e-4f35-ae70-85bb46ce7bfd]  
[2022-03-07 12:51:50,542: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__repeat_record_queue[0d543c06-4c74-4d59-9a7f-ee4c464bf1ab]  
[2022-03-07 12:51:50,546: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__case_rule_queue[e303dc13-ace9-45a1-b9ea-429274b95a7d]  
[2022-03-07 12:51:50,560: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__analytics_queue[f7a3beee-1d7a-4069-a129-41fc3df8f8a4]  
[2022-03-07 12:51:50,564: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__celery[4c14ff38-ebf7-47fc-beae-018113b6f146]  
[2022-03-07 12:51:50,570: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__case_import_queue[58e98441-65b8-4f1b-b806-178b5be22c41]  
[2022-03-07 12:51:50,574: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__ucr_queue[5b243dea-54ba-463d-ac66-77259ddaaf3a]  
[2022-03-07 12:51:50,579: INFO/MainProcess] Received task: corehq.celery_monitoring.heartbeat.heartbeat__background_queue[fe5c955d-f5b6-478b-94a0-8b82e6395b73]

/opt/shared_monolith directory:

drwxr-xr-x 3 cchq   cchq 4096 Mar  7 12:56 blobdb
drwxrwxr-x 2 nobody nfs  4096 Jun 18  2021 restore_payloads
drwxrwxr-x 2 nobody nfs  4096 Jun 18  2021 temp
drwxrwxr-x 2 nobody nfs  4096 Mar  7 10:45 transfer_payloads
drwxr-xr-x 2 cchq   cchq 4096 Jun 18  2021 tzmigration-planning

There is certainly something going on that's changing filesystem objects (symlinks, directories, perhaps permissions, etc). To clarify a few things:

  1. The permissions of the symlink /opt/shared_monolith/blobdb are fine. E.g.

    lrwxrwxrwx 1 root   root   16 Jun 17  2021 blobdb -> /opt/data/blobdb
    

    The symlink can be owned root:root (that's not a problem). What's important is that the directory it's pointing to exists and has the correct permissions, which you've confirmed it does (sometimes), so you're good here.

  2. The permissions on the /opt/shared_monolith directory concerned me at first. E.g.

    drwxrwxr-x  7 nobody  nfs  4096 Mar  5 13:24 shared_monolith
    

    Since it's only "world readable", at first I thought the cchq user couldn't write here, but then I noticed that your cchq user is a member of the nfs group, which makes this directory writable by the cchq user. Again, all good here (a quick empirical check is sketched after this list).

  3. The only permissions I'm curious about at this point are the ones on the /opt/data directory itself (e.g. ls -ld /opt/data); but after seeing your post about the symlink being present and later vanishing I'm less worried about this. If the cchq user didn't have at least read and execute permissions on the /opt/data directory, then celery wouldn't ever be working.
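As a quick empirical check of the above (a sketch, assuming the cchq user from your id output):

# confirm the cchq user can actually create and remove files under the shared directory
sudo -u cchq touch /opt/shared_monolith/.write-test && echo "writable"
sudo -u cchq rm /opt/shared_monolith/.write-test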

I think you have some sort of "background" transient issue. Some other process is changing the filesystem "structure" underneath the running application. This could be one or more of:

  • One or more of your mountpoints is periodically getting unmounted and re-mounted. This would manifest itself as the "correct files" being present when it's mounted, and then "weird files" being present when it's not (i.e. the "weird files" are located on the / filesystem and are periodically getting "hidden" when the mount is active). I have no idea why this would be happening.
  • Another HQ (django, celery, etc) process is still running and stomping on the files (although I don't understand why it would be doing what you're seeing).
  • Another unrelated process is stomping on the files.
  • One of these directory structures is on an NFS share, and some other system (process on another OS) also has the share mounted and is stomping on the files.

I noticed that the /opt/shared_monolith directory is owned nobody:nfs. Is this a mounted network filesystem? If so, is it possible that some other OS is trying to change the same files that Celery is using?
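If it would help to catch whatever is modifying these paths in the act, a couple of options (a sketch; inotify-tools and auditd may need installing first):

# watch for live changes under the shared directory
# (requires inotify-tools: sudo apt install inotify-tools)
sudo inotifywait -m -e create,delete,move,attrib /opt/shared_monolith
# or record who touches it with auditd, then query afterwards
sudo auditctl -w /opt/shared_monolith -p wa -k shared_monolith_watch
sudo ausearch -k shared_monolith_watch --interpret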

Hmm... I'll do some digging, but there aren't any background processes installed outside of what Commcare-Cloud / Ansible have set up. The one thing that's different is that /opt/data lives on a separate physical disk from the OS disk, and that disk is mounted at boot. Without it mounted, /opt/data and everything under it doesn't exist. I'll shut down services and unmount it to confirm the underlying mount point is empty, but I believe it is.

I'm going to hunt through the OS logs for the history of that mount to confirm... but my understanding is that if the mount disappeared, the whole system would crash, as /opt/data would be gone. It's only Celery that seems to be struggling, for some reason.
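A rough sketch of where that mount history might show up (assuming systemd/journald on this Ubuntu 18.04 box):

# mount/unmount and device events for the current boot
journalctl -b | grep -iE 'sdc1|opt/data|ecryptfs'
# kernel ring buffer with human-readable timestamps
dmesg -T | grep -iE 'sdc1|ecryptfs'
# when systemd last (re)mounted the fstab entry
systemctl status opt-data.mount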

I've noticed this error that appears in the celery logs:

==> /home/cchq/www/monolith/log/celery_flower.log <==
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settings.py", line 1601, in <module>
    SHARED_DRIVE_CONF = helper.SharedDriveConfiguration(
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 27, in __init__
    self.blob_dir = self._init_dir(blob_dir)
  File "/home/cchq/www/monolith/releases/2022-02-28_03.59/settingshelper.py", line 35, in _init_dir
    os.mkdir(path)
FileExistsError: [Errno 17] File exists: '/opt/shared_monolith/blobdb'

It appears to be related to running a django-manage check_services.
I put a watch on ls /opt/shared_monolith -l, and every time check_services runs (just before it completes, it seems), the /opt/shared_monolith/blobdb symlink timestamp updates - is it being recreated during check_services?

At the same time, the "FileExistsError: [Errno 17] File exists: '/opt/shared_monolith/blobdb'" error appears in /home/cchq/www/monolith/log/celery_flower.log.
It doesn't, however, appear to be related to celery stalling.

EDIT: that last statement may be incorrect; I'm assuming a new "FileExistsError" entry appears in celery_flower.log every time I run check_services. Let me double-check that.
Yep, apologies, that error seems to have been a bit of a red herring. It's not appearing every time I do a django-manage check_services. The blobdb symlink timestamp is being updated, though - possibly normal behaviour. Celery has been up for 30 minutes or so now - the longest in a while. I'm going to try syncing with the server to see how it behaves.

Yes, that bit of code in settingshelper.py that checks whether the blobdb directory exists (and tries to create it if not) gets run whenever any django-manage command is executed (because it originates from a top-level settings.py call, which runs every time a Django operation is executed). The problem is: Why does the previous line (the one above the os.mkdir(path)) evaluate to True with your symlink? I.e. this if statement:

if not os.path.exists(path):

Every time the os.mkdir(path) line is executed, it's because os.path.exists() has reported that the path doesn't exist, which is strange. The following django-manage shell command will print the Python "stat" of the blobdb directory. I'd be curious what you get when the symlink is in place.

cchq monolith django-manage shell -c 'import os; from django.conf import settings; print(os.stat(settings.SHARED_DRIVE_CONF.blob_dir))'

When I run this on one of my environments, I see the following output:

$ cchq ... django-manage shell -c 'import os; from django.conf import settings; print(os.stat(settings.SHARED_DRIVE_CONF.blob_dir))'

# ... snip (some Django logging output) ...

Ubuntu 18.04.5 LTS
os.stat_result(st_mode=16877, st_ino=8285781285710385096, st_dev=51, st_nlink=2, st_uid=65534, st_gid=1500, st_size=6144, st_atime=1615179586, st_mtime=1556318031, st_ctime=1615179586)
Connection to ... closed.
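(If it's useful for comparison, the st_mode and uid/gid numbers can be decoded; a quick sketch, assuming python3 and getent are available on the box:)

# 16877 decodes to drwxr-xr-x
python3 -c 'import stat; print(oct(16877), stat.filemode(16877))'
# map the uid/gid from the stat output back to names (65534/1500 in my example)
getent passwd 65534
getent group 1500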

Hey Joel, I get this:

Ubuntu 18.04.5 LTS
os.stat_result(st_mode=16877, st_ino=1355325, st_dev=2065, st_nlink=3, st_uid=1002, st_gid=1003, st_size=69632, st_atime=1646653613, st_mtime=1646721819, st_ctime=1646721819)

The problem is: Why does the previous line (the one above the os.mkdir(path)) evaluate to True with your symlink?

Does it, though? My references to that FileExistsError: [Errno 17] appear to be a red herring, unfortunately. I've been tailing /home/cchq/www/monolith/log/celery_*.log, which prints the tail of celery_flower.log first every time, followed by the other (longer-named) log... and that error in celery_flower.log is actually an old message and unrelated. It's repeated a number of times, but I don't think it has been appended to recently. I made a copy of it last night and compared after a couple of celery failures/restarts, and there was no change - so apologies if that caused any confusion.

FileExistsError: [Errno 17] appears to be a red herring

Ok, yes, I'm with you now. I should have read your previous update more closely; you stated this clearly earlier.

So Celery was falling over roughly once an hour (sometimes more). I've performed an update and deployed 2022-03-07_08.11-production-deploy.
I'll update this thread if the issue is resolved. Thanks everyone for your input so far, it's greatly appreciated!

Note - this did appear to resolve our celery crashing issue. Celery is now very stable on this server. We have another server that appears to be having celery issues with the latest build. I'm going to deploy the same build as above and see if that resolves it.