*URGENT* 500 error when accessing any application

Following on from this original issue:

We seemed to have resolved those issues with a redeploy of the following:

commit 20365fa396b259769ff4107dcfa6da023d95ce5f (HEAD, tag: 2022-03-07_08.11-production-deploy)

However, today we're getting 500 errors when trying to open any application on any of our project spaces. Which logs should I be looking at? All services report being up and working.
Any advice appreciated!

That type of error should show up in the Django logs at
or possibly
If you use Sentry, it'd show up there as well.
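If Sentry isn't set up, one way to pull these out of a Django log file is to scan for the `django.request` "Internal Server Error" marker. A minimal sketch — the function name is ours, and the log path in the comment is a guess, not the actual location in this environment:

```python
import re

def find_server_errors(log_path):
    """Return the request paths of 'Internal Server Error' entries in a Django log."""
    pattern = re.compile(r"ERROR \[django\.request\] Internal Server Error: (\S+)")
    errors = []
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                errors.append(m.group(1))
    return errors

# Hypothetical usage -- adjust the path to wherever your environment writes Django logs:
# print(find_server_errors("/home/cchq/www/monolith/log/django.log"))
```

Each match corresponds to one 500 response, so repeated paths point at the failing view.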


Thanks @Ethan_Soergel, this is what I get when clicking virtually any application in the browser:

2022-03-11 09:08:05,748 ERROR [django.request] Internal Server Error: /a/xxxxxxxx-dev/apps/view/cbc4862d14324090845f87a413d8d82a/
Traceback (most recent call last):
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/blobs/mixin.py", line 192, in fetch_attachment
    blob = meta.open()
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/blobs/models.py", line 111, in open
    return db.get(meta=self)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/blobs/fsdb.py", line 68, in get
    raise NotFound(key)
corehq.blobs.exceptions.NotFound: 2598d588accb4f89920d7892198fd53a

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/python_env-3.9/lib/python3.9/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/python_env-3.9/lib/python3.9/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/python_env-3.9/lib/python3.9/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/python_env-3.9/lib/python3.9/site-packages/django/views/decorators/http.py", line 40, in inner
    return func(request, *args, **kwargs)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/domain/decorators.py", line 115, in _inner
    return call_view()
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/domain/decorators.py", line 85, in call_view
    def call_view(): return view_func(req, domain_name, *args, **kwargs)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/app_manager/views/apps.py", line 692, in view_app
    return view_generic(request, domain, app_id, release_manager=True)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/ex-submodules/dimagi/utils/couch/resource_conflict.py", line 20, in new_fn
    return fn(*args, **kwargs)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/app_manager/views/view_generic.py", line 163, in view_generic
    context.update(get_releases_context(request, domain, app_id))
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/app_manager/views/releases.py", line 222, in get_releases_context
    multimedia_state = app.check_media_state()
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/hqmedia/models.py", line 997, in check_media_state
    for media in self.all_media():
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/python_env-3.9/lib/python3.9/site-packages/memoized.py", line 20, in _memoized
    cache[key] = value = fn(*args, **kwargs)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/hqmedia/models.py", line 787, in all_media
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/app_manager/models.py", line 1181, in validate_form
    form = self.wrapped_xform()
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/app_manager/models.py", line 1173, in wrapped_xform
    return XForm(self.source, domain=self.get_app().domain)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/app_manager/models.py", line 790, in __get__
    source = app.lazy_fetch_attachment(filename)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/apps/app_manager/models.py", line 4022, in lazy_fetch_attachment
    content = self.fetch_attachment(name)
  File "/home/cchq/www/monolith/releases/2022-03-08_18.52/corehq/blobs/mixin.py", line 194, in fetch_attachment
    raise ResourceNotFound(
couchdbkit.exceptions.ResourceNotFound: Application cbc4862d14324090845f87a413d8d82a attachment: '81c3984c748946e0b60afc760dc324bd.xml'

Is it possible something went awry with my recent server upgrade?
If I look at my /opt/data/blobdb dir, it seems 2598d588accb4f89920d7892198fd53a does indeed exist... going to investigate further.
Perhaps this is a permissions issue?

-rw-r--r-- 1 cchq cchq 44537 Feb 16 13:20 /opt/data/blobdb/2598d588accb4f89920d7892198fd53a
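One way to rule permissions in or out is to check the same things `ls -l` shows, but programmatically, including whether the current process can actually read the file. A generic stdlib sketch (the function is ours; the path in the comment is the blob from the traceback above):

```python
import grp
import os
import pwd
import stat

def describe_file(path):
    """Return (owner, group, octal mode, readable-by-current-process) for a file."""
    st = os.stat(path)
    owner = pwd.getpwuid(st.st_uid).pw_name
    group = grp.getgrgid(st.st_gid).gr_name
    mode = oct(stat.S_IMODE(st.st_mode))
    return owner, group, mode, os.access(path, os.R_OK)

# Run this as the cchq user against the blob from the traceback, e.g.:
# describe_file("/opt/data/blobdb/2598d588accb4f89920d7892198fd53a")
```

If the last element comes back `False` while running as the service user, it's a permissions problem; if it's `True`, the problem is elsewhere (e.g. the application is looking in a different directory).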

That's odd - I'm not sure what the cause could be. You said that check_services doesn't show any errors with blobdb? It does a very simple interaction with the blobdb, so that might be a good place to start debugging: does the blobdb work at all? If you save a new file, do the permissions and file path line up with what you see for the application attachment?

Hey Ethan, yes, according to check_services, it's able to write to blobdb:

(cchq) ccc@monolith:~/commcare-cloud$ cchq monolith django-manage check_services
ssh ccc@ -t -o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts 'sudo -iu cchq bash -c '"'"'cd /home/cchq/www/monolith/current; python_env/bin/python manage.py check_services'"'"''
Ubuntu 18.04.5 LTS
SUCCESS (Took   0.61s) kafka          : Kafka seems to be in order
SUCCESS (Took   0.00s) redis          : Redis is up and using 206.25M memory
SUCCESS (Took   0.02s) postgres       : default:commcarehq:OK p1:commcarehq_p1:OK p2:commcarehq_p2:OK proxy:commcarehq_proxy:OK synclogs:commcarehq_synclogs:OK ucr:commcarehq_ucr:OK Successfully got a user from postgres
SUCCESS (Took   0.01s) couch          : Successfully queried an arbitrary couch view
SUCCESS (Took   0.01s) celery         : OK
SUCCESS (Took   0.02s) elasticsearch  : Successfully sent a doc to ES and read it back
SUCCESS (Took   0.06s) blobdb         : Successfully saved a file to the blobdb
SUCCESS (Took   0.04s) formplayer     : Formplayer returned a 200 status code: https://xxxxxxxx/formplayer/serverup
SUCCESS (Took   0.00s) rabbitmq       : RabbitMQ OK
Connection to closed.

@Ethan_Soergel I've sent you a link to a copy of the file in the log... it all looks fine and appears well formed. I was able to access it as the cchq:cchq user (owner), but I'm not sure if it's meant to be owned by another user. I'll compare on another system shortly.

Just a thought... could Zookeeper be involved somehow? I ask because we're having Zookeeper failures during after_reboot, as per this post, and have had to bypass that routine to get other services up.

I doubt it - Zookeeper is used for managing Kafka. They're involved in the asynchronous processes behind data forwarding and keeping Elasticsearch and UCRs up to date. I also don't think this error is related to the contents of the file, but rather to where/how it's stored or how it's being accessed. Does the file ownership look the same as the other files in that directory? If you save a new file to the blobdb, does it get added to the same directory? That's probably easiest to try in a Django shell session, following the check_services example I linked above.
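That suggestion amounts to a put/get round trip against the blob directory, checking where the new file actually lands. Outside a Django shell you can approximate it with a generic sketch like this (the function is ours and CommCare's real blobdb API differs; the path in the comment is an assumption from this thread):

```python
import os
import uuid

def blobdb_round_trip(blob_dir):
    """Write a throwaway blob into blob_dir, read it back, and report where it landed."""
    key = uuid.uuid4().hex
    path = os.path.join(blob_dir, key)
    payload = b"blobdb smoke test"
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        ok = f.read() == payload
    return ok, path

# e.g. blobdb_round_trip("/opt/data/blobdb")  # hypothetical target dir
```

The interesting part is the returned path: if new blobs land in a different directory than the one holding the missing attachment, the application and the old data are pointing at two different stores.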

Hi, yes, check_services appears able to create a file in the blobdb, according to the output in my post above:

SUCCESS (Took   0.06s) blobdb         : Successfully saved a file to the blobdb

All files in the directory and the directory itself are owned by cchq:cchq

EDIT apologies Ethan, I realised after seeing Simon's update in my other thread that you were probably asking me to run the check_services in a django-manage shell. I'll do that ASAP and report the output here.

Output from check_blobdb from django-manage shell:

In [2]: from corehq.apps.hqadmin.service_checks import check_blobdb

In [3]: check_blobdb()
Out[3]: ServiceStatus(success=True, msg='Successfully saved a file to the blobdb', exception=None, duration=None)

Hi all, we are still desperately trying to get to the bottom of this. Do any of the devs/architects have suggestions on the best way to log variables or otherwise troubleshoot on a live server?

OK, I've found something that may be of interest.
The blob file the system is apparently looking for (but can't find) in the stack trace I posted earlier is sitting in /opt/data/blobdb (ba9a83fee9de48baadd984c742da1aa6)

The newest file(s) in that directory are dated March 7.

I have found another directory called /opt/shared_monolith which has a subdirectory called blobdb.old (dated Mar 7) and another subdir called blobdb dated with today's date and containing files changed in the past two days.

What is /opt/shared_monolith/blobdb versus /opt/data/blobdb? Where can I see which directory it's using for the blobdb, or is it using both?

EDIT I see the blobdb in /opt/shared_monolith should probably be a symlink to the original /opt/data/blobdb directory and not a directory in itself. I am investigating and will report back here.
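Whether a given path is a real directory or a symlink is easy to verify from Python (a generic sketch; the expected layout in the comment is an inference from this thread, not a documented guarantee):

```python
import os

def classify_path(path):
    """Distinguish a symlink from a real directory, and show where it resolves to."""
    if os.path.islink(path):
        return "symlink", os.path.realpath(path)
    if os.path.isdir(path):
        return "real directory", os.path.realpath(path)
    return "missing", None

# On a healthy monolith (assumption based on this thread) one would expect:
# classify_path("/opt/shared_monolith/blobdb") -> ("symlink", "/opt/data/blobdb")
```

If it reports "real directory" where a symlink was expected, something (e.g. a service starting before the mount was in place) has created a plain directory at the mount/link point.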

What you describe does sound like there are two blobdb directories. I just realized you wrote earlier that you had issues running after_reboot, which seems like a plausible cause. One of the things it does is mount the encrypted drive. If this wasn't mounted properly before the blobdb came back online, it might have opportunistically created a new directory at the expected mount point. I'm not certain that's what happened here, but I recall seeing something of the sort on a test environment in the past.

Here's where the Filesystem backend for BlobDB is initialized in the code.
You could start a Django shell session and inspect settings.SHARED_DRIVE_CONF.blob_dir to see what it resolves to.


I think you're spot on @Ethan re: the issue with after_reboot. I'm going to update that thread shortly. After carefully analysing all the files in both blobdb directories and merging them (copying the files unique to each side), the system appears to be functioning again. I'm going to test this thoroughly and post here again to confirm. Thanks!
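For anyone who lands in the same split-directory state: the merge described above can be done by copying any file that exists in one directory but not the other, in both directions, never overwriting. A cautious sketch under two assumptions - the directories hold a flat list of blob files (a sharded/nested layout would need a recursive walk), and you have backed both up first:

```python
import os
import shutil

def merge_blob_dirs(dir_a, dir_b):
    """Copy files unique to each directory into the other; never overwrite existing files."""
    names_a = set(os.listdir(dir_a))
    names_b = set(os.listdir(dir_b))
    for name in names_a - names_b:
        shutil.copy2(os.path.join(dir_a, name), os.path.join(dir_b, name))
    for name in names_b - names_a:
        shutil.copy2(os.path.join(dir_b, name), os.path.join(dir_a, name))
    return sorted(names_a | names_b)

# e.g. merge_blob_dirs("/opt/data/blobdb", "/opt/shared_monolith/blobdb")
```

After merging, fixing the underlying mount/symlink (so only one real directory remains) is what prevents the split from recurring on the next reboot.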

That's great to hear. Good luck!
