Downtime start unable to shut down all services

erobinson · March 8, 2022, 6:07pm

I'm unable to stop all services using commcare-cloud monolith downtime start.

Even if I opt to kill services, I'm still getting the following when checking services:

FAILURE (Took   0.05s) kafka          : Could not connect to Kafka: NoBrokersAvailable
SUCCESS (Took   0.00s) redis          : Redis is up and using 193.09M memory
SUCCESS (Took   0.01s) postgres       : default:commcarehq:OK p1:commcarehq_p1:OK p2:commcarehq_p2:OK proxy:commcarehq_proxy:OK synclogs:commcarehq_synclogs:OK ucr:commcarehq_ucr:OK Successfully got a user from postgres
EXCEPTION (Took   0.00s) couch          : Service check errored with exception 'ConnectionError(MaxRetryError("HTTPConnectionPool(host='127.0.0.1', port=35984): Max retries exceeded with url: /commcarehq__apps (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fda8f851370>: Failed to establish a new connection: [Errno 111] Connection refused'))"))'
FAILURE (Took   0.01s) celery         : async_restore_queue has been blocked for 0:17:13.526189 (max allowed is 0:01:00)
background_queue has been blocked for 0:17:13.501510 (max allowed is 0:10:00)
case_import_queue has been blocked for 0:17:13.487967 (max allowed is 0:01:00)
celery has been blocked for 0:17:13.499305 (max allowed is 0:01:00)
celery_periodic has been blocked for 0:17:13.517133 (max allowed is 0:10:00)
email_queue has been blocked for 0:17:13.493053 (max allowed is 0:00:30)
export_download_queue has been blocked for 0:17:13.496966 (max allowed is 0:00:30)
EXCEPTION (Took   0.00s) elasticsearch  : Service check errored with exception 'ConnectionError('N/A', '<urllib3.connection.HTTPConnection object at 0x7fda8f5e2370>: Failed to establish a new connection: [Errno 111] Connection refused', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fda8f5e2370>: Failed to establish a new connection: [Errno 111] Connection refused'))'
SUCCESS (Took   0.05s) blobdb         : Successfully saved a file to the blobdb
FAILURE (Took   0.01s) formplayer     : Could not connect to formplayer: https://inddex24.org/formplayer/serverup
SUCCESS (Took   0.00s) rabbitmq       : RabbitMQ OK
Connection to 10.1.0.4 closed.

Trying to manually stop redis and postgresql returns the following:
Redis:

10.1.0.4 | FAILED! => {
    "changed": false,
    "msg": "redis-server process not presently configured with monit",
    "name": "redis-server",
    "state": "stopped"
}

PostgreSQL:

ansible 'postgresql,pg_standby,!remote_postgresql' -m monit -i /home/ccc/environments/monolith/inventory.ini -a 'name=postgresql_9.6 state=stopped' --diff -u ansible --become -e @/home/ccc/environments/monolith/public.yml -e @/home/ccc/environments/monolith/.generated.yml -e @/home/ccc/environments/monolith/vault.yml --vault-password-file=/home/ccc/commcare-cloud/src/commcare_cloud/ansible/echo_vault_password.sh '--ssh-common-args=-o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts'
[WARNING]: Could not match supplied host pattern, ignoring: remote_postgresql
10.1.0.4 | SUCCESS => {
    "changed": false,
    "name": "postgresql_9.6",
    "state": "stopped"
}
ansible 'postgresql,pg_standby,!remote_postgresql' -m monit -i /home/ccc/environments/monolith/inventory.ini -a 'name=pgbouncer state=stopped' --diff -u ansible --become -e @/home/ccc/environments/monolith/public.yml -e @/home/ccc/environments/monolith/.generated.yml -e @/home/ccc/environments/monolith/vault.yml --vault-password-file=/home/ccc/commcare-cloud/src/commcare_cloud/ansible/echo_vault_password.sh '--ssh-common-args=-o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts'
[WARNING]: Could not match supplied host pattern, ignoring: remote_postgresql
10.1.0.4 | SUCCESS => {
    "changed": false,
    "name": "pgbouncer",
    "state": "stopped"
}

Running check_services after the above shows that both services are still running:

SUCCESS (Took   0.00s) redis          : Redis is up and using 193.07M memory
SUCCESS (Took   0.01s) postgres       : default:commcarehq:OK p1:commcarehq_p1:OK p2:commcarehq_p2:OK proxy:commcarehq_proxy:OK synclogs:commcarehq_synclogs:OK ucr:commcarehq_ucr:OK Successfully got a user from postgres
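
To see at a glance which checks still report as up, the check_services output can be filtered. This is a minimal sketch, not part of commcare-cloud: the function name still_up and the file name used below are hypothetical, and it assumes each status line has the "STATUS (Took N.NNs) name : message" shape shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of services whose check still reports SUCCESS.
# Reads check_services output from the given file(s), or from stdin if none are given.
# Assumes status lines look like: "SUCCESS (Took   0.00s) redis          : ..."
still_up() {
    # Field 1 is the status, field 2 is "(Took", field 3 is the duration, field 4 is the service name.
    awk '$1 == "SUCCESS" && $2 == "(Took" { print $4 }' "$@"
}
```

For example, after saving the output to a file, still_up check_services.txt prints one service name per line for every check that passed.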

What could be causing this? I'd like to run an upgrade, but I'm wary of proceeding while these services can't be stopped.

Simon_Kelly · March 14, 2022, 12:52pm

It's possible there have been changes in how these processes are named or managed since they were set up. You could try running the Ansible setup playbooks to see if they report any changes:

cchq <env> ap deploy_postgres.yml
cchq <env> ap deploy_redis.yml
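
Before or after running the playbooks, it may also help to confirm which service manager actually knows about each process, since the monit error above says redis-server is not configured with monit. The following is a rough sketch under stated assumptions, not a commcare-cloud command: who_manages is a hypothetical helper, it is meant to be run on the affected host (monit usually needs root to report its status), and the name fragments passed to it are guesses to adjust.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether monit or systemd has an entry matching a name.
# The name is matched case-insensitively as a substring, so "redis" matches "redis-server".
who_manages() {
    local name="$1"
    # monit: list monitored entries and look for the name.
    if command -v monit >/dev/null 2>&1; then
        monit summary 2>/dev/null | grep -i -- "$name" \
            || echo "monit: no entry matching $name"
    else
        echo "monit: not installed"
    fi
    # systemd: list installed unit files and look for the name.
    if command -v systemctl >/dev/null 2>&1; then
        systemctl list-unit-files --no-pager 2>/dev/null | grep -i -- "$name" \
            || echo "systemd: no unit matching $name"
    else
        echo "systemd: not available"
    fi
}
```

Calling who_manages redis and who_manages postgres would then show whether each process appears under monit, under systemd, or under neither. If a service shows up only under systemd, that would be consistent with monit reporting it as not configured while check_services still finds it running.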