Update-config failing

erobinson · April 30, 2021, 3:58pm

I'm receiving the following error when running an update-config:

TASK [Update formplayer config files] *******************************************************************************************************************************************************
failed: [197.x.x.x] (item={'template': 'application.properties.j2', 'filename': 'application.properties'}) => {"ansible_loop_var": "item", "changed": false, "checksum": "2b20014c1c7337c03379e4086988c2509203ddae", "item": {"filename": "application.properties", "template": "application.properties.j2"}, "msg": "Destination directory /home/cchq/www/monolith/formplayer_build/current does not exist"}
failed: [197.x.x.x] (item={'template': 'logback-spring.xml.j2', 'filename': 'logback-spring.xml'}) => {"ansible_loop_var": "item", "changed": false, "checksum": "991ac7d8c561de95014c81e96ce3db960a43dd2d", "item": {"filename": "logback-spring.xml", "template": "logback-spring.xml.j2"}, "msg": "Destination directory /home/cchq/www/monolith/formplayer_build/current does not exist"}

Shortly before the error, I get these warnings:

This command will apply without running the check first. Continue? [y/N]y
ansible-playbook /home/ccc/commcare-cloud/src/commcare_cloud/ansible/deploy_stack.yml -i /home/ccc/environments/monolith/inventory.ini -e @/home/ccc/environments/monolith/public.yml -e @/home/ccc/environments/monolith/.generated.yml --diff --limit all --tags after-reboot -u ansible -e @/home/ccc/environments/monolith/vault.yml --vault-password-file=/home/ccc/commcare-cloud/src/commcare_cloud/ansible/echo_vault_password.sh '--ssh-common-args=-o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts'
Vault Password for 'monolith':
[WARNING]: Could not match supplied host pattern, ignoring: cas_proxy
[WARNING]: Could not match supplied host pattern, ignoring: pna_proxy
[WARNING]: Could not match supplied host pattern, ignoring: reach_proxy
[WARNING]: Could not match supplied host pattern, ignoring: plproxy
[WARNING]: Could not match supplied host pattern, ignoring: citusdb
[WARNING]: Could not match supplied host pattern, ignoring: commcarehq
[WARNING]: Could not match supplied host pattern, ignoring: airflow_scheduler
[WARNING]: Could not match supplied host pattern, ignoring: shared_efs_client_host
[WARNING]: Could not match supplied host pattern, ignoring: logproxy

I have checked and the directory mentioned does indeed not exist. Which process is meant to create that directory? Note, I had recently run an update-code and deploy.

Note that I also receive this after an after-reboot all:

TASK [Updating resolv.conf symbolic link] **********************************************************************************************************************************
fatal: [197.x.x.x]: FAILED! => {"changed": false, "gid": 0, "group": "root", "mode": "0644", "msg": "refusing to convert from file to symlink for /etc/resolv.conf", "owner": "root", "path": "/etc/resolv.conf", "size": 22, "state": "file", "uid": 0}

For reference, I get this on boot-up at login:

Downloading dependencies from galaxy and pip
ansible-galaxy install -f -r /home/ccc/commcare-cloud/src/commcare_cloud/ansible/requirements.yml
/home/ccc
-bash: wait: %2: no such job
[WARNING]: - dependency andrewrothstein.couchdb (v2.1.4) (v2.1.4) from role
andrewrothstein.couchdb-cluster differs from already installed version
(v2.1.5), skipping
[WARNING]: - dependency ANXS.cron (None) from role tmpreaper differs from
already installed version (v1.0.2), skipping
[WARNING]: - dependency sansible.java (None) from role sansible.logstash
differs from already installed version (v2.1.4), skipping
[WARNING]: - dependency sansible.users_and_groups (None) from role
sansible.logstash differs from already installed version (v2.0.5), skipping

Thanks!

Ethan_Soergel · April 30, 2021, 6:02pm

Hi Ed, we recently made some changes to the formplayer deploy scripts, so this might be a newly introduced bug. Thanks for raising the issue, I'll flag it internally and hopefully we can get a fix out soon.

Ethan_Soergel · April 30, 2021, 6:07pm

Also, to rule out complicating factors, can you confirm that your environment is otherwise up-to-date? I.e., the latest version of commcare-cloud, updated requirements, and the relevant changelog entries followed.

Is this a new environment or an older one that you're making some tweaks to?

erobinson · April 30, 2021, 6:14pm

Hi Ethan, thanks for the quick response. It's an older environment that was a tad out of date. I'll go through the changelogs with a fine-toothed comb and see what has been missed. I do know that the Elasticsearch update still needs to be done, though I need the config update to go through before I can attempt that.

Ethan_Soergel · April 30, 2021, 8:25pm

I just looked through the config a bit, and it looks like it's been this way since late 2019. If the /home/cchq/www/monolith/formplayer_build/ directory doesn't exist, you may need to redeploy formplayer, which should configure the directories appropriately.
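
To see whether a redeploy is actually needed, the directory can be checked on the server first. This is only a minimal sketch: the path is the one from the error message above, `check_release_dir` is a made-up helper name (not part of commcare-cloud), and whether `current` is a plain directory or a symlink on a given install is an assumption, so both cases are handled.

```shell
# Hypothetical helper, not part of commcare-cloud: report whether a
# release path exists, and flag a symlink whose target has gone away.
check_release_dir() {
    local path="$1"
    if [ -L "$path" ] && [ ! -e "$path" ]; then
        # The link itself exists but points at nothing.
        echo "broken symlink: $path -> $(readlink "$path")"
        return 1
    elif [ -d "$path" ]; then
        echo "ok: $path"
    else
        echo "missing: $path"
        return 1
    fi
}

# Example (path copied from the error message above):
# check_release_dir /home/cchq/www/monolith/formplayer_build/current
```

If this reports "missing" or "broken symlink", the update-config task has nowhere to write its templates, which matches the "Destination directory ... does not exist" message.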

erobinson · May 3, 2021, 11:11am

The formplayer deploy appears to be failing. The output can be seen here: https://pastebin.com/pE3QdCGg

Any other advice on where I should be looking?
Thanks!

Simon_Kelly · May 3, 2021, 11:35am

Hi Ed

That looks to me like the actual Formplayer deploy is succeeding and it's failing in a post-deploy action (I'll push a change to fix that).

It also looks like there may have been an error when it was trying to send the deploy email notification.

Either way - the actual Formplayer deploy appears to have succeeded.

erobinson · May 3, 2021, 12:01pm

Thanks for confirming that @snopoke

erobinson · May 3, 2021, 12:43pm

I now have an error when running an after-reboot all:

TASK [Updating resolv.conf symbolic link] *************************************************************************************************************************************************
fatal: [197.x.x.x]: FAILED! => {"changed": false, "gid": 0, "group": "root", "mode": "0777", "msg": "refusing to convert from file to symlink for /etc/resolv.conf", "owner": "root", "path": "/etc/resolv.conf", "size": 22, "state": "file", "uid": 0}

PLAY RECAP ********************************************************************************************************************************************************************************
197.x.x.x : ok=10 changed=0 unreachable=0 failed=1 skipped=12 rescued=0 ignored=0

(cchq) ccc@monolith:~/commcare-cloud$ ls -l /etc/resolv.conf
-rwxrwxrwx 1 root root 22 Dec 15 13:49 /etc/resolv.conf

Any ideas on this one? I'll dig to see what's causing the failure in the meantime. Thanks!

Simon_Kelly · May 3, 2021, 1:04pm

What version of Ubuntu are you running?

erobinson · May 3, 2021, 1:14pm

We're running 18.04.5 LTS

Simon_Kelly · May 3, 2021, 1:43pm

Looks like this is due to a fairly recent change in commcare-cloud. I think the resolution is to remove the /etc/resolv.conf file and re-run the task, but I'm waiting on feedback from some others. The change in question: "internal domain: updating resolv.conf symbolic link for OS-distribution Ubuntu" by rameshganne (Pull Request #4432, dimagi/commcare-cloud on GitHub).

erobinson · May 4, 2021, 11:03am

Just to confirm that after-reboot fails without the /etc/resolv.conf file (probably expected):

fatal: [197.x.x.x]: FAILED! => {"msg": "An unhandled exception occurred while running the lookup plugin 'dig'. Error was a <class 'dns.resolver.NoResolverConfiguration'>, original message: Resolver configuration could not be read or specified no nameservers."}

I resolved this issue by assigning rights for /etc and the /etc/resolv.conf file to the ansible user. Obviously not ideal, but it was then able to generate the symlink OK:

(cchq) ccc@monolith:~/commcare-cloud$ ls /etc/resolv.conf -l
lrwxrwxrwx 1 root root 32 May 4 12:00 /etc/resolv.conf -> /run/systemd/resolve/resolv.conf

I'll reset the rights to /etc in the meantime.
That said, the after-reboot is now failing with this:

RUNNING HANDLER [Restart pgbouncer] **********************************************************************************************************************************************************
failed: [197.211.237.144] (item=1) => {"ansible_loop_var": "item", "changed": false, "item": "1", "msg": "Could not find the requested service pgbouncer-multiprocess@1: host"}

I'll dig in the meantime, but any pointers or suggestions are welcome!

Simon_Kelly · May 4, 2021, 12:52pm

Hi Ed

There was a fix made for the resolv.conf issue ("checking resolv.conf status - Path exists and is a symlink" by rameshganne, Pull Request #4699, dimagi/commcare-cloud on GitHub), though it sounds like you may have needed to replace the file in any case.
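
For anyone else hitting the "refusing to convert from file to symlink" error, replacing the file by hand might look roughly like the sketch below. It is not what the playbook does: `replace_with_symlink` is a made-up helper, the target path is simply the one shown in the `ls` output earlier in this thread, and it assumes you run it as root and that the target file already exists. Swapping the file in one step avoids the window with no resolv.conf at all, which is what broke the 'dig' lookup above.

```shell
# Hypothetical helper: replace a regular file with a symlink to target,
# keeping a .bak copy of the original contents.
replace_with_symlink() {
    local target="$1" link="$2"
    # Refuse to create a dangling link (that would leave no resolver config).
    if [ ! -e "$target" ]; then
        echo "target missing: $target"
        return 1
    fi
    # Only back up when the path is a real file, not already a symlink.
    if [ -f "$link" ] && [ ! -L "$link" ]; then
        cp -p "$link" "$link.bak"
    fi
    # -f replaces the existing entry, -n avoids following an existing link.
    ln -sfn "$target" "$link"
    echo "linked $link -> $target"
}

# Example, run as root (target path taken from the ls output in this thread):
# replace_with_symlink /run/systemd/resolve/resolv.conf /etc/resolv.conf
```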

Regarding the pgbouncer restart issue, there was a change earlier this year to how pgbouncer is run. It doesn't appear that this change was announced (I'll get on that right away) but what you need to do is to run the deploy_postgres playbook:

cchq <env> ap deploy_postgres.yml --tags pgbouncer

That should remove the old pgbouncer process and create a new one with the correct name etc.

erobinson · May 4, 2021, 12:55pm

Awesome, thanks @snopoke
Update - the deploy_postgres.yml did the trick there, thanks a ton.
Right now there are other issues that I'm working on ironing out. I'm going through the changelogs to make sure I haven't missed anything. If I do a check_services I'm getting these exceptions:

SUCCESS (Took   0.31s) kafka          : Kafka seems to be in order
EXCEPTION (Took   0.13s) redis          : Service check errored with exception 'TypeError("__init__() got an unexpected keyword argument 'health_check_interval'",)'
SUCCESS (Took   0.60s) postgres       : default:commcarehq:OK p1:commcarehq_p1:OK p2:commcarehq_p2:OK proxy:commcarehq_proxy:OK synclogs:commcarehq_synclogs:OK ucr:commcarehq_ucr:OK Successfully got a user from postgres
SUCCESS (Took   0.48s) couch          : Successfully queried an arbitrary couch view
EXCEPTION (Took   0.00s) celery         : Service check errored with exception 'TypeError("__init__() got an unexpected keyword argument 'health_check_interval'",)'
EXCEPTION (Took   0.00s) heartbeat      : Service check errored with exception 'TypeError("__init__() got an unexpected keyword argument 'health_check_interval'",)'
SUCCESS (Took   0.31s) elasticsearch  : Successfully sent a doc to ES and read it back
SUCCESS (Took   2.13s) blobdb         : Successfully saved a file to the blobdb
FAILURE (Took   1.03s) formplayer     : Formplayer returned a 502 status code: https://xxx.xxx.xxx/formplayer/serverup
SUCCESS (Took   0.00s) rabbitmq       : RabbitMQ OK
Connection to 197.x.x.x closed.

erobinson · May 4, 2021, 6:25pm

OK, latest update, I have at least been able to get through some of the changelog items including upgrading Elasticsearch which definitely feels like progress. I'm currently doing the 23 July 2020 update: https://dimagi.github.io/commcare-cloud/changelog/0036-clean-2fa-sessions.html

My assumption is that I can deploy the latest code base, but perhaps I should be deploying this specific version: "The following version of CommCare must be deployed before rolling out this change: [09a8f4ef]"

Please confirm. Output for the monolith deploy after update-code (latest commcare-cloud) is as follows:

Thanks in advance. P.S. What is the syntax for deploying a specific commcare release?

EDIT: I think I found it:
cchq deploy --commcare-rev COMMCARE_REV
(The name of the commcare-hq git branch, tag, or SHA-1 commit hash to deploy.)

Simon_Kelly · May 5, 2021, 7:38am

Hi Ed

Regarding Changelog 36, you don't need to deploy a specific version as long as you deploy a version that has the referenced change in it. If you deploy the latest version that will be fine.
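
If you want to double-check that a given revision really contains the referenced change, git can answer that directly. A rough sketch, assuming you have a commcare-hq checkout to run it in; `rev_contains` is just an illustrative name, and `origin/master` in the example is an assumption about which revision you plan to deploy.

```shell
# Succeeds (exit 0) if commit $1 is part of the history of revision $2.
# Must be run from inside a git checkout that has both revisions.
rev_contains() {
    git merge-base --is-ancestor "$1" "$2"
}

# Example: is the commit named in the changelog included in origin/master?
# (Assumes "origin" points at the commcare-hq repository.)
# git fetch origin
# rev_contains 09a8f4ef origin/master && echo "included" || echo "not included"
```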

I took a look at the deploy output, it's failing when installing some of the python requirements. Specifically it's looking for the xmlsec library. This was added recently and there is a changelog for it as well: https://dimagi.github.io/commcare-cloud/changelog/0040-install-new-apt-requirements.html

Other changelogs that stuck out to me which you might need to run before being able to do a deploy are:

https://dimagi.github.io/commcare-cloud/changelog/0035-new-js-package-manager.html

erobinson · May 5, 2021, 8:04am

Thanks a ton Simon, I'll deploy those. The difficulty I'm having is that we haven't really been responsible for the maintenance of the server, so it's been challenging to figure out where we're at. I am making notes as we go.
Cheers!

erobinson · May 5, 2021, 10:42am

Quick question @Simon_Kelly
With a deploy it asks for the 'sudo' password. Is this the password for the currently logged in user (part of the sudoers group) or the root password? I'm assuming sudo for the currently logged in user and assume root account is disabled. At what point does it use that password? The odd thing is the deploy is still failing, but the output is identical no matter what I supply for the sudo password (including junk) - I assume it's not (yet?) using it?

Output is here:

Thanks again!

Simon_Kelly · May 5, 2021, 10:55am

Hi Ed

It should not be prompting for a password. That prompt was added in error and has just been reverted, so if you update your commcare-cloud you shouldn't get it.

I see the deploy is now failing when trying to communicate with Redis:

redis.exceptions.BusyLoadingError: Redis is loading the dataset in memory

If Redis was recently restarted you'll need to wait until it is ready to service requests.
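
A small polling loop can take the guesswork out of "wait until it is ready". This is a sketch rather than anything commcare-cloud ships: `wait_for_pong` is a made-up helper that takes the ping command as its arguments, and the example assumes `redis-cli` is installed on the Redis host and can reach Redis without extra options (no password, default port).

```shell
# Poll a "ping" command until it prints PONG, or give up after N tries.
# Usage: wait_for_pong <tries> <delay-seconds> <command...>
wait_for_pong() {
    local tries="$1" delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$tries" ]; do
        # While Redis is still loading its dataset the reply is a
        # LOADING error rather than PONG, so the loop keeps waiting.
        if [ "$("$@" 2>/dev/null)" = "PONG" ]; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still not ready after $tries attempt(s)"
    return 1
}

# Example: try up to 30 times, 2 seconds apart, before re-running the deploy.
# wait_for_pong 30 2 redis-cli ping
```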