*URGENT* unable to get server up after reboot

We have a production server that is down after a reboot.
There appears to be an issue starting PostgreSQL with the after-reboot script.
We receive the following error:

RUNNING HANDLER [postgresql_base : Start postgres] ************************************************************************************************************
fatal: [10.1.0.4]: FAILED! => {"changed": false, "msg": "Unable to start service postgresql@10-main: Assertion failed on job for postgresql@10-main.service.\n"}

The scripts appear to be incorrectly assuming we're running PostgreSQL 10, when in fact it looks like we're on 9.6.

Any help appreciated!
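For anyone checking the same thing, a quick way to confirm which PostgreSQL version is actually installed (a sketch, assuming a Debian/Ubuntu host where pg_lsclusters is available from the postgresql-common package):

```shell
# List installed PostgreSQL clusters and their versions (Debian/Ubuntu;
# pg_lsclusters ships with the postgresql-common package).
pg_lsclusters 2>/dev/null || echo "pg_lsclusters not available on this host"

# The per-version config directories are another quick indicator.
ls /etc/postgresql/ 2>/dev/null || echo "no /etc/postgresql directory found"
```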

Hey Ed, I'm looking into this. I'm not 100% familiar with how you maintain your commcare-cloud version (assuming you keep it current?).

Does your commcare-cloud "src/commcare_cloud/ansible/roles/postgresql_base/defaults/main.yml" file specify PostgreSQL version 10?

Ahh, that might be the wrong place to look (looks like that's been at 10 for quite some time). I'm looking into where/how you specify version 9.6 for your stack.

Hey Joel, the last update (commcare-cloud and commcareHQ) was performed recently on 28 February 2022.
My src/commcare_cloud/ansible/roles/postgresql_base/defaults/main.yml does indeed specify PostgreSQL version 10.

At what point was PostgreSQL upgraded to version 10 - was that a manual step that may have been missed somewhere down the line?

OK, as a sanity check, can you look in your commcare-cloud environment's 'postgresql.yml' file, specifically at 'postgresql_version' under the 'postgres_override' section? I'm interested in whether it's missing or, if it's there, what version is specified. For example, this env currently sets 9.6 explicitly: commcare-cloud/environments/production/postgresql.yml at master · dimagi/commcare-cloud · GitHub

Edit: 'postgres_override', not 'postgresql_overrides'
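If it helps, here's a quick sketch of that check, run from the root of your commcare-cloud environments checkout (the 'production' path is a placeholder; substitute your environment name):

```shell
# Placeholder path -- substitute your own environment name.
ENV_FILE="environments/production/postgresql.yml"

if grep -n "postgresql_version" "$ENV_FILE" 2>/dev/null; then
    echo "version is pinned explicitly (see match above)"
else
    echo "postgresql_version not set in $ENV_FILE; the role default applies"
fi
```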

At what point was PostgreSQL upgraded to version 10

It wasn't, that was my oversight. I think that file just specifies the "default value", an env file (see previous post) can override it. But I'm wondering if this is related to your problem (as in, maybe your env file isn't overriding it?).


No, there doesn't appear to be any explicit mention of postgresql_version:

(screenshot of the environment's postgresql.yml, with no postgresql_version entry)

Ok, that might be the problem. Let me dig into when the "default" got changed to 10, because if that's causing your problem, then there probably should have been a changelog step for that.

I'm not sure how comfortable you are with "poking" in your env file, but you could add that override to see if it solves your problem:

postgres_override:
  # snip...
  postgresql_version: '9.6'
  # snip...

but I understand you may not wish to do that on a critical cluster. I'll get back to you.

Thanks Joel, it's a fairly simple monolith environment and I have snapshotted it so I don't mind giving that a go.
The environment was originally cloned from here: sample-environment/monolith/postgresql.yml at master · dimagi/sample-environment · GitHub

Thanks Joel, that seems to have resolved the pg service restart issue. I'm now experiencing another blocker, however:

TASK [zookeeper : start Zookeeper service] ********************************************************************************************************************
fatal: [10.1.0.4]: FAILED! => {"changed": false, "msg": "Could not find the requested service zookeeper-server: host"}

We've identified the source of the postgres version problem. For any other community members who find this thread before a fix is deployed, the way to resolve this issue immediately is by updating your env's postgresql.yml file as I outlined above (quoted here for clarity).

postgres_override:
  # snip...
  postgresql_version: '9.6'
  # snip...

@erobinson, looking into this now.


What do you get when you run the following on the server?

sudo systemctl status zookeeper-server

Update: I'm wondering if this is just a transient systemd issue. Perhaps a sudo systemctl daemon-reload might resolve it?

(cchq) ccc@monolith:~/commcare-cloud$ sudo systemctl status zookeeper-server
[sudo] password for ccc: 
Unit zookeeper-server.service could not be found.

Going by your systemd output, it doesn't appear that there is a zookeeper-server service unit on your system. I'm not 100% sure if a monolith needs zookeeper or not... I'd need to dig into the commcare-cloud repo a little more to determine this. For completeness, I'd have you try:

sudo systemctl daemon-reload
sudo systemctl status zookeeper-server

If that still results in the "Unit ... could not be found" error, then the next thing I would want to figure out is:

  • Does your monolith need zookeeper?
    • Yes: why is it missing?
    • No: why is the update check failing in its absence?
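To rule out a unit that exists but simply isn't loaded, you could also ask systemd (and dpkg) what they know about zookeeper. A sketch, assuming a Debian/Ubuntu host; the unit name is taken from the error above:

```shell
# Does systemd know about any zookeeper unit at all?
systemctl list-unit-files 2>/dev/null | grep -i zookeeper \
    || echo "no zookeeper unit files registered with systemd"

# And is a zookeeper package even installed? (Debian/Ubuntu)
dpkg -l 2>/dev/null | grep -i zookeeper \
    || echo "no zookeeper packages found via dpkg"
```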

Unfortunately I have to drop offline now, I'll check back later this weekend to see if you made any progress.

(cchq) ccc@monolith:~/commcare-cloud$ sudo systemctl daemon-reload
[sudo] password for ccc: 
(cchq) ccc@monolith:~/commcare-cloud$ sudo systemctl status zookeeper-server
Unit zookeeper-server.service could not be found.

It does indeed look like it's missing. To be honest, I'm not familiar with it on any of our installs.
Thanks a ton for the assistance thus far. I will report back on one of the other servers ASAP.


No worries. I'm sure one of the DevOps engineers could sort this out quickly. Good luck, I'll check back in when I'm able.


A quick update... I disabled the zookeeper restart in after_reboot and got the server up, except for celery. Celery has been blocked, and restarting it doesn't appear to get anything going. We are on a build from 28 Feb. I'll start a separate thread about the Celery issue, but I'm happy to assist in troubleshooting and resolving this issue.
Thanks