After-reboot on a monolith consistently fails at the "start Zookeeper service" task

When running after-reboot, I consistently get an error on the Zookeeper start play:

PLAY [Zookeeper] ***********************************************************************************************************************************************************************

TASK [zookeeper : start Zookeeper service] *********************************************************************************************************************************************
fatal: []: FAILED! => {"changed": false, "msg": "Could not find the requested service zookeeper-server: host"}

At this point, if I do a check_services, there are always 3 failed services:

FAILURE (Took   0.05s) kafka          : Could not connect to Kafka: NoBrokersAvailable
SUCCESS (Took   0.00s) redis          : Redis is up and using 533.52M memory
SUCCESS (Took   0.05s) postgres       : default:commcarehq:OK p1:commcarehq_p1:OK p2:commcarehq_p2:OK proxy:commcarehq_proxy:OK synclogs:commcarehq_synclogs:OK ucr:commcarehq_ucr:OK Successfully got a user from postgres
SUCCESS (Took   0.06s) couch          : Successfully queried an arbitrary couch view
FAILURE (Took   0.00s) celery         : async_restore_queue has been blocked for 0:24:22.778680 (max allowed is 0:01:00)
background_queue has been blocked for 0:24:22.792954 (max allowed is 0:10:00)
case_import_queue has been blocked for 0:24:22.810978 (max allowed is 0:01:00)
celery has been blocked for 0:24:22.799202 (max allowed is 0:01:00)
celery_periodic has been blocked for 0:24:22.788920 (max allowed is 0:10:00)
email_queue has been blocked for 0:24:22.767711 (max allowed is 0:00:30)
export_download_queue has been blocked for 0:24:22.891070 (max allowed is 0:00:30)
SUCCESS (Took   0.18s) elasticsearch  : Successfully sent a doc to ES and read it back
SUCCESS (Took   0.16s) blobdb         : Successfully saved a file to the blobdb
FAILURE (Took   0.14s) formplayer     : Formplayer returned a 502 status code:
SUCCESS (Took   0.00s) rabbitmq       : RabbitMQ OK
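
For anyone who wants to pick the failed services out of a listing like the one above, here is a minimal sketch. It assumes the line layout shown here (status, timing, service name, then a colon), and `failed_services` is a hypothetical helper name, not something provided by commcare-cloud.

```shell
#!/bin/sh
# Print the name of every service whose check_services line starts with FAILURE.
# Reads the check_services listing on stdin.
# Assumption: the layout is "FAILURE (Took   0.05s) <name> : <message>",
# so the service name is the fourth whitespace-separated field.
failed_services() {
    awk '$1 == "FAILURE" { print $4 }'
}
```

Fed the listing above, this should print kafka, celery and formplayer, one per line. Continuation lines such as the blocked celery queues are ignored because they do not start with FAILURE.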

I can get them up OK with

cchq monolith service celery restart
cchq monolith service kafka restart
cchq monolith service formplayer restart

...however, even when they're all up, I get the same zookeeper start error if I run the after-reboot playbook. I'm concerned that something is amiss: if this isn't the final task in the playbook, the tasks after it may not be getting run.

Any advice is appreciated!

I can confirm that the problem exists. The first error is the same, but after running

cchq monolith service kafka restart

I get

ansible zookeeper -m service -i /home/lamp/environments/monolith/inventory.ini -a 'name=zookeeper state=restarted' --diff -u ansible --become -e @/home/lamp/environments/monolith/public.yml -e @/home/lamp/environments/monolith/.generated.yml -e @/home/lamp/environments/monolith/vault.yml --vault-password-file=/home/lamp/commcare-cloud/src/commcare_cloud/ansible/ '--ssh-common-args=-o UserKnownHostsFile=/home/lamp/environments/monolith/known_hosts' | FAILED! => {
"changed": false,
"msg": "Could not find the requested service zookeeper: host"

It seems that problem is the same as

I just did a fresh monolith install on an Ubuntu 18.04 VM and got the same error after restarting the kafka service:

(cchq) lamp@capi:~/environments/monolith$ cchq monolith service kafka restart
Vault Password for 'monolith':
ansible kafka -m service -i /home/lamp/environments/monolith/inventory.ini -a 'name=kafka-server state=restarted' --diff -u ansible --become -e @/home/lamp/environments/monolith/public.yml -e @/home/lamp/environments/monolith/.generated.yml -e @/home/lamp/environments/monolith/vault.yml --vault-password-file=/home/lamp/commcare-cloud/src/commcare_cloud/ansible/ '--ssh-common-args=-o UserKnownHostsFile=/home/lamp/environments/monolith/known_hosts'
ansible zookeeper -m service -i /home/lamp/environments/monolith/inventory.ini -a 'name=zookeeper state=restarted' --diff -u ansible --become -e @/home/lamp/environments/monolith/public.yml -e @/home/lamp/environments/monolith/.generated.yml -e @/home/lamp/environments/monolith/vault.yml --vault-password-file=/home/lamp/commcare-cloud/src/commcare_cloud/ansible/ '--ssh-common-args=-o UserKnownHostsFile=/home/lamp/environments/monolith/known_hosts' | FAILED! => {
"changed": false,
"msg": "Could not find the requested service zookeeper: host"

Additional info: when I tried cchq monolith service zookeeper status, I just got:

usage: cchq {monolith} service [-h] [--limit LIMIT] [--only PROCESS_PATTERN]
[{celery,citusdb,commcare,couchdb2,elasticsearch,elasticsearch-classic,formplayer,kafka,nginx,pillowtop,postgresql,rabbitmq,redis,webworker} ...]
cchq {monolith} service: error: argument services: invalid choice: 'zookeeper' (choose from 'celery', 'citusdb', 'commcare', 'couchdb2', 'elasticsearch', 'elasticsearch-classic', 'formplayer', 'kafka', 'nginx', 'pillowtop', 'postgresql', 'rabbitmq', 'redis', 'webworker')

It seems like the zookeeper service wasn't even created by the script.

So, how can I bring the zookeeper service up? Any help would be appreciated.


Did you check whether the kafka and zookeeper processes are running without errors, using the commands below?

sudo service kafka-server status
sudo service zookeeper-server status

Thanks for the reply. Here are my results for sudo service kafka-server status:

(cchq) lamp@monolith:~/commcare-cloud$ sudo service kafka-server status
[sudo] password for lamp:
● kafka-server.service - Apache Kafka server (broker)
Loaded: loaded (/etc/systemd/system/kafka-server.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2022-07-05 19:35:35 UTC; 4min 51s ago
Main PID: 70144 (java)
Tasks: 45 (limit: 19660)
CGroup: /system.slice/kafka-server.service
└─70144 java -Xmx1G -Xms1G -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:
Jul 05 19:35:35 systemd[1]: Started Apache Kafka server (broker).
Jul 05 19:18:52 systemd[1]: Started Apache Kafka server (broker).

And for sudo service zookeeper-server status:

(cchq) lamp@capi:~/commcare-cloud$ sudo service zookeeper-server status
[sudo] password for lamp:
● zookeeper-server.service - Apache Zookeeper server
Loaded: loaded (/etc/systemd/system/zookeeper-server.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2022-07-05 19:35:56 UTC; 8min ago
Docs: Index of /doc
Process: 4277 ExecStop=/opt/zookeeper/bin/ stop /opt/zookeeper/conf/zoo.cfg (code=exited, status=0/SUCCESS)
Process: 4206 ExecStart=/opt/zookeeper/bin/ start /opt/zookeeper/conf/zoo.cfg (code=exited, status=0/SUCCESS
Main PID: 4065 (code=exited, status=2)
Jul 05 19:35:55[4206]: grep: /opt/zookeeper/conf/zoo.cfg: No such file or directory
Jul 05 19:35:55[4206]: mkdir: cannot create directory ‘’: No such file or directory
Jul 05 19:35:56[4206]: Starting zookeeper ... STARTED
Jul 05 19:35:56[4277]: ZooKeeper JMX enabled by default
Jul 05 19:35:56[4277]: Using config: /opt/zookeeper/conf/zoo.cfg
Jul 05 19:35:56[4277]: grep: /opt/zookeeper/conf/zoo.cfg: No such file or directory
Jul 05 19:35:56[4277]: mkdir: cannot create directory ‘’: No such file or directory
Jul 05 19:35:56[4277]: Stopping zookeeper ... /opt/zookeeper/bin/ line 182: kil
Jul 05 19:35:56[4277]: STOPPED
Jul 05 19:35:56 systemd[1]: Started Apache Zookeeper server.
lines 1-18/18 (END)

It looks like the zookeeper config file wasn't created by the script.
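
To make that diagnosis easier to repeat, here is a small sketch that pulls the missing paths out of status or journal text. It only recognizes the `grep: <path>: No such file or directory` lines seen in the output above, and `missing_files` is a hypothetical helper name.

```shell
#!/bin/sh
# List each path that a "grep: <path>: No such file or directory" line
# complains about. Reads service status or journal text on stdin.
# Assumption: only grep-style messages like the ones in the zookeeper
# status output are of interest; other messages are ignored.
missing_files() {
    sed -n 's/.*grep: \(.*\): No such file or directory$/\1/p' | sort -u
}
```

Piping the zookeeper-server status output through it (for example `sudo service zookeeper-server status 2>&1 | missing_files`) should list /opt/zookeeper/conf/zoo.cfg once, even though the message appears several times.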

This issue has just been fixed.

Can you please run update-code and then deploy Zookeeper with commcare-cloud $env_name deploy-stack --skip-check --tags=zookeeper? That should create the missing config file and restart Zookeeper. You can verify the fix by checking the status of the zookeeper process again.

Thanks! It worked well after redeploying on a fresh VM. But now, at the commcare-cloud cchq deploy stage, I get a warning:

[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
[] Executing task '_task'
[] run: git rev-parse HEAD
[] Passphrase for private key:
[] out: fatal: unsafe repository ('/home/cchq/www/cchq/releases/2022-07-07_12.48' is owned by someone else)
[] out: To add an exception for this directory, call:
[] out:
[] out: git config --global --add /home/cchq/www/cchq/releases/2022-07-07_12.48
[] out:
Warning: run() received nonzero return code 128 while executing 'git rev-parse HEAD'!
Diff generation skipped. Supply a Github token to see deploy diffs.
New version details:
Branch deployed : commcare: master
Here's the complete diff on github: unsafe repository ('/home/cchq/www/cchq/releases/2022-07-07_12.48' is owned by someone else)
To add an exception for this directory, call:
git config --global --add /home/cchq/www/cchq/releases/2022-07-07_12.48...87ae2f7a767ef8541bbe20a917c3b2ad16b2ebc1
Are you sure you want to preindex and deploy to cchq? [y/N]

But executing git config --global --add /home/cchq/www/cchq/releases/2022-07-07_12.48 didn't help; I get the same warning.
After ignoring it, I got:

Are you sure you want to preindex and deploy to cchq? [y/N]y
Vault Password for 'cchq':

Sending email: lamp has initiated a CommCare HQ deploy to cchq
Ubuntu 18.04.6 LTS
Enter passphrase for key '/home/lamp/.ssh/id_rsa':
Connection to closed.
commcare-cloud cchq fab deploy_commcare --set code_branch=master --branch master
fab -f /home/lamp/commcare-cloud/src/commcare_cloud/ cchq deploy_commcare --set code_branch=master --disable-known-hosts --system-known-hosts /home/lamp/environments/cchq/known_hosts
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
Using commcare-hq branch master
[] Executing task 'deploy_commcare'
[] Executing task '_setup_release'
[] Executing task 'create'
[] sudo: mkdir -p /home/cchq/www/cchq/releases/2022-07-07_14.18

Any help would be appreciated.


To run any subsequent cchq commands after quick-install, you need to be SSHed into the VM as either the ansible user or the ssh_username that was set in install-config.yml, with SSH agent forwarding enabled (that is, ssh -A ansible@VM_IP or ssh -A ssh_username@VM_IP).
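
The -A flag forwards your local SSH agent to the VM. A quick way to confirm from inside the session that an agent is reachable is sketched below; `agent_forwarding_active` is a hypothetical helper name, and the check only looks at the standard SSH_AUTH_SOCK variable.

```shell
#!/bin/sh
# Succeed only if SSH_AUTH_SOCK points at an existing socket, which is what
# ssh sets up when an agent (local or forwarded with -A) is available.
agent_forwarding_active() {
    [ -n "${SSH_AUTH_SOCK:-}" ] && [ -S "$SSH_AUTH_SOCK" ]
}
```

If the function succeeds, `ssh-add -l` should list the keys the agent is holding; if it fails, reconnecting with ssh -A is the first thing to try.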

Thanks for the quick reply! Yes, I'm connected to my VM as the user lamp, which was set in install-config.yml. After that I got a strange error about the repository's ownership:

(cchq) lamp@monolith:~/commcare-cloud$ commcare-cloud cchq deploy
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
[] Executing task '_task'
[] run: git rev-parse HEAD
[] Passphrase for private key:
[] out: fatal: unsafe repository ('/home/cchq/www/cchq/releases/2022-07-07_12.48' is owned by someone else)
[] out: To add an exception for this directory, call:
[] out:
[] out: git config --global --add /home/cchq/www/cchq/releases/2022-07-07_12.48
[] out:
Warning: run() received nonzero return code 128 while executing 'git rev-parse HEAD'!
Diff generation skipped. Supply a Github token to see deploy diffs.
New version details:
Branch deployed : commcare: master
Here's the complete diff on github: unsafe repository ('/home/cchq/www/cchq/releases/2022-07-07_12.48' is owned by someone else)
To add an exception for this directory, call:
git config --global --add /home/cchq/www/cchq/releases/2022-07-07_12.48...f27f74680137509e8db1c7543411bfdfdd5569ee
Are you sure you want to preindex and deploy to cchq? [y/N]y

Executing git config --global --add /home/cchq/www/cchq/releases/2022-07-07_12.48 doesn't help.
It is the same on a VM installed in manual monolith mode and on a VM configured by the quick-install script.

Any clue about it? Thanks in advance!
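
One thing that may be worth checking: git's own hint for this error normally includes a configuration key, in the form `git config --global --add safe.directory <path>`, and that key does not appear in the command as pasted above. The sketch below assumes the key was simply lost in the paste. Note also that `--global` config is per user, so the command would have to be run as whichever account executes `git rev-parse HEAD` on the host holding the release directory; which account that is cannot be told from the output here. `mark_safe_directory` is a hypothetical helper name.

```shell
#!/bin/sh
# Mark a directory as safe for git when it is owned by a different user.
# With DRY_RUN=1 the command is printed instead of executed, so it can be
# reviewed first.
mark_safe_directory() {
    dir=$1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "git config --global --add safe.directory $dir"
    else
        git config --global --add safe.directory "$dir"
    fi
}
```

For example, `DRY_RUN=1 mark_safe_directory /home/cchq/www/cchq/releases/2022-07-07_12.48` prints the full command for that release directory. Since each deploy creates a new timestamped release directory, an exception added this way would cover only that one path.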

The same thing happens when connecting via ssh -A lamp@VM_IP. After the error from git, I got:

Are you sure you want to preindex and deploy to cchq? [y/N]y
Vault Password for 'cchq':

Sending email: lamp has initiated a CommCare HQ deploy to cchq
Ubuntu 18.04.6 LTS
Enter passphrase for key '/home/lamp/.ssh/id_rsa':
Connection to closed.
commcare-cloud cchq fab deploy_commcare --set code_branch=master --branch master
fab -f /home/lamp/commcare-cloud/src/commcare_cloud/ cchq deploy_commcare --set code_branch=master --disable-known-hosts --system-known-hosts /home/lamp/environments/cchq/known_hosts
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
Using commcare-hq branch master
[] Executing task 'deploy_commcare'
[] Executing task '_setup_release'
[] Executing task 'create'
[] sudo: mkdir -p /home/cchq/www/cchq/releases/2022-07-08_19.10

Fatal error: Needed to prompt for a connection or sudo password (host:, but input would be ambiguous in parallel mode

!!! Parallel execution exception under host '':

I am not quite sure what caused that error.

However, you can first try the command below to delete the existing release directories, and then run the deploy command again.

cchq $env fab clean_releases

I get the same error when executing the command above. From what I've found, the problem might be connected with Fabric's parallel execution mode; see the "Parallel execution" page of the Fabric documentation.
Someone reported the same error in the Stack Overflow question "ssh - Python Fabric Parallel Execution Failure on EC2: Updated".

Any ideas on which file to change to turn off parallel execution?

(cchq) lamp@monolith:~/commcare-cloud$ commcare-cloud cchq fab clean_releases
fab -f /home/lamp/commcare-cloud/src/commcare_cloud/ cchq clean_releases --disable-known-hosts --system-known-hosts /home/lamp/environments/cchq/known_hosts
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
Using commcare-hq branch master
[] Executing task 'clean_releases'
[] Executing task 'clean_releases'
[] sudo: ls /home/cchq/www/cchq/releases

Fatal error: Needed to prompt for a connection or sudo password (host:, but input would be ambiguous in parallel mode

!!! Parallel execution exception under host '':

Here's a related issue: Fabric returns exit code 0 on failed parallel tasks · Issue #572 · fabric/fabric · GitHub

There is a suspicious line in the error output above: File "/home/lamp/commcare-cloud/src/commcare_cloud/fab/operations/", line 255, in clean_releases releases = sudo('ls {}'.format(env.releases)).split()

Hello Roby,

The exact error is this:

paramiko.ssh_exception.PasswordRequiredException: Private key file is encrypted

This means your SSH private key is protected by a passphrase. Can you try again after removing the passphrase?
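
To check whether a given key still has a passphrase, a sketch along these lines may help. It relies on OpenSSH's ssh-keygen being installed, and `key_is_unencrypted` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if the private key at $1 loads with an empty passphrase, meaning it
# is NOT passphrase-protected; fail otherwise.
key_is_unencrypted() {
    # -y derives the public key from the private key; -P '' supplies an empty
    # passphrase so ssh-keygen fails instead of prompting.
    ssh-keygen -y -P '' -f "$1" >/dev/null 2>&1
}
```

For example, `key_is_unencrypted ~/.ssh/id_rsa && echo "no passphrase"`. If the key is protected, `ssh-keygen -p -f ~/.ssh/id_rsa` lets you change the passphrase interactively, including setting it to empty.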