Error updating platform via "cchq <env> deploy"

Hello Team CommCare
I need help, please. I'm trying to update our platform using the command 'cchq deploy' from commcare-cloud, but I'm getting this error. If anyone has any ideas, please let me know.


Thanks in advance.

Hi @Mirado

I asked ChatGPT to parse that image for me, and return the value of the traceback. Here it is:

Traceback (most recent call last):
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/corehq/apps/change_feed/management/commands/validate_kafka_pillow_checkpoints.py", line 36, in handle
    validate_checkpoints(print_only)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/corehq/apps/change_feed/management/commands/validate_kafka_pillow_checkpoints.py", line 96, in validate_checkpoints
    available_offsets = get_multi_topic_first_available_offsets(checkpoint_dict)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/corehq/apps/change_feed/topics.py", line 87, in get_multi_topic_first_available_offsets
    return _get_topic_offsets(topics, latest=False)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/corehq/apps/change_feed/topics.py", line 115, in _get_topic_offsets
    responses = client.send_offset_request(reqs)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/kafka/client.py", line 676, in send_offset_request
    return self._send_request_to_brokers(OffsetRequest, OffsetResponse, requests)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/kafka/client.py", line 677, in _send_request_to_brokers
    if not fail_on_error or not self._raise_on_response_error(resp)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/kafka/client.py", line 403, in _raise_on_response_error
    raise resp
kafka.errors.FailedPayloadError: FailedPayloadError

It looks like you're getting a response error from Kafka. Can you run check-services, and make sure Kafka is healthy?

Hello Norman,

Sorry for the late reply. The Kafka service is working fine, but looking through the logs, I found this. Do you have any suggestions for solving this problem? Thank you very much.


Best

Hi @Mirado

I'm happy that the Kafka service is working fine.

So it appears from those logs that Kafka is establishing and then losing its connection to the broker.

Is "Controller 1" running on 172.16.88.205? -- Can we rule out network issues as a possible explanation?
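One way to test the network question is to probe the broker's TCP port repeatedly from the machine that is losing the connection. The following is a minimal sketch, not part of commcare-cloud. The function name `probe_tcp` is hypothetical, and port 9092 (Kafka's default listener port) is an assumption about this deployment.

```python
import socket
import time


def probe_tcp(host, port, attempts=5, delay=1.0, timeout=3.0):
    """Try to open a TCP connection several times.

    Returns one boolean per attempt: True if the connection opened,
    False if it was refused, timed out, or failed in any other way.
    """
    results = []
    for attempt in range(attempts):
        try:
            # create_connection performs the TCP handshake only; it does
            # not speak the Kafka protocol.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(True)
        except OSError:
            results.append(False)
        if attempt < attempts - 1:
            time.sleep(delay)
    return results


# Example call (address taken from this thread, port assumed):
# probe_tcp("172.16.88.205", 9092, attempts=10, delay=2.0)
```

A mix of True and False values would point towards an intermittent network or broker problem. A run of True values only shows that the port is reachable, not that Kafka itself is healthy.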

I think we can also rule out the Kafka settings for "listeners" and "advertised.listeners". It seems to connect successfully at 07:01:26,385 and if these settings were wrong, I don't think the controller would be able to establish a connection at all.

Other possibilities are:

  • Insufficient resources (CPU, memory, disk I/O) on the broker or controller can cause the connection to drop.
  • High load on the broker can cause it to be slow in responding, leading to timeouts and dropped connections.
  • Problems with ZooKeeper. Check that it is running without errors.
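For the resource-related possibilities in the list above, a quick snapshot of disk space and load on the broker host can help narrow things down. This is a rough sketch using only the Python standard library. The function name `resource_snapshot` is hypothetical, and the path to inspect is an assumption that you would replace with the volume holding the Kafka data.

```python
import os
import shutil


def resource_snapshot(path="/"):
    """Return free disk space and 1-minute load per CPU for this host."""
    usage = shutil.disk_usage(path)
    # os.getloadavg() is available on Unix-like systems only.
    load_1m, _load_5m, _load_15m = os.getloadavg()
    cpus = os.cpu_count() or 1
    return {
        "disk_free_fraction": usage.free / usage.total if usage.total else 0.0,
        "disk_free_gib": usage.free / 2**30,
        "load_per_cpu_1m": load_1m / cpus,
    }


# Example call (path assumed; use the volume that holds the Kafka data):
# print(resource_snapshot("/opt/data"))
```

A very low free-disk fraction, or a load per CPU that stays well above 1, would be worth investigating further. Neither number proves on its own that resources are the cause.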

Hi @Norman_Hooper

Thank you very much for your quick response. Yes, Controller 1 is running on 172.16.88.205, and we can rule out network issues because all traffic is allowed on the network.

This log keeps repeating itself, but occasionally it manages to connect successfully.

I will look into the suggestions you provided and get back to you. Thanks again!

Best

Hi @Norman_Hooper

I've resolved the issue, and we have now been able to complete a new deployment using 'cchq deploy'.
There was a corrupted file related to Kafka replication: '/opt/data/kafka/data/replication-offset-checkpoint.tmp (Input/output error)', as shown in the log below, which caused Kafka to shut down and restart.
I deleted the file, and the connection was restored to normal, allowing us to proceed with the deploy.
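To look for other files in the same state, a script can walk the Kafka data directory and report every file that cannot be opened and read. This is a sketch under stated assumptions. The function name `find_unreadable_files` is hypothetical. It reads only the first byte of each file, so it will not detect damage deeper inside a file. The error in this thread occurred when opening the file for writing, which this read-only check only approximates.

```python
import os


def find_unreadable_files(directory):
    """Return (path, error message) for each file that cannot be read."""
    problems = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                # Opening and reading one byte is enough to surface errors
                # such as "Input/output error" or "Permission denied".
                with open(path, "rb") as handle:
                    handle.read(1)
            except OSError as exc:
                problems.append((path, exc.strerror or str(exc)))
    return problems


# Example call (directory taken from the log in this thread):
# for path, error in find_unreadable_files("/opt/data/kafka/data"):
#     print(path, error)
```

Run it as a user that is allowed to read the Kafka data; otherwise permission errors will appear in the results alongside any real I/O errors. An "Input/output error" may also indicate a problem with the underlying disk, so it is worth checking the system logs as well.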

Thank you so much for your help!

[2024-11-14 08:45:33,314] ERROR [ReplicaManager broker=0] Error while writing to highwatermark file in directory /opt/data/kafka/data (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Error while writing to checkpoint file /opt/data/kafka/data/replication-offset-checkpoint
Caused by: java.io.FileNotFoundException: /opt/data/kafka/data/replication-offset-checkpoint.tmp (Input/output error)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:162)

However, after that, I encountered another error in the log stating that the number of available brokers is only 2. Is there a way to manage this?

org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 2.
[2024-11-14 14:02:14,471] INFO [Admin Manager on Broker 0]: Error processing create topic request CreatableTopic(name='__consumer_offsets', numPartitions=50, replicationFactor=3, assignments=[], configs=[CreateableTopicConfig(name='compression.type', value='producer'), CreateableTopicConfig(name='cleanup.policy', value='compact'), CreateableTopicConfig(name='segment.bytes', value='104857600')]) (kafka.server.ZkAdminManager)
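The error above says that the `__consumer_offsets` topic was requested with a replication factor of 3 while only 2 brokers were available. The usual options are to bring the missing broker back or to lower the replication factor for that topic. In stock Kafka that is the broker setting `offsets.topic.replication.factor`; how it is set in this commcare-cloud environment is not confirmed here. As a small aid when scanning logs, the sketch below extracts the two numbers from such a line. The function name `parse_replication_error` is hypothetical.

```python
import re

# Matches the message text of InvalidReplicationFactorException as it
# appears in the log line quoted in this thread.
REPLICATION_ERROR = re.compile(
    r"Replication factor: (\d+) larger than available brokers: (\d+)"
)


def parse_replication_error(line):
    """Return requested and available counts, or None if the line differs."""
    match = REPLICATION_ERROR.search(line)
    if match is None:
        return None
    requested, available = (int(value) for value in match.groups())
    return {
        "requested": requested,
        "available": available,
        "missing_brokers": requested - available,
    }


line = (
    "org.apache.kafka.common.errors.InvalidReplicationFactorException: "
    "Replication factor: 3 larger than available brokers: 2."
)
print(parse_replication_error(line))
```

For the line from this thread, the result shows one broker missing, which suggests checking whether a third broker is expected in this environment and, if so, why it is not registered.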

Thanks again.
Best