Hello Team CommCare
I need help, please. I'm trying to update our platform using the command 'cchq deploy' from commcare-cloud, but I'm getting this error. If anyone has any ideas, please let me know.
Thanks in advance.
Hi @Mirado
I asked ChatGPT to parse that image for me, and return the value of the traceback. Here it is:
Traceback (most recent call last):
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/corehq/apps/change_feed/management/commands/validate_kafka_pillow_checkpoints.py", line 36, in handle
    validate_checkpoints(print_only)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/corehq/apps/change_feed/management/commands/validate_kafka_pillow_checkpoints.py", line 96, in validate_checkpoints
    available_offsets = get_multi_topic_first_available_offsets(checkpoint_dict)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/corehq/apps/change_feed/topics.py", line 87, in get_multi_topic_first_available_offsets
    return _get_topic_offsets(topics, latest=False)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/corehq/apps/change_feed/topics.py", line 115, in _get_topic_offsets
    responses = client.send_offset_request(reqs)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/kafka/client.py", line 676, in send_offset_request
    return self._send_request_to_brokers(OffsetRequest, OffsetResponse, requests)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/kafka/client.py", line 677, in _send_request_to_brokers
    if not fail_on_error or not self._raise_on_response_error(resp)
  File "/home/cchq/www/sic/releases/2024-10-17_09.29/python_env/lib/python3.9/site-packages/kafka/client.py", line 403, in _raise_on_response_error
    raise resp
kafka.errors.FailedPayloadError: FailedPayloadError
It looks like you're getting a response error from Kafka. Can you run check-services, and make sure Kafka is healthy?
Hello Norman,
Sorry for the late reply. The Kafka service is working fine, but looking through the logs, I found this. Do you have any suggestions for solving this problem? Thank you very much.
Hi @Mirado
I'm happy that the Kafka service is working fine.
So it appears from those logs that Kafka is establishing and then losing its connection to the broker.
Is "Controller 1" running on 172.16.88.205? -- Can we rule out network issues as a possible explanation?
I think we can also rule out the Kafka settings for "listeners" and "advertised.listeners". It seems to connect successfully at 07:01:26,385 and if these settings were wrong, I don't think the controller would be able to establish a connection at all.
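One way to test the network hypothesis is to probe the broker's listener repeatedly from the host that reports the dropped connections. The following is a minimal Python sketch and not part of commcare-cloud: the function names `broker_reachable` and `probe` are hypothetical, and port 9092 is only Kafka's usual default, so replace it with whatever the "listeners" setting actually uses. It only confirms that a TCP connection can be opened, not that Kafka itself is answering requests.

```python
import socket
import time


def broker_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(host: str, port: int, attempts: int = 5, delay: float = 1.0) -> list:
    """Try to connect `attempts` times, pausing `delay` seconds between tries."""
    results = []
    for attempt in range(attempts):
        results.append(broker_reachable(host, port))
        if attempt < attempts - 1:
            time.sleep(delay)
    return results


# Example call (the IP is from this thread; the port is an assumption):
# print(probe("172.16.88.205", 9092))
```

If the probe succeeds every time while the controller log still shows disconnects, the cause is more likely on the Kafka side than in the network.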
Other possibilities are:
Thank you very much for your quick response. Yes, Controller 1 is running on 172.16.88.205, and we can rule out network issues because all traffic is allowed on the network.
This log keeps repeating itself, but occasionally it manages to connect successfully.
I will look into the suggestions you provided and get back to you. Thanks again!
Best
I've resolved the issue, and we have now been able to run a new deployment successfully with 'cchq deploy'.
There was a corrupted file related to Kafka replication: '/opt/data/kafka/data/replication-offset-checkpoint.tmp (Input/output error)', as shown in the log below, which caused Kafka to shut down and restart.
I deleted the file, and the connection was restored to normal, allowing us to proceed with the deploy.
Thank you so much for your help!
[2024-11-14 08:45:33,314] ERROR [ReplicaManager broker=0] Error while writing to highwatermark file in directory /opt/data/kafka/data (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Error while writing to checkpoint file /opt/data/kafka/data/replication-offset-checkpoint
Caused by: java.io.FileNotFoundException: /opt/data/kafka/data/replication-offset-checkpoint.tmp (Input/output error)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
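An "Input/output error" is reported by the operating system rather than by Kafka, so it may be worth checking whether other files in the same directory are affected. The sketch below is one possible way to do that in Python, under stated assumptions: `find_unreadable_files` is a hypothetical helper, it only looks at the top level of the directory (not the per-partition subdirectories), and it only tests whether each file can be opened and read.

```python
import errno
import os


def find_unreadable_files(directory: str) -> list:
    """Return (path, error) pairs for top-level files that cannot be opened and read."""
    problems = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isdir(path):
            # Subdirectories (for example partition folders) are skipped in this sketch.
            continue
        try:
            with open(path, "rb") as handle:
                handle.read(1)  # Reading one byte is enough to trigger an open/read error.
        except OSError as exc:
            code = errno.errorcode.get(exc.errno, str(exc.errno))
            problems.append((path, f"{code}: {exc.strerror}"))
    return problems


# Example call (the path is the one from the log in this thread):
# for path, error in find_unreadable_files("/opt/data/kafka/data"):
#     print(path, error)
```

An empty result does not prove the disk is healthy. If entries with `EIO` show up, checking the underlying storage would be a reasonable next step.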
However, after that, I encountered another error in the log stating that the number of available brokers is only 2. Is there a way to manage this?
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 2.
[2024-11-14 14:02:14,471] INFO [Admin Manager on Broker 0]: Error processing create topic request CreatableTopic(name='__consumer_offsets', numPartitions=50, replicationFactor=3, assignments=[], configs=[CreateableTopicConfig(name='compression.type', value='producer'), CreateableTopicConfig(name='cleanup.policy', value='compact'), CreateableTopicConfig(name='segment.bytes', value='104857600')]) (kafka.server.ZkAdminManager)
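The log above shows the internal `__consumer_offsets` topic being requested with `replicationFactor=3` while only 2 brokers were available. In Kafka that factor comes from the broker setting `offsets.topic.replication.factor`, which defaults to 3. The usual options are to bring a third broker back or to lower that setting, and this thread does not establish which one fits this cluster. The Python sketch below only illustrates the check Kafka is making. The helper names are hypothetical, and the properties text passed in stands in for the contents of the broker's configuration file.

```python
from typing import Optional

DEFAULT_OFFSETS_TOPIC_REPLICATION_FACTOR = 3  # Kafka's documented default


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def offsets_topic_problem(properties_text: str, available_brokers: int) -> Optional[str]:
    """Return a message if the offsets topic cannot be created, else None."""
    props = parse_properties(properties_text)
    factor = int(props.get("offsets.topic.replication.factor",
                           DEFAULT_OFFSETS_TOPIC_REPLICATION_FACTOR))
    if factor > available_brokers:
        return (f"Replication factor: {factor} larger than "
                f"available brokers: {available_brokers}.")
    return None


# With no explicit setting and 2 brokers, the default of 3 is too high:
# print(offsets_topic_problem("", 2))
```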
Thanks again.
Best