Issue with Kafka service not starting

erobinson · June 14, 2022, 8:23am

I'm unable to get the Kafka service up on a server after it ran out of space.
This particular server was configured with the /opt/data directory being on the same drive as the OS.
When it ran out of space, I, as a temporary measure, copied the contents /opt/data (encrypted files) to a new drive mounted elsewhere, renamed /opt/data to /opt/data.old, cleared up some space, and finally created a directory symlink in /opt/data to the new location. I then ran after_reboot all and the system seemed to come up fine except for Kafka. Not sure if the Kafka issue is related, but I'm assuming so.

my Kafka server.log looks like this at the tail:

[2022-06-14 08:06:24,870] INFO Property num.network.threads is overridden to 2 (kafka.utils.VerifiableProperties)
[2022-06-14 08:06:24,871] INFO Property num.partitions is overridden to 1 (kafka.utils.VerifiableProperties)
[2022-06-14 08:06:24,871] INFO Property port is overridden to 9092 (kafka.utils.VerifiableProperties)
[2022-06-14 08:06:24,871] INFO Property socket.receive.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
[2022-06-14 08:06:24,872] INFO Property socket.request.max.bytes is overridden to 104857600 (kafka.utils.VerifiableProperties)
[2022-06-14 08:06:24,872] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
[2022-06-14 08:06:24,873] INFO Property zookeeper.connect is overridden to xxx.xxx.xxx.xxx:2181 (kafka.utils.VerifiableProperties)
[2022-06-14 08:06:24,873] INFO Property zookeeper.connection.timeout.ms is overridden to 1000000 (kafka.utils.VerifiableProperties)
[2022-06-14 08:06:24,945] INFO [Kafka Server 0], starting (kafka.server.KafkaServer)
[2022-06-14 08:06:24,950] INFO [Kafka Server 0], Connecting to zookeeper on xxx.xxx.xxx.xxx:2181 (kafka.server.KafkaServer)

The Kafka server.out log has a series of these messages being inserted every second or so:

[2022-06-14 08:11:58,870] INFO Opening socket connection to server xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2022-06-14 08:11:58,870] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

The controller.log file shows no new logs for today
I imagine this is a zookeeper issue - where should I be looking for those logs?

Any help is appreciated!

EDIT trying some of the suggestions here:

Running this:

cd /opt/kafka/bin
./zookeeper-shell.sh localhost:2181

Results in a series of errors:

Connecting to localhost:2181
Welcome to ZooKeeper!
JLine support is disabled
[2022-06-14 08:28:28,383] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[2022-06-14 08:28:29,485] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[2022-06-14 08:28:30,590] WARN Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
... etc

Output of lsof -i -P -n | grep LISTEN indicates neither zookeeper or kafka are listening for incoming connections.

jmiller · June 15, 2022, 6:52pm

Hi Ed, I'll take a stab at this from the "OS side of things". I'm not super familiar with the CommCare HQ Kafka/Zookeeper architecture details, so I'm afraid I won't be much help there.

I'm not seeing anything in the logs you've provided that jump out to me regarding an "operating system issue", but I think it's worth trying to rule out any issues on that front. Based on your description, it think the following outlines the sequence of events:

disk ran out of space
copied contents of /opt/data to <other_device_path>
renamed /opt/data to /opt/data.old
cleared up some space
created a directory symlink in /opt/data

My first disclaimer here is that it's entirely possible that when the disk ran out of space, your /opt/data content got corrupted in some way and that corruption is the ultimate problem here. It's impossible for me to know that, but I just want to point out that that is a possibility.

My following questions and inquires pertain to OS file permissions, and I don't know what user your kafka service runs as (my naive assumption is it runs as the cchq user, but I'm not sure about this). In any case, you will need to know this information in order determine if an any part of your problem is related to filesystem permissions.

My first inquiry pertains to how you copied the data in step #2 above. Depending on the method of copy (which command was used (e.g. cp, rsync, tar, zip, cpio, other), did you use sudo, etc), it's possible that the destination files got written with new file owners and/or file modes. If this happened, then it's possible that the kafka service simply cannot read the files at their new location. If your /opt/date.old directory (and its unmodified contents) are still hanging around, a quick sanity check would be to ensure the file owners and modes are identical between /opt/data.old/... and <other_device_path>/.... If any of the permissions or modes differ, this would be something to look into.

My next question is: does the user that your kafka service runs as have at least read/execute permission on all the directories that make up <other_device_path>? For example, if your new kafka data path (the one you've symlinked to in step #5) is located at /opt/external/new_disk, the user that kafka runs as will need at least read/execute permissions on all of:

/opt
/opt/external
/opt/external/new_disk

A good "can it at least see the files?" sanity check is to execute a find command as the kafka user to confirm that it can at least list all the filesystem objects that live inside the directory that your symlink points to. You could try something like:

sudo -u <your_kafka_service_user> find /opt/data/ | wc -l

(Note the trailing / on /opt/date/ above. That's necessary in order for the find command to traverse into the directory the symlink points to.)

If that user cannot traverse all the directory contents, the find command will emit "Permission denied" errors for each directory it cannot traverse.

You can then run the command again as root to verify that the kafka user was able to list all the same files that root can. For example:

sudo find /opt/data/ | wc -l
# ^ the number output by this command should match what you got for the kafka user

I would be happy to help if this turns up anything useful for you.

cellowitz · June 15, 2022, 7:45pm

Joel had a lot of great suggestions for the OS side of things, so I thought I would try to tackle the service side. Based on the errors in your messages it looks like zookeeper may not be running, or at least isn't running correctly, and possibly kafka is failing to start as well. You can verify with ps whether kafka or zookeeper are running, but if they are I would fully stop them first.

Then I would try starting zookeeper manually with /opt/zookeeper/bin/zkServer.sh start /opt/zookeeper/conf/zoo.cfg and looking at the logs in /opt/zookeeper/logs/ or /opt/zookeeper/bin/zkServer.sh start-foreground /opt/zookeeper/conf/zoo.cfg to try to get any error messages directly to the console. I think you will have to run that as root. Hopefully, if that does not succeed the error message will point in a useful direction. If you are able to start zookeeper without issue, you can try something similar for kafka. That command would be /opt/kafka/bin/kafka-server-start.sh /etc/kafka/config/server.properties. If that does not succeed, there should be useful info in the kafka logs, but based on the current log messages you have shared, I would bet the issue is with zookeeper and if you are able to successfully start that, kafka should start up fine.

erobinson · June 17, 2022, 10:36am

Hi @jmiller , thanks for the quick response. Regarding possible corruption, I have my doubts in that the files that were copied were all the encrypted files and the after_reboot decryption process seemed to work fine - I imagine that would have failed if one of the files was an issue, but I could be wrong there.

Re: permissions,
all files in /opt/kafka/bin (recursively) are owned by root:root
All files in /opt/data/kafka (recursively) are owned by kafka:kafka
I've checked against another working install with the same setup and both are the same.

The copy command used was:
sudo cp -a /opt/data/. /myData/opt/data/
The intention was to copy everything as is, recursively, preserving attributes
I've checked and it appears everything was preserved.

On the kafka user, given that the scripts are owned by root:root (as confirmed on another working server), I imagine they are all run as root.

erobinson · June 17, 2022, 11:20am

Thanks @cellowitz for your response too, I see zkServer.sh on my system is located in /usr/share/zookeeper/bin/ (with root:root ownership) and the zoo.cfg appears to be under /etc/zookeeper/conf/zoo.cfg
running sudo /usr/share/zookeeper/bin/zkServer.sh start-foreground /etc/zookeeper/conf/zoo.cfg produces no significant output:

ZooKeeper JMX enabled by default Using config: /etc/zookeeper/conf/zoo.cfg

It then returns to the prompt.
After running, I get this in the zookeeper logs at /var/log/zookeeper/zookeeper.log

2022-06-17 11:29:09,441 - INFO  [main:QuorumPeerConfig@134] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2022-06-17 11:29:09,453 - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2022-06-17 11:29:09,453 - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0
2022-06-17 11:29:09,454 - INFO  [main:DatadirCleanupManager@101] - Purge task is not scheduled.
2022-06-17 11:29:09,455 - WARN  [main:QuorumPeerMain@113] - Either no config or no quorum defined in config, running  in standalone mode
2022-06-17 11:29:09,479 - INFO  [main:QuorumPeerConfig@134] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2022-06-17 11:29:09,480 - INFO  [main:ZooKeeperServerMain@96] - Starting server
2022-06-17 11:29:09,497 - INFO  [main:Environment@100] - Server environment:zookeeper.version=3.4.10-3--1, built on Sat, 03 Feb 2018 14:58:02 -0800
2022-06-17 11:29:09,497 - INFO  [main:Environment@100] - Server environment:host.name=xxx.xxx.xxx
2022-06-17 11:29:09,498 - INFO  [main:Environment@100] - Server environment:java.version=1.8.0_312
2022-06-17 11:29:09,498 - INFO  [main:Environment@100] - Server environment:java.vendor=Private Build
2022-06-17 11:29:09,499 - INFO  [main:Environment@100] - Server environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
2022-06-17 11:29:09,499 - INFO  [main:Environment@100] - Server environment:java.class.path=/etc/zookeeper/conf:/usr/share/java/jline.jar:/usr/share/java/log4j-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/xmlParserAPIs.jar:/usr/share/java/netty.jar:/usr/share/java/slf4j-api.jar:/usr/share/java/slf4j-log4j12.jar:/usr/share/java/zookeeper.jar
2022-06-17 11:29:09,499 - INFO  [main:Environment@100] - Server environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
2022-06-17 11:29:09,505 - INFO  [main:Environment@100] - Server environment:java.io.tmpdir=/tmp
2022-06-17 11:29:09,510 - INFO  [main:Environment@100] - Server environment:java.compiler=<NA>
2022-06-17 11:29:09,512 - INFO  [main:Environment@100] - Server environment:os.name=Linux
2022-06-17 11:29:09,512 - INFO  [main:Environment@100] - Server environment:os.arch=amd64
2022-06-17 11:29:09,512 - INFO  [main:Environment@100] - Server environment:os.version=4.15.0-187-generic
2022-06-17 11:29:09,513 - INFO  [main:Environment@100] - Server environment:user.name=root
2022-06-17 11:29:09,513 - INFO  [main:Environment@100] - Server environment:user.home=/root
2022-06-17 11:29:09,513 - INFO  [main:Environment@100] - Server environment:user.dir=/opt
2022-06-17 11:29:09,521 - INFO  [main:ZooKeeperServer@829] - tickTime set to 2000
2022-06-17 11:29:09,521 - INFO  [main:ZooKeeperServer@838] - minSessionTimeout set to -1
2022-06-17 11:29:09,522 - INFO  [main:ZooKeeperServer@847] - maxSessionTimeout set to -1
2022-06-17 11:29:09,538 - INFO  [main:NIOServerCnxnFactory@89] - binding to port 0.0.0.0/0.0.0.0:2181
2022-06-17 11:29:09,892 - ERROR [main:ZooKeeperServerMain@64] - Unexpected exception, exiting abnormally
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:585)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:604)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:570)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:652)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:166)
        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
        at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:283)
        at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:410)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:118)
        at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:119)
        at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:87)
        at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:53)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)

...which appears to be the same output after running a cchq monolith service kafka restart
At this stage I'm going to try redeploying to see if that resolves it.

cellowitz · June 17, 2022, 4:30pm

Hey Ed,

Thanks for sharing the log messages. Based on the message, and your description of having filled up the disk, I suspect you are experiencing this known zookeeper issue https://issues.apache.org/jira/browse/ZOOKEEPER-1621 (or https://issues.apache.org/jira/browse/ZOOKEEPER-2332). It sounds like you should be able to find and delete the offending log file (in the data sense) or snapshot in your zookeeper data directory. That is likely /var/lib/zookeeper although is configurable in commcare-cloud. This stackoverflow answer also had some useful info.

Cal

erobinson · June 17, 2022, 4:59pm

Bril. There was a 0 byte file in the /var/lib/zookeeper/version-2 directory. I moved it and Kafka / zookeeper was able to restart!
Thank you @cellowitz @jmiller for your help, a big win on a Friday.