Issues with Changelog 0087 - upgrade to ES 6

I'm following the instructions for changelog 0087 and am running into issues:

In step 4, I updated my public.yml -

ELASTICSEARCH_MAJOR_VERSION: 6 (it was 5)

I also added these lines that were missing under the localsettings: block:

ES_APPS_INDEX_MULTIPLEXED: True
ES_CASE_SEARCH_INDEX_MULTIPLEXED: True
ES_CASES_INDEX_MULTIPLEXED: True
ES_DOMAINS_INDEX_MULTIPLEXED: True
ES_FORMS_INDEX_MULTIPLEXED: True
ES_GROUPS_INDEX_MULTIPLEXED: True
ES_SMS_INDEX_MULTIPLEXED: True
ES_USERS_INDEX_MULTIPLEXED: True
ES_APPS_INDEX_SWAPPED: False
ES_CASE_SEARCH_INDEX_SWAPPED: False
ES_CASES_INDEX_SWAPPED: False
ES_DOMAINS_INDEX_SWAPPED: False
ES_FORMS_INDEX_SWAPPED: False
ES_GROUPS_INDEX_SWAPPED: False
ES_SMS_INDEX_SWAPPED: False
ES_USERS_INDEX_SWAPPED: False
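For context, these all sit under the existing localsettings: key in public.yml, roughly like this (an abridged sketch of my file):

localsettings:
  # ...existing settings...
  ES_APPS_INDEX_MULTIPLEXED: True
  # ...remaining *_MULTIPLEXED flags...
  ES_APPS_INDEX_SWAPPED: False
  # ...remaining *_SWAPPED flags...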

Neither of the step 5 commands runs:

cchq django-manage update_config
cchq django-manage restart_services

I assume the first command should be
cchq <env> update-config

For the second command, I used:

cchq <env> downtime start

...followed by:

cchq <env> downtime end

For step 6, the reindex fails for all indices with:

(python_env-3.9) (monolith) cchq@monolith:~/www/monolith/current$ INDEX_CNAME='apps'
(python_env-3.9) (monolith) cchq@monolith:~/www/monolith/current$ ./manage.py elastic_sync_multiplexed start ${INDEX_CNAME}
Traceback (most recent call last):
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/./manage.py", line 172, in <module>
    main()
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/./manage.py", line 50, in main
    execute_from_command_line(sys.argv)
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/corehq/apps/es/management/commands/elastic_sync_multiplexed.py", line 624, in handle
    cmd_func(
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/corehq/apps/es/management/commands/elastic_sync_multiplexed.py", line 50, in start_reindex
    raise IndexNotMultiplexedException("""Index not multiplexed!
corehq.apps.es.exceptions.IndexNotMultiplexedException: Index not multiplexed!
            Sync can only be run on multiplexed indices

I recall having this issue previously with the upgrade to ES 5 in Jan 2024. See here:

The response at that time was: "Since you have started the system in November (2023) and around that time the indices were already multiplexed so all data was successfully populated in your secondary indices. It means you don't need to reindex process on this server. You can straightaway go with the ES 5 upgrade process and follow this changelog"

Would that apply here? I just find it odd that, if the indices are already multiplexed, the error says they are not.

Thanks!

Hey @erobinson!

Hope you are doing good.

I think there was a setting that was missed in the changelog.

Can you add the following setting in your localsettings file?

ES_MULTIPLEX_TO_VERSION: '6'

I think it should fix the issue that you are facing.
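Assuming you are managing these through the localsettings: block in public.yml (as with the flags above), the block would end up looking something like this (sketch only):

localsettings:
  # ...existing settings, including the *_MULTIPLEXED / *_SWAPPED flags...
  ES_MULTIPLEX_TO_VERSION: '6'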

In the meantime, I will update the changelog as well.

Thanks for flagging this and apologies for the inconvenience caused.


Thanks Amit, I'll let you know as soon as we have a chance to try again.
Ed

Hi Amit, the reindex seemed to kick off but I ran into another issue.
Please take a look at these logs and let me know your thoughts.

This was on the first re-index (apps):

The result of running
cchq <env> run-shell-command elasticsearch "grep '<Task Number>.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log"
from the control machine was:

I then checked the ES log using cat and found this:

I tried again after that:

Once again, the run-shell-command output:

and the ES log:

Hi all, I'm just bumping this as we have another upgrade to do this weekend and I'd like to see if we can resolve it. Thanks!

I have the same issue on another server. On the first re-index (apps), I get the following output:

(monolith) cchq@monolith:~$ sudo -iu cchq
(monolith) cchq@monolith:~$ cd /home/cchq/www/monolith/current
(monolith) cchq@monolith:~/www/monolith/current$ source python_env/bin/activate
(commcare-hq) (monolith) cchq@monolith:~/www/monolith/current$ INDEX_CNAME='apps'
(commcare-hq) (monolith) cchq@monolith:~/www/monolith/current$ ./manage.py elastic_sync_multiplexed start ${INDEX_CNAME}
2025-05-17 17:04:03,188 INFO [elastic_sync_multiplexed] Preparing index apps-2024-05-09 for reindex
2025-05-17 17:04:03,260 INFO [elastic_sync_multiplexed] Starting ReIndex process
2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] Copying docs from index apps-20230524 to index apps-2024-05-09




2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] -----------------IMPORTANT-----------------
2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] TASK NUMBER - 1124
2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] -------------------------------------------
2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] Save this Task Number, You will need it later for verifying your reindex process




Looking for task with ID '1k_6ORgNRK6wPf8kc0_4xA:1124' running on 'es0'
Progress 0.00% (0 / 1527). Elapsed time: 0:00:10. Estimated remaining time: (average since start = unknown) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:00:20. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:00:30. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:00:40. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:00:50. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:00. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:10. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:20. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:30. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:40. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Traceback (most recent call last):
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/./manage.py", line 175, in <module>
    main()
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/./manage.py", line 51, in main
    execute_from_command_line(sys.argv)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/corehq/apps/es/management/commands/elastic_sync_multiplexed.py", line 624, in handle
    cmd_func(
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/corehq/apps/es/management/commands/elastic_sync_multiplexed.py", line 72, in start_reindex
    check_task_progress(task_id)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/corehq/apps/es/utils.py", line 101, in check_task_progress
    task_details = manager.get_task(task_id=task_id)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/corehq/apps/es/client.py", line 154, in get_task
    task_details = self._es.tasks.get(task_id=task_id)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/client/utils.py", line 101, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/client/tasks.py", line 84, in get
    return self.transport.perform_request(
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/transport.py", line 402, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/connection/http_urllib3.py", line 252, in perform_request
    self._raise_error(response.status, raw_data)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/connection/base.py", line 253, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch6.exceptions.TransportError: TransportError(503, 'no_shard_available_action_exception', 'No shard available for [get [.tasks][task][1k_6ORgNRK6wPf8kc0_4xA:1124]: routing [null]]')

Any help would be greatly appreciated!

Hey @erobinson!

I was out sick on Friday so did not see your reply then. Apologies for the delayed response. I had a look at the logs that you have shared. This looks like an issue with the .tasks index. Elasticsearch uses it to track the status of tasks like the reindex. The elastic_sync_multiplexed command uses Elasticsearch's tasks API to fetch the status of the reindex and display it on the shell, and that is what is failing here.

Looking at the log

[2025-05-15T04:46:01,932][WARN ][o.e.t.TaskManager        ] [node_name][es0]  couldn't store response BulkIndexByScrollResponse[took=1m,timed_out=false,sliceId=null,updated=0,created=1302,deleted=0,batches=2,versionConflicts=0,noops=0,retries=0,throttledUntil=0s,bulk_failures=[],search_failures=[]]

I can see that there were no issues with the reindex itself. It succeeded and created 1302 docs but failed to store the status.

On the second attempt the command ran successfully. It looks like a transient issue with the cluster itself. So you should be good to go.

Confirming that there were no issues in the reindex for the server whose logs you shared in Issues with Changelog 0087 - upgrade to ES 6 - #4 by erobinson.

I am sure you have ensured that there is sufficient disk space on the cluster, but I want to double-check that.

And the next time you face this issue, would you be able to run the following commands and share their output as well?

curl -XGET 'http://<host_ip>:9200/_cluster/health?pretty'
curl -XGET 'http://<host_ip>:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason' | grep .tasks
curl -XGET 'http://<host_ip>:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d'
{
  "index": ".tasks",
  "shard": 0,
  "primary": true
}'

I want to understand why ES is failing to update the .tasks index.
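If it is quick for you, the .tasks system index itself can also be checked directly (same <host_ip> placeholder as above; just an extra data point):

curl -XGET 'http://<host_ip>:9200/_cat/indices/.tasks?v'
curl -XGET 'http://<host_ip>:9200/.tasks/_count?pretty'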

I'll send another update as soon as I can take the server offline for an hour. In a nutshell: I took a full server disk snapshot after the latest CCHQ was deployed, before I started with 0087. Then, after 0087 failed, I restored the snapshot to ensure the server was in the same state as before starting. Effectively, I will start 0087 again at the next opportunity.

More to come,
Ed


Here you go:

(cchq) ccc@monolith:~/commcare-cloud$ curl -XGET 10.2.0.4:9200/_cluster/health?pretty
{
  "cluster_name" : "monolith-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 75,
  "active_shards" : 75,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
(cchq) ccc@monolith:~/commcare-cloud$ curl -XGET 'http://10.2.0.4:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason' | grep .tasks
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3817  100  3817    0     0  50735      0 --:--:-- --:--:-- --:--:-- 50893
.tasks                 0     p      STARTED es0
(cchq) ccc@monolith:~/commcare-cloud$ curl -XGET 'http://10.2.0.4:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d'
{
  "index": ".tasks",
  "shard": 0,
  "primary": true
}'
{
  "index" : ".tasks",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "1k_6ORgNRK6wPf8kc0_4xA",
    "name" : "es0",
    "transport_address" : "10.2.0.4:9300",
    "weight_ranking" : 1
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "yes",
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance"
}

Thanks @erobinson for sharing these. It looks like the .tasks index is fine and the cluster is also green, which means all the shards are assigned and the cluster is healthy.

Did you run these commands right after you got the errors with the reindex command? I am still not sure why you would get the errors on the .tasks index. I assume it might be due to certain resource constraints on the system, but the good thing is that it recovers quickly.

Based on your previous updates, I can see that the reindex command succeeds on reruns. And it's totally safe to run the reindex command multiple times.

Let me know if this workaround can work for you for now.

Hi Amit, correct, I ran those after I got the errors with the reindex command. I recall seeing something similar before the last time we did some reindexing. It may have to do with the fact that this is a monolith and resources are not dedicated. I will continue with the reindex of all the indices. Is there something I can do to double-check that the reindex worked for each?

Thanks again for your help,
Ed

One thing would be running the reindex command again. If it succeeds, then you should be good to go.

Another indicator of a successful reindex is the doc count in both the old and new index.

cchq monolith django-manage elastic_sync_multiplexed display_doc_counts <index_cname>

This should give you similar counts. The count may vary slightly for high-volume indices (like case-search/forms/cases) if the system is in use, but in most cases it should be the same.
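If you want to cross-check directly against Elasticsearch, the count API gives the same numbers (a sketch; substitute your ES host and the old/new index names that the reindex command prints, e.g. apps-20230524 and apps-2024-05-09 from your earlier log):

curl -XGET 'http://<host_ip>:9200/<old_index>/_count?pretty'
curl -XGET 'http://<host_ip>:9200/<new_index>/_count?pretty'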

OK great, thanks for the feedback Amit, it's very helpful!
Ed


So the reindex only throws errors for apps and cases (cases throws a different error:

(commcare-hq) (monolith) cchq@monolith:~/www/monolith/current$ INDEX_CNAME='cases'
(commcare-hq) (monolith) cchq@monolith:~/www/monolith/current$ ./manage.py elastic_sync_multiplexed start ${INDEX_CNAME}
2025-05-22 16:45:00,715 INFO [elastic_sync_multiplexed] Preparing index cases-2024-05-09 for reindex
2025-05-22 16:45:00,795 INFO [elastic_sync_multiplexed] Starting ReIndex process
2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] Copying docs from index cases-20230524 to index cases-2024-05-09




2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] -----------------IMPORTANT-----------------
2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] TASK NUMBER - 988
2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] -------------------------------------------
2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] Save this Task Number, You will need it later for verifying your reindex process




Looking for task with ID '1k_6ORgNRK6wPf8kc0_4xA:988' running on 'es0'
Progress 19.80% (78000 / 393846). Elapsed time: 0:00:10. Estimated remaining time: (average since start = 0:00:40) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 37.83% (149000 / 393846). Elapsed time: 0:00:20. Estimated remaining time: (average since start = 0:00:32) (recent average = 0:00:34)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 57.64% (227000 / 393846). Elapsed time: 0:00:30. Estimated remaining time: (average since start = 0:00:22) (recent average = 0:00:22)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 76.93% (303000 / 393846). Elapsed time: 0:00:40. Estimated remaining time: (average since start = 0:00:12) (recent average = 0:00:12)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 96.23% (379000 / 393846). Elapsed time: 0:00:50. Estimated remaining time: (average since start = 0:00:01) (recent average = 0:00:01)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 100.00% (393846 / 393846). Elapsed time: 0:00:52. Estimated remaining time: (average since start = 0:00:00) (recent average = 0:00:00)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988



2025-05-22 16:46:02,209 INFO [elastic_sync_multiplexed] Performing required cleanup!
2025-05-22 16:46:02,210 INFO [elastic_sync_multiplexed] Deleting Tombstones From Secondary Index

Doc Count In Old Index 'cases-20230524' - 393846

Doc Count In New Index 'cases-2024-05-09' - 393846

2025-05-22 16:46:02,244 INFO [elastic_sync_multiplexed] Verify this reindex process from elasticsearch logs using task id - 1k_6ORgNRK6wPf8kc0_4xA:988



2025-05-22 16:46:02,245 INFO [elastic_sync_multiplexed] You can use commcare-cloud to extract reindex logs from cluster

        cchq monolith run-shell-command elasticsearch "grep '988.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log"


Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/gevent/thread.py", line 276, in acquire
    acquired = BoundedSemaphore.acquire(self, blocking, timeout)
  File "src/gevent/_semaphore.py", line 184, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_semaphore.py", line 263, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_semaphore.py", line 253, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
  File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
gevent.exceptions.LoopExit: This operation would block forever
        Hub: <Hub '' at 0x79469736a9a0 epoll default pending=0 ref=0 fileno=4 resolver=<gevent.resolver.thread.Resolver at 0x794683513970 pool=<ThreadPool at 0x794683556c80 tasks=0 size=1 maxsize=10 hub=<Hub at 0x79469736a9a0 thread_ident=0x79469be661c0>>> threadpool=<ThreadPool at 0x794683556c80 tasks=0 size=1 maxsize=10 hub=<Hub at 0x79469736a9a0 thread_ident=0x79469be661c0>> thread_ident=0x79469be661c0>
        Handles:
[]
  • they both throw the error after reaching the 100% mark and seem to run through fine the second time around.)

The rest seem to run through fine, but another observation is that running the command to check the logs consistently produces the same error:

(cchq) ccc@monolith:~/commcare-cloud$ cchq monolith run-shell-command elasticsearch "grep '47205.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log"
ansible elasticsearch -m shell -i /home/ccc/environments/monolith/inventory.ini -a 'grep '"'"'47205.*ReindexResponse'"'"' /opt/data/elasticsearch*/logs/*.log' -u ansible '--ssh-common-args=-o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts' --diff
10.2.0.4 | FAILED | rc=1 >>
non-zero return code
✗ Apply failed with status code 2


The rest of the upgrade appeared to go well until the ES v6 deploy. The deploy_stack playbook gets to TASK [elasticsearch : Check that Elasticsearch is up after 20 seconds] and fails with timeout waiting for ES to restart.

/opt/data is only at 65% use after ES backups. I'll post full logs in the morning, thanks.

Apologies that the process has not been smooth for you, Ed!

  • they both throw the error after reaching the 100% mark and seem to run through fine the second time around.)

It is fine to proceed if they succeed the second time.

(cchq) ccc@monolith:~/commcare-cloud$ cchq monolith run-shell-command elasticsearch "grep '47205.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log"
ansible elasticsearch -m shell -i /home/ccc/environments/monolith/inventory.ini -a 'grep '"'"'47205.*ReindexResponse'"'"' /opt/data/elasticsearch*/logs/*.log' -u ansible '--ssh-common-args=-o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts' --diff
10.2.0.4 | FAILED | rc=1 >>
non-zero return code
✗ Apply failed with status code 2

Can you confirm that /opt/data/elasticsearch*/logs/*.log exists? You can also try putting in the specific names of the log files and then grepping for "47205.*ReindexResponse" in that path.
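For example, directly on the Elasticsearch machine, something like (adjust the wildcard if your install directory differs):

ls /opt/data/elasticsearch*/logs/*.log
grep '47205.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log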

TASK [elasticsearch : Check that Elasticsearch is up after 20 seconds] and fails with timeout waiting for ES to restart.

Can you share the Elasticsearch logs for the time it failed to start up?

Apologies for the delay; I created a mirror of the server in question and am working on that.
First up is the ES log for the initial 'apps' index sync that seems to consistently fail, in case it's useful:

After creating the new ES indices etc. at item 4 of the Upgrade Steps section, I expected output detailing the elasticsearch_version and elasticsearch_download_sha256 updates, but it doesn't seem to feature in the output - not sure if that's normal (it may be):

Now the main issue is starting ES after installation. At item 9, where it deploys the new ES and starts it up, here are the logs from directly after running:

cchq ${ENV} ansible-playbook deploy_stack.yml --limit=elasticsearch --tags=elasticsearch

Part1:

Part2:

Part3:

The main takeaway that I can see from those is:
JNA is not available

Would that be related to executable permissions on /tmp perhaps?

Thanks

First up is the ES log for the initial 'apps' index sync that seems to consistently fail

It is because of the issue we discussed above, and I have no clue why ES fails to write the task details to its internal index.

I expected output detailing the elasticsearch_version and elasticsearch_download_sha256 updates, but it doesn't seem to feature in the output - not sure if that's normal (it may be):

You should see a change with ELASTICSEARCH_MAJOR_VERSION: 6; if you have already run update-config once, then you will not see this change.
elasticsearch_version and elasticsearch_download_sha256 are used by the deploy-stack command, so I think the output you are seeing is expected if update-config has already been run once.

Would that be related to executable permissions on /tmp perhaps?

It certainly looks like it - /opt/data/ is mounted with noexec.

You can verify it with:

mount | grep /opt/data

Look for noexec in the output. From your logs, I can see it's mounted on /dev/sdb1 as ext4.

If you see it, then one fix is to edit your fstab file and remove noexec from the disk config:

sudo nano /etc/fstab

Find the line for /dev/sdb1 and remove 'noexec' from the options column.
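Purely as an illustration (your device/UUID and the other options will differ), the change would be from something like:

/dev/sdb1  /opt/data  ext4  defaults,noexec  0  2

to:

/dev/sdb1  /opt/data  ext4  defaults  0  2

Then sudo mount -o remount,exec /opt/data (or a reboot) would apply it.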

Let me know if this works or you need any other information on it.


Hi, no, it appears to be mounted like this:


(cchq) ccc@monolith:~/commcare-cloud$ mount | grep /opt/data
/dev/sdb1 on /opt/data type ext4 (rw,relatime)

fstab looks like this:


UUID=21ddeaa9-4e80-47ee-8552-89c7cc23b051 /opt/data    ext4    defaults    0    2

The contents of /opt/data look as follows:

(cchq) ccc@monolith:~/commcare-cloud$ ls /opt/data -lt
total 1194076
drwxr-xr-x 3 redis         root                4096 Jun  4 13:57 redis
drwxr-xr-x 5 elasticsearch elasticsearch       4096 May 29 13:06 elasticsearch-6.8.23
drwxrwxr-x 3 nobody        nfs            148914176 May 29 12:10 blobdb
drwxr-xr-x 4 root          root                4096 May 17 12:47 home
drwxr-xr-x 5 cchq          cchq                4096 Jun  7  2024 formplayer
drwxr-xr-x 4 elasticsearch elasticsearch       4096 Feb  2  2024 elasticsearch-5.6.16-new-installation
drwxrwxrwx 6 root          root                4096 Feb  2  2024 backups
drwxr-x--- 5 couchdb       couchdb             4096 Aug  9  2023 couchdb2
drwxr-xr-x 3 kafka         kafka               4096 Aug  9  2023 kafka
drwxr-xr-x 4 elasticsearch elasticsearch       4096 Aug  9  2023 elasticsearch-5.6.16
drwxr-xr-x 4 elasticsearch elasticsearch       4096 Aug  9  2023 elasticsearch-5.6.16-backup
-rw-r--r-- 1 root          root          1073741824 Aug  9  2023 emerg_delete.dummy
drwx------ 3 postgres      postgres            4096 Aug  9  2023 postgresql
-rwxrwx--- 1 root          root                  55 Aug  9  2023 README

Thanks
Ed

Thanks for sharing these details, @erobinson.

Looks like the JVM is not able to access the tmp directory that we are providing it.

I have made some changes in commcare-cloud to make that setting configurable.

Can you pull the latest version of commcare-cloud and update your environment's public.yml file with:

elasticsearch_jvm_tmp_dir: '/tmp'

I think the JVM should be able to access the /tmp directory to create the JNA files.

Let's try that, and if it does not work we can try changing the directory.
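One optional sanity check before redeploying, in case /tmp is itself a separate mount with noexec (no output here just means /tmp is part of the root filesystem, which is fine):

mount | grep ' /tmp '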

Out of curiosity - did you face this problem on all of your servers or just one?

Thanks Amit, I'll test and revert ASAP. Unfortunately I only got to test one as we'd put the update on hold on the other until we resolved it here. Given they're identically set up (same infrastructure provider and OS image), I imagine it will be the same.

1 Like