Issues with Changelog 0087 - upgrade to ES 6

I'm following the instructions for changelog 0087 and am running into issues:

In step 4, I updated my public.yml -

ELASTICSEARCH_MAJOR_VERSION: 6 (it was 5)

I also added these lines that were missing under the localsettings: block:

ES_APPS_INDEX_MULTIPLEXED: True
ES_CASE_SEARCH_INDEX_MULTIPLEXED: True
ES_CASES_INDEX_MULTIPLEXED: True
ES_DOMAINS_INDEX_MULTIPLEXED: True
ES_FORMS_INDEX_MULTIPLEXED: True
ES_GROUPS_INDEX_MULTIPLEXED: True
ES_SMS_INDEX_MULTIPLEXED: True
ES_USERS_INDEX_MULTIPLEXED: True
ES_APPS_INDEX_SWAPPED: False
ES_CASE_SEARCH_INDEX_SWAPPED: False
ES_CASES_INDEX_SWAPPED: False
ES_DOMAINS_INDEX_SWAPPED: False
ES_FORMS_INDEX_SWAPPED: False
ES_GROUPS_INDEX_SWAPPED: False
ES_SMS_INDEX_SWAPPED: False
ES_USERS_INDEX_SWAPPED: False
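For context, these all sit under the existing localsettings: key in public.yml, roughly like this (an abridged sketch of my file):

localsettings:
  # ...existing settings...
  ES_APPS_INDEX_MULTIPLEXED: True
  # ...remaining *_MULTIPLEXED flags...
  ES_APPS_INDEX_SWAPPED: False
  # ...remaining *_SWAPPED flags...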

Neither of the step 5 commands runs:

cchq django-manage update_config
cchq django-manage restart_services

I assume the first command should be
cchq <env> update-config

For the second command, I used:

cchq <env> downtime start

...followed by:

cchq <env> downtime end

For step 6, the reindex fails for all indices with:

(python_env-3.9) (monolith) cchq@monolith:~/www/monolith/current$ INDEX_CNAME='apps'
(python_env-3.9) (monolith) cchq@monolith:~/www/monolith/current$ ./manage.py elastic_sync_multiplexed start ${INDEX_CNAME}
Traceback (most recent call last):
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/./manage.py", line 172, in <module>
    main()
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/./manage.py", line 50, in main
    execute_from_command_line(sys.argv)
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/corehq/apps/es/management/commands/elastic_sync_multiplexed.py", line 624, in handle
    cmd_func(
  File "/home/cchq/www/monolith/releases/2025-05-11_05.59/corehq/apps/es/management/commands/elastic_sync_multiplexed.py", line 50, in start_reindex
    raise IndexNotMultiplexedException("""Index not multiplexed!
corehq.apps.es.exceptions.IndexNotMultiplexedException: Index not multiplexed!
            Sync can only be run on multiplexed indices

I recall having this issue previously with the upgrade to ES 5 in Jan 2024. See here:

The response at that time was: "Since you have started the system in November (2023) and around that time the indices were already multiplexed so all data was successfully populated in your secondary indices. It means you don't need to reindex process on this server. You can straightaway go with the ES 5 upgrade process and follow this changelog"

Would that apply here? I just find it odd that, if the indices are already multiplexed, the error says they are not.

Thanks!

Hey @erobinson!

Hope you are doing good.

I think there was a setting that was missed in the changelog.

Can you add the following setting in your localsettings file?

ES_MULTIPLEX_TO_VERSION: '6'

I think it should fix the issue that you are facing.
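Assuming you are managing these through the localsettings: block in public.yml (as with the flags above), the block would end up looking something like this (sketch only):

localsettings:
  # ...existing settings, including the *_MULTIPLEXED / *_SWAPPED flags...
  ES_MULTIPLEX_TO_VERSION: '6'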

In the meantime, I will update the changelog as well.

Thanks for flagging this and apologies for the inconvenience caused.


Thanks Amit, I'll let you know as soon as we have a chance to try again.
Ed

Hi Amit, the reindex seemed to kick off but I ran into another issue.
Please take a look at these logs and let me know your thoughts.

This was on the first re-index (apps):

The result of running
cchq <env> run-shell-command elasticsearch "grep '<Task Number>.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log"
from the control machine was:

I then checked the ES log using cat and found this:

I tried again after that:

Once again, the run-shell-command output:

and the ES log:

Hi all, I'm just bumping this as we have another upgrade to do this weekend and I'd like to see if we can resolve it. Thanks!

I have the same issue on another server. On the first re-index (apps), I get the following output:

(monolith) cchq@monolith:~$ sudo -iu cchq
(monolith) cchq@monolith:~$ cd /home/cchq/www/monolith/current
(monolith) cchq@monolith:~/www/monolith/current$ source python_env/bin/activate
(commcare-hq) (monolith) cchq@monolith:~/www/monolith/current$ INDEX_CNAME='apps'
(commcare-hq) (monolith) cchq@monolith:~/www/monolith/current$ ./manage.py elastic_sync_multiplexed start ${INDEX_CNAME}
2025-05-17 17:04:03,188 INFO [elastic_sync_multiplexed] Preparing index apps-2024-05-09 for reindex
2025-05-17 17:04:03,260 INFO [elastic_sync_multiplexed] Starting ReIndex process
2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] Copying docs from index apps-20230524 to index apps-2024-05-09




2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] -----------------IMPORTANT-----------------
2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] TASK NUMBER - 1124
2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] -------------------------------------------
2025-05-17 17:04:03,278 INFO [elastic_sync_multiplexed] Save this Task Number, You will need it later for verifying your reindex process




Looking for task with ID '1k_6ORgNRK6wPf8kc0_4xA:1124' running on 'es0'
Progress 0.00% (0 / 1527). Elapsed time: 0:00:10. Estimated remaining time: (average since start = unknown) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:00:20. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:00:30. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:00:40. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:00:50. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:00. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:10. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:20. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:30. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Progress 100.00% (1527 / 1527). Elapsed time: 0:01:40. Estimated remaining time: (average since start = 0:00:00) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:1124
Traceback (most recent call last):
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/./manage.py", line 175, in <module>
    main()
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/./manage.py", line 51, in main
    execute_from_command_line(sys.argv)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 412, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/django/core/management/base.py", line 458, in execute
    output = self.handle(*args, **options)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/corehq/apps/es/management/commands/elastic_sync_multiplexed.py", line 624, in handle
    cmd_func(
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/corehq/apps/es/management/commands/elastic_sync_multiplexed.py", line 72, in start_reindex
    check_task_progress(task_id)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/corehq/apps/es/utils.py", line 101, in check_task_progress
    task_details = manager.get_task(task_id=task_id)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/corehq/apps/es/client.py", line 154, in get_task
    task_details = self._es.tasks.get(task_id=task_id)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/client/utils.py", line 101, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/client/tasks.py", line 84, in get
    return self.transport.perform_request(
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/transport.py", line 402, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/connection/http_urllib3.py", line 252, in perform_request
    self._raise_error(response.status, raw_data)
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/elasticsearch6/connection/base.py", line 253, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch6.exceptions.TransportError: TransportError(503, 'no_shard_available_action_exception', 'No shard available for [get [.tasks][task][1k_6ORgNRK6wPf8kc0_4xA:1124]: routing [null]]')

Any help would be greatly appreciated!

Hey @erobinson!

I was out sick on Friday so did not see your reply then. Apologies for the delayed response. I had a look at the logs that you have shared. This looks like an issue with the .tasks index. Elasticsearch uses it to track the status of tasks like the reindex. The elastic_sync_multiplexed command uses Elasticsearch's tasks API to fetch the status of the reindex and display it on the shell, and that is what is failing here.

Looking at the log

[2025-05-15T04:46:01,932][WARN ][o.e.t.TaskManager        ] [node_name][es0]  couldn't store response BulkIndexByScrollResponse[took=1m,timed_out=false,sliceId=null,updated=0,created=1302,deleted=0,batches=2,versionConflicts=0,noops=0,retries=0,throttledUntil=0s,bulk_failures=[],search_failures=[]]

I can see that there were no issues with the reindex itself. It succeeded and created 1302 docs but failed to store the status.

On the second attempt the command ran successfully. It looks like a transient issue with the cluster itself. So you should be good to go.

Confirming that there were no issues in the reindex for the server whose logs you shared in Issues with Changelog 0087 - upgrade to ES 6 - #4 by erobinson.

I am sure you have ensured that there is sufficient disk space on the cluster, but I want to double-check that.

And the next time you face this issue, would you be able to run the following commands and share their output as well?

curl -XGET 'http://<host_ip>:9200/_cluster/health?pretty'
curl -XGET 'http://<host_ip>:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason' | grep .tasks
curl -XGET 'http://<host_ip>:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d'
{
  "index": ".tasks",
  "shard": 0,
  "primary": true
}'

I want to understand why ES is failing to update the .tasks index.
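If it is quick for you, the .tasks system index itself can also be checked directly (same <host_ip> placeholder as above; just an extra data point):

curl -XGET 'http://<host_ip>:9200/_cat/indices/.tasks?v'
curl -XGET 'http://<host_ip>:9200/.tasks/_count?pretty'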

I'll send another update as soon as I can take the server offline for an hour. In a nutshell: I took a full server disk snapshot after the latest CCHQ was deployed, before I started with 0087. Then, after 0087 failed, I restored the snapshot to ensure the server was in the same state as before starting. Effectively, I will start 0087 again at the next opportunity.

More to come,
Ed


Here you go:

(cchq) ccc@monolith:~/commcare-cloud$ curl -XGET 10.2.0.4:9200/_cluster/health?pretty
{
  "cluster_name" : "monolith-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 75,
  "active_shards" : 75,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
(cchq) ccc@monolith:~/commcare-cloud$ curl -XGET 'http://10.2.0.4:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason' | grep .tasks
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3817  100  3817    0     0  50735      0 --:--:-- --:--:-- --:--:-- 50893
.tasks                 0     p      STARTED es0
(cchq) ccc@monolith:~/commcare-cloud$ curl -XGET 'http://10.2.0.4:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d'
{
  "index": ".tasks",
  "shard": 0,
  "primary": true
}'
{
  "index" : ".tasks",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "1k_6ORgNRK6wPf8kc0_4xA",
    "name" : "es0",
    "transport_address" : "10.2.0.4:9300",
    "weight_ranking" : 1
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "yes",
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance"
}

Thanks @erobinson for sharing these. It looks like the .tasks index is fine and the cluster is also green, which means all the shards are assigned and the cluster is healthy.

Did you run these commands right after you got the errors with the reindex command? I am still not sure why you would get the errors on the .tasks index. I assume it might be due to certain resource constraints on the system, but the good thing is that it recovers quickly.

Based on your previous updates, I can see that the reindex command succeeds on reruns. And it's totally safe to run the reindex command multiple times.

Let me know if this workaround can work for you for now.

Hi Amit, correct, I ran those after I got the errors with the reindex command. I recall seeing something similar before the last time we did some reindexing. It may have to do with the fact that this is a monolith and resources are not dedicated. I will continue with the reindex of all the indices. Is there something I can do to double-check that the reindex worked for each?

Thanks again for your help,
Ed

One thing would be running the reindex command again. If it succeeds, then you should be good to go.

Another indicator of a successful reindex is the doc count in both the old and new index.

cchq monolith django-manage elastic_sync_multiplexed display_doc_counts <index_cname>

This should give you similar counts. The count may vary slightly for high-volume indices (like case-search/forms/cases) if the system is in use, but in most cases it should be the same.
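If you want to cross-check directly against Elasticsearch, the count API gives the same numbers (a sketch; substitute your ES host and the old/new index names that the reindex command prints, e.g. apps-20230524 and apps-2024-05-09 from your earlier log):

curl -XGET 'http://<host_ip>:9200/<old_index>/_count?pretty'
curl -XGET 'http://<host_ip>:9200/<new_index>/_count?pretty'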

OK great, thanks for the feedback Amit, it's very helpful!
Ed


So the reindex only throws errors for apps and cases (cases throws a different error:

(commcare-hq) (monolith) cchq@monolith:~/www/monolith/current$ INDEX_CNAME='cases'
(commcare-hq) (monolith) cchq@monolith:~/www/monolith/current$ ./manage.py elastic_sync_multiplexed start ${INDEX_CNAME}
2025-05-22 16:45:00,715 INFO [elastic_sync_multiplexed] Preparing index cases-2024-05-09 for reindex
2025-05-22 16:45:00,795 INFO [elastic_sync_multiplexed] Starting ReIndex process
2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] Copying docs from index cases-20230524 to index cases-2024-05-09




2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] -----------------IMPORTANT-----------------
2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] TASK NUMBER - 988
2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] -------------------------------------------
2025-05-22 16:45:00,799 INFO [elastic_sync_multiplexed] Save this Task Number, You will need it later for verifying your reindex process




Looking for task with ID '1k_6ORgNRK6wPf8kc0_4xA:988' running on 'es0'
Progress 19.80% (78000 / 393846). Elapsed time: 0:00:10. Estimated remaining time: (average since start = 0:00:40) (recent average = )  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 37.83% (149000 / 393846). Elapsed time: 0:00:20. Estimated remaining time: (average since start = 0:00:32) (recent average = 0:00:34)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 57.64% (227000 / 393846). Elapsed time: 0:00:30. Estimated remaining time: (average since start = 0:00:22) (recent average = 0:00:22)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 76.93% (303000 / 393846). Elapsed time: 0:00:40. Estimated remaining time: (average since start = 0:00:12) (recent average = 0:00:12)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 96.23% (379000 / 393846). Elapsed time: 0:00:50. Estimated remaining time: (average since start = 0:00:01) (recent average = 0:00:01)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988
Progress 100.00% (393846 / 393846). Elapsed time: 0:00:52. Estimated remaining time: (average since start = 0:00:00) (recent average = 0:00:00)  Task ID: 1k_6ORgNRK6wPf8kc0_4xA:988



2025-05-22 16:46:02,209 INFO [elastic_sync_multiplexed] Performing required cleanup!
2025-05-22 16:46:02,210 INFO [elastic_sync_multiplexed] Deleting Tombstones From Secondary Index

Doc Count In Old Index 'cases-20230524' - 393846

Doc Count In New Index 'cases-2024-05-09' - 393846

2025-05-22 16:46:02,244 INFO [elastic_sync_multiplexed] Verify this reindex process from elasticsearch logs using task id - 1k_6ORgNRK6wPf8kc0_4xA:988



2025-05-22 16:46:02,245 INFO [elastic_sync_multiplexed] You can use commcare-cloud to extract reindex logs from cluster

        cchq monolith run-shell-command elasticsearch "grep '988.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log"


Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/cchq/www/monolith/releases/2025-05-17_13.10/python_env/lib/python3.9/site-packages/gevent/thread.py", line 276, in acquire
    acquired = BoundedSemaphore.acquire(self, blocking, timeout)
  File "src/gevent/_semaphore.py", line 184, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_semaphore.py", line 263, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_semaphore.py", line 253, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
  File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
gevent.exceptions.LoopExit: This operation would block forever
        Hub: <Hub '' at 0x79469736a9a0 epoll default pending=0 ref=0 fileno=4 resolver=<gevent.resolver.thread.Resolver at 0x794683513970 pool=<ThreadPool at 0x794683556c80 tasks=0 size=1 maxsize=10 hub=<Hub at 0x79469736a9a0 thread_ident=0x79469be661c0>>> threadpool=<ThreadPool at 0x794683556c80 tasks=0 size=1 maxsize=10 hub=<Hub at 0x79469736a9a0 thread_ident=0x79469be661c0>> thread_ident=0x79469be661c0>
        Handles:
[]
  • they both throw the error after reaching the 100% mark and seem to run through fine the second time around.)

The rest seem to run through fine, but another observation is that running the command to check the logs consistently produces the same error:

(cchq) ccc@monolith:~/commcare-cloud$ cchq monolith run-shell-command elasticsearch "grep '47205.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log"
ansible elasticsearch -m shell -i /home/ccc/environments/monolith/inventory.ini -a 'grep '"'"'47205.*ReindexResponse'"'"' /opt/data/elasticsearch*/logs/*.log' -u ansible '--ssh-common-args=-o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts' --diff
10.2.0.4 | FAILED | rc=1 >>
non-zero return code
✗ Apply failed with status code 2


The rest of the upgrade appeared to go well until the ES v6 deploy. The deploy_stack playbook gets to TASK [elasticsearch : Check that Elasticsearch is up after 20 seconds] and fails with timeout waiting for ES to restart.

/opt/data is only at 65% use after ES backups. I'll post full logs in the morning, thanks.

Apologies that the process has not been smooth for you, Ed!

  • they both throw the error after reaching the 100% mark and seem to run through fine the second time around.)

It is fine to proceed if they succeed the second time.

(cchq) ccc@monolith:~/commcare-cloud$ cchq monolith run-shell-command elasticsearch "grep '47205.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log"
ansible elasticsearch -m shell -i /home/ccc/environments/monolith/inventory.ini -a 'grep '"'"'47205.*ReindexResponse'"'"' /opt/data/elasticsearch*/logs/*.log' -u ansible '--ssh-common-args=-o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts' --diff
10.2.0.4 | FAILED | rc=1 >>
non-zero return code
✗ Apply failed with status code 2

Can you confirm that /opt/data/elasticsearch*/logs/*.log exists? You can also try putting in the specific names of the log files and then grepping for "47205.*ReindexResponse" in that path.
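For example, directly on the Elasticsearch machine, something like (adjust the wildcard if your install directory differs):

ls /opt/data/elasticsearch*/logs/*.log
grep '47205.*ReindexResponse' /opt/data/elasticsearch*/logs/*.log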

TASK [elasticsearch : Check that Elasticsearch is up after 20 seconds] and fails with timeout waiting for ES to restart.

Can you share the Elasticsearch logs for the time it failed to start up?

Apologies for the delay; I created a mirror of the server in question and am working on that.
First up is the ES log for the initial 'apps' index sync that seems to consistently fail, in case it's useful:

After creating the new ES indices etc. at item 4 of the Upgrade Steps section, I expected output detailing the elasticsearch_version and elasticsearch_download_sha256 updates, but it doesn't seem to feature in the output - not sure if that's normal (it may be):

Now the main issue is starting ES after installation. At item 9, where it deploys the new ES and starts it up, here are the logs from directly after running:

cchq ${ENV} ansible-playbook deploy_stack.yml --limit=elasticsearch --tags=elasticsearch

Part1:

Part2:

Part3:

The main takeaway that I can see from those is:
JNA is not available

Would that be related to executable permissions on /tmp perhaps?

Thanks

First up is the ES log for the initial 'apps' index sync that seems to consistently fail

It is because of the issue we discussed above, and I have no clue why ES fails to write the task details to its internal index.

I expected output detailing the elasticsearch_version and elasticsearch_download_sha256 updates, but it doesn't seem to feature in the output - not sure if that's normal (it may be):

You should see a change with ELASTICSEARCH_MAJOR_VERSION: 6; if you have already run update-config once, then you will not see this change.
elasticsearch_version and elasticsearch_download_sha256 are used by the deploy-stack command, so I think the output you are seeing is expected if update-config has already been run once.

Would that be related to executable permissions on /tmp perhaps?

It certainly looks like it - /opt/data/ is mounted with noexec.

You can verify it with:

mount | grep /opt/data

Look for noexec in the output. From your logs, I can see it's mounted on /dev/sdb1 as ext4.

If you see it, then one fix is to edit your fstab file and remove noexec from the disk config:

sudo nano /etc/fstab

Find the line for /dev/sdb1 and remove 'noexec' from the options column.
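Purely as an illustration (your device/UUID and the other options will differ), the change would be from something like:

/dev/sdb1  /opt/data  ext4  defaults,noexec  0  2

to:

/dev/sdb1  /opt/data  ext4  defaults  0  2

Then sudo mount -o remount,exec /opt/data (or a reboot) would apply it.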

Let me know if this works or you need any other information on it.


Hi, no, it appears to be mounted like this:


(cchq) ccc@monolith:~/commcare-cloud$ mount | grep /opt/data
/dev/sdb1 on /opt/data type ext4 (rw,relatime)

fstab looks like this:


UUID=21ddeaa9-4e80-47ee-8552-89c7cc23b051 /opt/data    ext4    defaults    0    2

The contents of /opt/data look as follows:

(cchq) ccc@monolith:~/commcare-cloud$ ls /opt/data -lt
total 1194076
drwxr-xr-x 3 redis         root                4096 Jun  4 13:57 redis
drwxr-xr-x 5 elasticsearch elasticsearch       4096 May 29 13:06 elasticsearch-6.8.23
drwxrwxr-x 3 nobody        nfs            148914176 May 29 12:10 blobdb
drwxr-xr-x 4 root          root                4096 May 17 12:47 home
drwxr-xr-x 5 cchq          cchq                4096 Jun  7  2024 formplayer
drwxr-xr-x 4 elasticsearch elasticsearch       4096 Feb  2  2024 elasticsearch-5.6.16-new-installation
drwxrwxrwx 6 root          root                4096 Feb  2  2024 backups
drwxr-x--- 5 couchdb       couchdb             4096 Aug  9  2023 couchdb2
drwxr-xr-x 3 kafka         kafka               4096 Aug  9  2023 kafka
drwxr-xr-x 4 elasticsearch elasticsearch       4096 Aug  9  2023 elasticsearch-5.6.16
drwxr-xr-x 4 elasticsearch elasticsearch       4096 Aug  9  2023 elasticsearch-5.6.16-backup
-rw-r--r-- 1 root          root          1073741824 Aug  9  2023 emerg_delete.dummy
drwx------ 3 postgres      postgres            4096 Aug  9  2023 postgresql
-rwxrwx--- 1 root          root                  55 Aug  9  2023 README

Thanks
Ed

Thanks for sharing these details, @erobinson.

Looks like the JVM is not able to access the tmp directory that we are providing it.

I have made some changes in commcare-cloud to make that setting configurable.

Can you pull the latest version of commcare-cloud and update your environment's public.yml file with:

elasticsearch_jvm_tmp_dir: '/tmp'

I think the JVM should be able to access the /tmp directory to create the JNA files.

Let's try that, and if it does not work we can try changing the directory.
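One optional sanity check before redeploying, in case /tmp is itself a separate mount with noexec (no output here just means /tmp is part of the root filesystem, which is fine):

mount | grep ' /tmp '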

Out of curiosity - did you face this problem on all of your servers or just one?

Thanks Amit, I'll test and revert ASAP. Unfortunately I only got to test one as we'd put the update on hold on the other until we resolved it here. Given they're identically set up (same infrastructure provider and OS image), I imagine it will be the same.

1 Like