Elasticsearch: primary shards lost after server restart due to power failure

demis08 · February 9, 2021, 5:59am

Hello,

I am stuck with a problem on an Elasticsearch node after a server restart caused by a power failure. The following is a detailed description of our scenario.

We have a 5-node ES cluster, as shown below.

ansible@echis8:~$ curl -XGET 172.19.4.43:9200/_cat/nodes?v
host        ip          heap.percent ram.percent load node.role master name
172.19.4.46 172.19.4.46 2            93          1.14 d         *      es3
172.19.4.48 172.19.4.48 5            44          3.00 d         m      es4
172.19.4.43 172.19.4.43 2            72          1.28 d         m      es1
172.19.3.41 172.19.3.41 5            59          0.37 d         m      es0
172.19.4.36 172.19.4.36 5            66          0.21 d         m      es2

The following shows our indices:

ansible@echis8:~$ curl -XGET 172.19.4.43:9200/_cat/indices?v
health status index                                         pri rep docs.count docs.deleted store.size pri.store.size
red    open   smslogs_2020-01-28                            5   1   0          0            636b       318b
red    open   xforms_2016-07-07                             5   1   1671642    59761        8.5gb      4.2gb
red    open   case_search_2018-05-29                        5   1   115035243  786313       22gb       11gb
red    open   hqgroups_2017-05-29                           5   1   4          10           31.8kb     15.9kb
green  open   report_cases_czei39du507m9mmpqk3y01x72a3ux4p0 5   1   0          0            1.5kb      795b
red    open   hqapps_2020-02-26                             5   1   5733       712          51.6mb     25.8mb
red    open   hqusers_2017-09-07                            2   1   88958      12452        65.6mb     32.7mb
red    open   hqdomains_2020-02-10                          5   1   1          0            57.6kb     28.8kb
green  open   report_xforms_20160824_1708                   5   1   0          0            1.5kb      795b
red    open   hqcases_2016-03-04                            5   1
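
In case it helps anyone reading a listing like the one above, here is a minimal Python sketch that pulls the red indices out of saved `_cat/indices?v` output. The function name `red_indices` and the variable `sample` are made up for illustration, and the sample is only a subset of the rows above. Rows with missing trailing columns (like the last row, which is cut off in this post) are still accepted as long as the health, status and index fields are present.

```python
def red_indices(cat_output):
    """Return the names of indices whose health column reads 'red'."""
    names = []
    for line in cat_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and any line too short to hold
        # the health, status and index columns.
        if len(fields) < 3 or fields[0] == "health":
            continue
        if fields[0] == "red":
            names.append(fields[2])
    return names


# A few rows copied from the listing above (hypothetical variable name).
sample = """\
health status index pri rep docs.count docs.deleted store.size pri.store.size
red open smslogs_2020-01-28 5 1 0 0 636b 318b
green open report_xforms_20160824_1708 5 1 0 0 1.5kb 795b
red open hqcases_2016-03-04 5 1
"""

print(red_indices(sample))
```

The same text could be fetched with the curl command shown above and piped into a script; the parsing only relies on whitespace-separated columns.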

The following error is shown in the log:

[2021-02-08 14:01:37,916][INFO ][cluster.service ] [es1] detected_master {es3}{5oxdhS4UQ5Gs79sAMp8jNA}{172.19.4.46}{172.19.4.46:9300}{max_local_storage_nodes=1}, added {{es3}{5oxdhS4UQ5Gs79sAMp8jNA}{172.19.4.46}{172.19.4.46:9300}{max_local_storage_nodes=1},{es4}{V6rcOJ1OTd-kscymmx6vhA}{172.19.4.48}{172.19.4.48:9300}{max_local_storage_nodes=1},{es0}{XZnVvUxdQbeukWBptsuulQ}{172.19.3.41}{172.19.3.41:9300}{max_local_storage_nodes=1},{es2}{obkkwxrlQ4KbTzOR21Ix_Q}{172.19.4.36}{172.19.4.36:9300}{max_local_storage_nodes=1},}, reason: zen-disco-receive(from master [{es3}{5oxdhS4UQ5Gs79sAMp8jNA}{172.19.4.46}{172.19.4.46:9300}{max_local_storage_nodes=1}])
[2021-02-08 14:01:38,092][WARN ][gateway ] [es1] [xforms_2016-07-07][1] shard state info found but indexUUID didn't match expected [DvE3mV0bS9aJ2V-bl9dkkA] actual [Litr2IELSgib-l3ZbndxrQ]
[2021-02-08 14:01:38,098][WARN ][gateway ] [es1] [xforms_2016-07-07][0] shard state info found but indexUUID didn't match expected [DvE3mV0bS9aJ2V-bl9dkkA] actual [Litr2IELSgib-l3ZbndxrQ]
[2021-02-08 14:01:38,109][WARN ][gateway ] [es1] [hqdomains_2020-02-10][2] shard state info found but indexUUID didn't match expected [sYdD4YntSzShdieuEJhflQ] actual [98MjNAdpQ2ywJaqmIlRL_g]
[2021-02-08 14:01:38,115][WARN ][gateway ] [es1] [hqdomains_2020-02-10][0] shard state info found but indexUUID didn't match expected [sYdD4YntSzShdieuEJhflQ] actual [98MjNAdpQ2ywJaqmIlRL_g]
[2021-02-08 14:01:38,215][INFO ][indices.store ] [es1] updating indices.store.throttle.type from [NONE] to [all]
[2021-02-08 14:01:38,215][INFO ][indices.store ] [es1] updating indices.store.throttle.max_bytes_per_sec from [10gb] to [500mb], note, type is [all]
[2021-02-08 14:01:40,618][WARN ][indices.cluster ] [es1] [[case_search_2018-05-29][4]] marking and sending shard failed due to [failed recovery]
[case_search_2018-05-29][[case_search_2018-05-29][4]] IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: ]; nested: IndexNotFoundException[no segments* file found in store(default(mmapfs(/opt/data/elasticsearch-2.4.6/data/echis-es/nodes/0/indices/case_search_2018-05-29/4/index),niofs(/opt/data/elasticsearch-2.4.6/data/echis-es/nodes/0/indices/case_search_2018-05-29/4/index))): files: ];
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:224)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: [case_search_2018-05-29][[case_search_2018-05-29][4]] IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: ]; nested: IndexNotFoundException[no segments* file found in store(default(mmapfs(/opt/data/elasticsearch-2.4.6/data/echis-es/nodes/0/indices/case_search_2018-05-29/4/index),niofs(/opt/data/elasticsearch-2.4.6/data/echis-es/nodes/0/indices/case_search_2018-05-29/4/index))): files: ];
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:208)
... 5 more
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(default(mmapfs(/opt/data/elasticsearch-2.4.6/data/echis-es/nodes/0/indices/case_search_2018-05-29/4/index),niofs(/opt/data/elasticsearch-2.4.6/data/echis-es/nodes/0/indices/case_search_2018-05-29/4/index))): files:
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:726)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:490)
at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:95)
at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:164)
at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:149)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:199)
... 5 more
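
The "no segments* file found" part of the trace suggests checking which shard directories on disk actually lack a Lucene commit file. A rough Python sketch of such a check follows. It assumes the `<indices dir>/<index>/<shard>/index` layout seen in the paths in the log, and the function name `shards_missing_segments` is hypothetical. The demo runs against a throwaway temporary tree, not the real data directory.

```python
import os
import tempfile


def shards_missing_segments(indices_dir):
    """Return (index, shard) pairs whose Lucene directory has no segments_N file."""
    missing = []
    for index_name in sorted(os.listdir(indices_dir)):
        index_path = os.path.join(indices_dir, index_name)
        if not os.path.isdir(index_path):
            continue
        for shard in sorted(os.listdir(index_path)):
            # Only numeric directory names are treated as shards;
            # anything else (for example "_state") is skipped.
            if not shard.isdigit():
                continue
            lucene_dir = os.path.join(index_path, shard, "index")
            files = os.listdir(lucene_dir) if os.path.isdir(lucene_dir) else []
            if not any(name.startswith("segments_") for name in files):
                missing.append((index_name, shard))
    return missing


# Demo on a temporary tree that imitates the layout from the log paths.
with tempfile.TemporaryDirectory() as root:
    healthy = os.path.join(root, "hqusers_2017-09-07", "0", "index")
    empty = os.path.join(root, "case_search_2018-05-29", "4", "index")
    os.makedirs(healthy)
    os.makedirs(empty)
    open(os.path.join(healthy, "segments_3"), "w").close()
    print(shards_missing_segments(root))
```

On a real node the argument would be the `indices` directory under the node's data path, and the script only reads directory listings, so it does not modify anything.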

Since the issue persisted for more than a day, I took a backup of the logs (for further investigation) and recreated the indices.
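
For going through the log backup, a small Python sketch like the following could list every "indexUUID didn't match" warning with its index, shard, and the expected and actual UUIDs. The regular expression is written against the exact wording of the warnings quoted above; the names `UUID_MISMATCH`, `uuid_mismatches` and `sample_log` are made up for this example.

```python
import re

# Matches the gateway warning text as it appears in the log excerpt above.
UUID_MISMATCH = re.compile(
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\] shard state info found but "
    r"indexUUID didn't match expected \[(?P<expected>[^\]]+)\] "
    r"actual \[(?P<actual>[^\]]+)\]"
)


def uuid_mismatches(log_text):
    """Return (index, shard, expected_uuid, actual_uuid) for each warning found."""
    return [
        (m.group("index"), int(m.group("shard")), m.group("expected"), m.group("actual"))
        for m in UUID_MISMATCH.finditer(log_text)
    ]


# Two warning lines copied from the log excerpt (hypothetical variable name).
sample_log = (
    "[2021-02-08 14:01:38,092][WARN ][gateway ] [es1] [xforms_2016-07-07][1] "
    "shard state info found but indexUUID didn't match expected "
    "[DvE3mV0bS9aJ2V-bl9dkkA] actual [Litr2IELSgib-l3ZbndxrQ]\n"
    "[2021-02-08 14:01:38,109][WARN ][gateway ] [es1] [hqdomains_2020-02-10][2] "
    "shard state info found but indexUUID didn't match expected "
    "[sYdD4YntSzShdieuEJhflQ] actual [98MjNAdpQ2ywJaqmIlRL_g]\n"
)

for index, shard, expected, actual in uuid_mismatches(sample_log):
    print(index, shard, expected, actual)
```

Grouping the results by index would show whether every affected shard of an index reports the same pair of UUIDs.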