Pillowtop service stalling

Full disclosure: the app where we're having this challenge has grown significantly, to the point that we want to split it into multiple apps. We're aware of that, and I imagine it's contributing to the issue, but our pillowtop service seems to stall periodically for the xform-pillow and case-pillow pillows.

What I'm having to do is restart those pillows, after which around 10 changes get processed, then restart them again; otherwise they just sit there and don't process anything.

For example, looking at my case pillows on the system info page:

My understanding is that there are 507 outstanding items to process. The number doesn't budge until I restart the pillow with:
cchq monolith service pillowtop restart --only=case-pillow
...after which it seems to get through 10:


...and it sits there until I restart again.
I had similar issues with both the xform and case pillows (the xform one has since cleared). The log shows this from the time I restarted the pillow:

2024-08-19 08:20:45,361 INFO interface Starting pillow <class 'pillowtop.pillow.interface.ConstructedPillow'>
2024-08-19 08:20:46,041 WARNING pillow UCR pillow has no configs to process
2024-08-19 08:20:46,045 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:46,297 INFO manager (case-pillow-cases-20230524-case-search-20230524-messaging-sync) setting checkpoint: {"case-sql,0": 6591638}
2024-08-19 08:20:46,317 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:46,530 INFO manager Heartbeat: {TopicPartition(topic='case-sql', partition=0): 6591648}
2024-08-19 08:20:46,540 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:46,761 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:47,001 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:47,227 INFO elastic Processing chunk of changes in BulkElasticProcessor
failed to send, dropping 100 traces to intake at http://localhost:8126/v0.5/traces after 3 retries
2024-08-19 08:20:47,263 ERROR [ddtrace.internal.writer.writer] failed to send, dropping 100 traces to intake at http://localhost:8126/v0.5/traces after 3 retries
2024-08-19 08:20:47,461 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:47,676 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:47,907 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:48,135 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:48,345 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:48,617 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:48,838 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:49,056 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:49,286 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:49,518 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:49,730 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:49,951 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:50,177 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:50,402 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:50,620 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:50,867 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:51,091 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:51,356 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:51,569 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:51,794 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:52,022 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:52,252 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:52,474 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:52,697 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:52,936 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:53,167 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:53,397 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:53,633 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:53,879 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:54,118 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:54,359 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:54,609 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:54,848 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:55,077 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:55,293 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:55,540 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:55,769 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:56,061 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:56,284 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:56,528 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:56,738 INFO manager Heartbeat: {TopicPartition(topic='case-sql', partition=0): 6592088}
2024-08-19 08:20:56,748 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:56,987 INFO elastic Processing chunk of changes in BulkElasticProcessor
2024-08-19 08:20:57,206 INFO elastic Processing chunk of changes in BulkElasticProcessor
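For reference, the two Heartbeat offsets in that burst right after the restart suggest the pillow does chew through changes while it's active. A quick sanity check on the numbers (my assumption being that these are Kafka offsets on the case-sql topic, partition 0):

```python
# Throughput estimate from the two "Heartbeat" lines in the log above.
first = 6591648    # heartbeat offset at 08:20:46
second = 6592088   # heartbeat offset at 08:20:56
processed = second - first
print(processed)   # 440 changes in roughly 10 seconds
```

So it's not idle while running; it just stops making progress at some point after the restart.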

Could this be related to the size of the app? Is there something else I should look at to keep the system processing forms and cases, or am I misunderstanding something about the process?

Thanks!

EDIT: I do wonder if I'm misunderstanding the process and/or these numbers. I can, for example, restart the service and reduce the number in brackets by about 10 each time, but it only gets down to single digits; it never reaches 0, no matter how many times I restart:

Apologies for bumping this, but I just want to understand whether it's normal for the case-pillow and xform-pillow statuses to sit at numbers like 58951 / 59460 (509) and 101795 / 102517 (722). I'm seeing this on another server too.
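My working assumption, which I'd love to have confirmed, is that these figures are Kafka offsets: processed (checkpoint) offset / latest offset, with the number in brackets being the difference, i.e. changes not yet processed. As a sketch of that interpretation:

```python
def outstanding(checkpoint_offset: int, latest_offset: int) -> int:
    # Assumed meaning of the bracketed number on the system info page:
    # changes in Kafka that the pillow has not yet processed.
    return latest_offset - checkpoint_offset

print(outstanding(58951, 59460))    # 509
print(outstanding(101795, 102517))  # 722
```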

Thanks!

Hi @erobinson,

just want to understand whether it's normal for the case-pillow and xform-pillow statuses to be sitting on numbers

Apologies for the delayed response. I don't think it's normal behaviour for the status to be stuck at some number.
In the meantime, as I look into this further, I wanted to ask: am I correct in assuming that, as an effect of this stalling, you are also seeing lag in reports like the Case List Explorer, i.e. newly created cases not appearing in the reports?

Yes, that seems to be the case, though after restarting the service and leaving it for some hours, both queues have now cleared. This morning it was stuck on a fairly large number of outstanding cases and forms (about 4,000-5,000). Restarting the xform and case queues flushed that number down into the hundreds, and leaving it for a few hours seems to have allowed the rest to clear.

Just a quick follow-up: is there a way to get the pillowtop queue status from a shell on the server without having to send the vault password, or an API endpoint to query it? Thanks!