As part of our server migration process, we've implemented a new server setup and created a straightforward datasource whose sole function is to count the number of clients (both open and closed cases). We started the build for this datasource, but the progress is concerning. After three days, it has only managed to account for approximately 20% of the data, which totals around 23 million client cases.
The counter quickly adds a few thousand, then stops for several seconds, and starts again. It’s moving very slowly, which is unusual since it's just counting cases.
Currently, there are no other UCRs running that could be affecting its performance.
Issues:
Slow progress in data counting on the new server.
Only 20% of the data has been processed in three days, far below expectations.
Questions:
What could be causing the counting process to be this slow?
Are there specific logs or tools we can use to diagnose issues on the new server?
Can anyone suggest configurations or optimizations that might improve the performance of our data processing?
Server Specs
Celery:
5 VMs (1 shared with pillow) - 16vCPUs and 32GB RAM each
Pillows:
3 VMs (1 shared with celery) - 16vCPUs and 32GB RAM each
Databases:
Main DB VM - 32vCPUs and 128GB RAM with SSD disk
UCR DB VM - 32vCPUs and 128GB RAM with SSD disk
The counter quickly adds a few thousand, then stops for several seconds, and starts again
This is expected to happen because the cases are processed in chunks. So, some data is picked for processing, dumped to the table and then the next batch is picked up. Hence, you might see a pause in between.
It’s moving very slowly, which is unusual since it's just counting cases.
Can you share why you say its just counting cases?
A datasource will process every case for all columns/indicators configured and save that as a new row in the datasource table. So, its iterating through all cases and processing them.
Also, if you just need the count of cases, that should be available in Case List, given that you have already indexed all the data into Elasticsearch after the migration to the new server. In Case List, you can see the total number of cases in the summary just under the table showing the results.
As for the data processing,
23 million is a lot of cases and will take time to process.
Please note that currently this is happening sequentially so the server resources won't make a difference.
If you need this faster you can try an alternate approach,
Create a new data source with the same configuration as this one and select the option "Asynchronous processing" on the data source before saving.
After you save it, check on "Preview Data" to see data being populated in this new data source. Initially, You might see an error about table being missing. You can get past that and force trigger data processing by clicking on "Rebuild Data Source" on the data source page.
Once this data source starts populating, you should see data coming in faster than the original data source you built. Asynchronous data sources are processed in chunks in parallel via a different UCR queue called "ucr_indicator_queue".
At this point, when you can see data populating in the new data source much faster than the one created before, you can delete the data source you created originally.
The datasource definition we created has a single indicator. It is designed to go through all client cases in our system and add an entry of 1 to the UCR Table. The test report using this datasource will then aggregate this data and display the number of client cases.
We also have a large report with around 600 indicators that we plan to migrate to the new server. Before doing so, we wanted to test the new server's capability to handle a basic report, such as counting the number of family cases and client cases in our system. Our observation is that the datasource cannot process all the cases in a reasonable time to complete the client counter. Starting this large datasource, which accesses not only client cases but also 26 other case types, and includes many complex indicators, will not finish the build in time.
We have developed and tested many reports on the cloud, which have worked great. Since the cloud does not contain much data, the builds were able to finish, and we saw accurate data in the reports.
Asynchronous processing is already selected for this datasource. It has been a week since this datasource started building, and it has completed only 50% of the client counts.
I'd recommend using / switching to existing data sources to fetch the count instead of having a data source only for this.
The server is a new server. All cases and forms are migrated except UCR tables and UCR datsource definitions. We then created this new datasource that count the client.
Asynchronous processing was ON before starting the Build( I rechecked this), Please see the screenshoot
Upon further investigation, I noticed that the data source pauses for exactly 340 seconds( around 5 mins), increments by around 7,201, and then repeats this pattern.
I have a couple of questions:
This is expected to happen because the cases are processed in chunks. So, some data is picked for processing, dumped to the table and then the next batch is picked up. Hence, you might see a pause in between.
Is there a way to increase this batch processing size and decrease the waiting time?
2. Does the server require any specific internal configuration to enable asynchronous builds?
3.Is there a difference between rebuilding the data source using the "Rebuild Datasource" button in the UI and using the "async_rebuild_table" command?
4.Should we use the "async_rebuild_table" command to rebuild our data source and monitor its progress?
The server is a new server. All cases and forms are migrated except UCR tables and UCR datsource definitions. We then created this new datasource that count the client.
Okay. If you plan to build any other data source for the same case type with indicators, it would be beneficial to use that one for counts instead of managing one data source just for counts.
Asynchronous processing was ON before starting the Build( I rechecked this)
Okay.
Is there a way to increase this batch processing size and decrease the waiting time?
This is the task that should be executing the cases.
You can modify the following to see if that helps.
ASYNC_INDICATOR_QUEUE_TIME:
Currently its 5 minutes. You can see it higher, say, 30 minutes, observe if it helps and does not impact other operations on the site. Or else set it very high to avoid any wait at all.
The relevance of that time set is shared in function documentation I shared above.
2. Does the server require any specific internal configuration to enable asynchronous builds?
You just need to ensure that you have ucr_indicator_queue and background celery queues.
3.Is there a difference between rebuilding the data source using the "Rebuild Datasource" button in the UI and using the "async_rebuild_table" command?
Depending on different factors in the setup, async rebuild command could proceed faster than triggering rebuild from UI, though both processes are using the same async mechanism to process cases.
You could also add more celery processes for ucr_indicator_queue to process cases quicker.
4.Should we use the "async_rebuild_table" command to rebuild our data source and monitor its progress?
Are you able to first check how many records you have of the model AsyncIndicator?
In a django shell you can run
from corehq.apps.userreports.models import AsyncIndicator
AsyncIndicator.objects.count()
Thank you very much for this @mkangia , We will implement these changes and monitor the performance again.
One last question:
Regarding the ucr_indicator_queue, we will increase the number of processes. To determine the optimal number of processes our resources can support, is there a method or calculation we can do, considering the resources previously mentioned?
We will implement these changes and monitor the performance again.
Great!
Regarding the ucr_indicator_queue, we will increase the number of processes. To determine the optimal number of processes our resources can support, is there a method or calculation we can do, considering the resources previously mentioned?
You can increase them gradually as well. They don't need to be fully optimized before running the management command. If you see that there is room for more, you can add them even after triggering the command.
The command would just queue all the records to be processed so it will finish much sooner. The celery processes will process these and they can be increased if you need them to process faster.
Just a note that ucr_indicator_queue would then stay idle when not in use for such async operations, so good to reduce the resources back once you have finished the data source rebuilds.
Hope this was useful.
Good to keep an eye on AsyncIndicators count and tasks being processed by celery if you have insights, along with the data source rows like you have been already.
We modified the configuration, resulting in an increase in build throughput from approximately 7,200 records per 5 minutes to 10,000 records. Despite adding more servers and increasing the ucr_indicator_queue, the throughput remains capped at exactly 10,000 records. Thinking ASYNC_INDICATORS_TO_QUEUE might be the reason, we increased the value from10,000 to 100,000 but there is no change on rebuild. It appears there may be a limit set somewhere in the system. Is there a way to increase the build rate beyond 10,000 records per 5 minutes?
My apologies for delayed response. It seems like you have taken appropriate steps as per Manish's suggestion for increasing the performance.
Just to confirm that the configurations were set successfully, may I ask for a couple of things ?
After updating the configurations, what command did you use to redeploy ? Just wanted to be sure that celery processes have picked up the updated configurations.
I see that you have earlier shared the heartbeat logs. Could you please share the below celery logs instead
Thanks @sirajhassan for sharing the requested info, the commands that were run looks fine.
However it does require restart of services for new settings to take effect. May I suggest you to run below two commands if not already done ?
cchq env service commcare restart
cchq env service celery restart