Slow Data Count in New Server Migration

andyasne · May 30, 2024, 2:15pm

Hi Team,

As part of our server migration process, we've implemented a new server setup and created a straightforward datasource whose sole function is to count the number of clients (both open and closed cases). We started the build for this datasource, but the progress is concerning. After three days, it has only managed to account for approximately 20% of the data, which totals around 23 million client cases.

The counter quickly adds a few thousand, then stops for several seconds, and starts again. It’s moving very slowly, which is unusual since it's just counting cases.

Currently, there are no other UCRs running that could be affecting its performance.

Issues:

Slow progress in data counting on the new server.
Only 20% of the data has been processed in three days, far below expectations.

Questions:

What could be causing the counting process to be this slow?
Are there specific logs or tools we can use to diagnose issues on the new server?
Can anyone suggest configurations or optimizations that might improve the performance of our data processing?

Server Specs
Celery:
5 VMs (1 shared with pillow) - 16vCPUs and 32GB RAM each
Pillows:
3 VMs (1 shared with celery) - 16vCPUs and 32GB RAM each
Databases:
Main DB VM - 32vCPUs and 128GB RAM with SSD disk
UCR DB VM - 32vCPUs and 128GB RAM with SSD disk

UCR Datasource Definition

Here are the Logs
https://drive.google.com/drive/folders/1HLH1tlhAXlmGXut7rBDZC7PzAYyXH1jv?usp=sharing

Thank you!

mkangia · May 31, 2024, 1:29pm

Hi,

Just responding with some observations first.

The counter quickly adds a few thousand, then stops for several seconds, and starts again

This is expected to happen because the cases are processed in chunks. So, some data is picked for processing, dumped to the table and then the next batch is picked up. Hence, you might see a pause in between.

It’s moving very slowly, which is unusual since it's just counting cases.

Can you share why you say it's just counting cases?
A datasource will process every case for all columns/indicators configured and save that as a new row in the datasource table. So, it's iterating through all cases and processing them.

Also, if you just need the count of cases, that should be available in Case List, given that you have already indexed all the data into Elasticsearch after the migration to the new server. In Case List, you can see the total number of cases in the summary just under the table showing the results.

As for the data processing,
23 million is a lot of cases and will take time to process.
Please note that currently this is happening sequentially so the server resources won't make a difference.

If you need this faster, you can try an alternative approach:

Create a new data source with the same configuration as this one and select the option "Asynchronous processing" on the data source before saving.
After you save it, click on "Preview Data" to see data being populated in this new data source. Initially, you might see an error about the table being missing. You can get past that and force-trigger data processing by clicking on "Rebuild Data Source" on the data source page.
Once this data source starts populating, you should see data coming in faster than the original data source you built. Asynchronous data sources are processed in chunks in parallel via a different UCR queue called "ucr_indicator_queue".
At this point, when you can see data populating in the new data source much faster than the one created before, you can delete the data source you created originally.

andyasne · June 3, 2024, 7:22am

Hi @mkangia ,

Thank you for your response.

The datasource definition we created has a single indicator. It is designed to go through all client cases in our system and add an entry of 1 to the UCR Table. The test report using this datasource will then aggregate this data and display the number of client cases.

We also have a large report with around 600 indicators that we plan to migrate to the new server. Before doing so, we wanted to test the new server's capability to handle a basic report, such as counting the number of family cases and client cases in our system. Our observation is that the datasource cannot process all the cases in a reasonable time to complete the client counter. If we start this large datasource, which accesses not only client cases but also 26 other case types and includes many complex indicators, the build will not finish in time.

We have developed and tested many reports on the cloud, which have worked great. Since the cloud does not contain much data, the builds were able to finish, and we saw accurate data in the reports.

Asynchronous processing is already selected for this datasource. It has been a week since this datasource started building, and it has completed only 50% of the client counts.

mkangia · June 3, 2024, 9:14am

Hi,

The test report using this datasource will then aggregate this data and display the number of client cases.

I'd recommend using / switching to existing data sources to fetch the count instead of having a data source only for this.

Asynchronous processing is already selected for this datasource

Can you confirm if you still see it checked for the datasource when you view it now?

We also have a large report with around 600 indicators that we plan to migrate to the new server.

Another alternative is to trigger an async rebuild manually on the server.
That can be done using the following management command:

https://github.com/dimagi/commcare-hq/blob/master/corehq/apps/userreports/management/commands/async_rebuild_table.py

import argparse
from collections import namedtuple

from django.core.management.base import BaseCommand

from corehq.apps.userreports.models import (
    AsyncIndicator,
    get_datasource_config,
)
from corehq.apps.userreports.util import get_indicator_adapter
from corehq.form_processor.models import CommCareCase, XFormInstance
from corehq.util.argparse_types import date_type

FakeChange = namedtuple('FakeChange', ['id', 'document'])
CASE_DOC_TYPE = 'CommCareCase'
XFORM_DOC_TYPE = 'XFormInstance'


class Command(BaseCommand):
    help = """

This file has been truncated.

andyasne · June 3, 2024, 9:29am

Hi @mkangia

I'd recommend using / switching to existing data sources to fetch the count instead of having a data source only for this.

The server is a new server. All cases and forms are migrated except UCR tables and UCR datasource definitions. We then created this new datasource that counts the clients.

Asynchronous processing was ON before starting the build (I rechecked this). Please see the screenshot.

andyasne · June 3, 2024, 9:44am

@mkangia

Upon further investigation, I noticed that the data source pauses for exactly 340 seconds (around 5 minutes), increments by around 7,201 rows, and then repeats this pattern.
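To put the observed rhythm in perspective, here is a rough back-of-the-envelope sketch. It assumes the pattern described above (about 7,201 rows every 340 seconds) stays constant for the whole build, which may not hold in practice; the function name is made up for illustration.

```python
def estimate_build_days(total_cases, cases_per_cycle, cycle_seconds, done_fraction=0.0):
    """Days left to build, assuming the observed batch rhythm stays constant."""
    remaining = total_cases * (1.0 - done_fraction)
    cases_per_second = cases_per_cycle / cycle_seconds
    return remaining / cases_per_second / 86_400  # 86,400 seconds per day


# Figures from this thread: ~23 million cases, ~7,201 rows every ~340 seconds.
full_build = estimate_build_days(23_000_000, 7_201, 340)
print(f"Full build at the observed rate: about {full_build:.1f} days")
```

At that rate a full build would take a bit under two weeks, which is in the same range as the "50% in a week" figure reported earlier in the thread.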

I have a couple of questions:

This is expected to happen because the cases are processed in chunks. So, some data is picked for processing, dumped to the table and then the next batch is picked up. Hence, you might see a pause in between.

1. Is there a way to increase this batch processing size and decrease the waiting time?

2. Does the server require any specific internal configuration to enable asynchronous builds?

3. Is there a difference between rebuilding the data source using the "Rebuild Datasource" button in the UI and using the "async_rebuild_table" command?

4. Should we use the "async_rebuild_table" command to rebuild our data source and monitor its progress?

mkangia · June 3, 2024, 5:05pm

Hi,

The server is a new server. All cases and forms are migrated except UCR tables and UCR datasource definitions. We then created this new datasource that counts the clients.

Okay. If you plan to build any other data source for the same case type with indicators, it would be beneficial to use that one for counts instead of managing one data source just for counts.

Asynchronous processing was ON before starting the build (I rechecked this)

Okay.

Is there a way to increase this batch processing size and decrease the waiting time?

This is the task that should be queuing the cases for processing.

https://github.com/dimagi/commcare-hq/blob/ff9e56acf5e6e28d8949b1e6557b738be623c332/corehq/apps/userreports/tasks.py#L244

        hours_for_today = time_dictionary.get('*')

    for valid_hours in hours_for_today:
        if valid_hours[0] <= time.hour <= valid_hours[1]:
            return True

    return False


@serial_task('queue-async-indicators', timeout=30 * 60, queue=settings.CELERY_PERIODIC_QUEUE, max_retries=0)
def queue_async_indicators():
    """
        Fetches AsyncIndicators that
        1. were not queued till now or were last queued more than 4 hours ago
        2. have failed less than ASYNC_INDICATOR_MAX_RETRIES times
        This task quits after it has run for more than
        ASYNC_INDICATOR_QUEUE_TIME - 30 seconds i.e 4 minutes 30 seconds.
        While it runs, it clubs fetched AsyncIndicators by domain and doc type and queue them for processing.
    """
    start = datetime.utcnow()
    cutoff = start + ASYNC_INDICATOR_QUEUE_TIME - timedelta(seconds=30)

You can modify the following to see if that helps.

ASYNC_INDICATOR_QUEUE_TIME:
Currently it's 5 minutes. You can set it higher, say, 30 minutes, and observe whether it helps and does not impact other operations on the site. Or else set it very high to avoid any wait at all.
The relevance of that time setting is explained in the function docstring I shared above.
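To illustrate why the time box matters, here is a minimal simulation. It is a simplified model and not the CommCare HQ implementation: the function name, the batch size, and the seconds spent per batch are all hypothetical. Only the idea of stopping at `queue_time - 30 seconds` is taken from the docstring quoted above.

```python
from datetime import timedelta


def simulate_queue_run(pending, batch_size, seconds_per_batch, queue_time,
                       safety_margin=timedelta(seconds=30)):
    """Return how many items one time-boxed run manages to queue.

    Simplified model (NOT the CommCare HQ code): the run hands out fixed-size
    batches and stops once the simulated clock reaches
    queue_time - safety_margin, mirroring the `cutoff` in the quoted task.
    """
    cutoff = (queue_time - safety_margin).total_seconds()
    elapsed = 0.0
    queued = 0
    while queued < pending and elapsed < cutoff:
        batch = min(batch_size, pending - queued)
        queued += batch
        elapsed += seconds_per_batch
    return queued


# Hypothetical numbers: 1,000-item batches that each take 30 s to hand off.
five_min = simulate_queue_run(1_000_000, 1_000, 30, timedelta(minutes=5))
thirty_min = simulate_queue_run(1_000_000, 1_000, 30, timedelta(minutes=30))
print(five_min, thirty_min)
```

In this model a longer time box lets a single run queue more records before it quits, which is the effect the suggestion above is aiming for. Whether the real task behaves the same way depends on factors the model ignores, such as how often the task is scheduled.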

2. Does the server require any specific internal configuration to enable asynchronous builds?

You just need to ensure that you have ucr_indicator_queue and background celery queues.

3. Is there a difference between rebuilding the data source using the "Rebuild Datasource" button in the UI and using the "async_rebuild_table" command?

Depending on different factors in the setup, async rebuild command could proceed faster than triggering rebuild from UI, though both processes are using the same async mechanism to process cases.
You could also add more celery processes for ucr_indicator_queue to process cases quicker.

4. Should we use the "async_rebuild_table" command to rebuild our data source and monitor its progress?

Are you able to first check how many records you have of the model AsyncIndicator?
In a django shell you can run

from corehq.apps.userreports.models import AsyncIndicator
AsyncIndicator.objects.count()
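If it helps, the count can be sampled a few times to see how fast the backlog is draining. This is only a sketch: `sample_backlog` is a made-up helper, and the stand-in counter below exists so the snippet runs anywhere. In a Django shell on the server you could pass `AsyncIndicator.objects.count` as `get_count` instead.

```python
import time


def sample_backlog(get_count, samples=3, interval_seconds=60, sleep=time.sleep):
    """Sample a backlog counter and return (counts, average drain per second)."""
    counts = [get_count()]
    for _ in range(samples - 1):
        sleep(interval_seconds)
        counts.append(get_count())
    elapsed = interval_seconds * (len(counts) - 1)
    drain_per_second = (counts[0] - counts[-1]) / elapsed if elapsed else 0.0
    return counts, drain_per_second


# Stand-in counter with invented values; the no-op sleep keeps the demo instant.
fake_counts = iter([100_000, 98_000, 96_000])
counts, rate = sample_backlog(lambda: next(fake_counts), sleep=lambda s: None)
print(counts, f"{rate:.1f} records/s")
```

A drain rate that stays flat while more celery processes are added would suggest the bottleneck is somewhere other than the workers.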

andyasne · June 4, 2024, 7:40am

Thank you very much for this, @mkangia. We will implement these changes and monitor the performance again.

One last question:

Regarding the ucr_indicator_queue, we will increase the number of processes. To determine the optimal number of processes our resources can support, is there a method or calculation we can do, considering the resources previously mentioned?

Thanks

mkangia · June 4, 2024, 8:04am

Hi,

We will implement these changes and monitor the performance again.

Great!

Regarding the ucr_indicator_queue, we will increase the number of processes. To determine the optimal number of processes our resources can support, is there a method or calculation we can do, considering the resources previously mentioned?

You can increase them gradually as well. They don't need to be fully optimized before running the management command. If you see that there is room for more, you can add them even after triggering the command.
The command would just queue all the records to be processed so it will finish much sooner. The celery processes will process these and they can be increased if you need them to process faster.
Just a note that ucr_indicator_queue would then stay idle when not in use for such async operations, so good to reduce the resources back once you have finished the data source rebuilds.

Hope this was useful.
Good to keep an eye on AsyncIndicators count and tasks being processed by celery if you have insights, along with the data source rows like you have been already.

Regards.

andyasne · June 13, 2024, 9:03pm

Hi @mkangia ,

We modified the configuration, which increased build throughput from approximately 7,200 records per 5 minutes to 10,000 records per 5 minutes. Despite adding more servers and increasing the ucr_indicator_queue processes, the throughput remains capped at exactly 10,000 records. Thinking ASYNC_INDICATORS_TO_QUEUE might be the reason, we increased the value from 10,000 to 100,000, but there was no change in the rebuild rate. It appears there may be a limit set somewhere in the system. Is there a way to increase the build rate beyond 10,000 records per 5 minutes?
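As a way of reasoning about this cap, here is a simplified bottleneck model. It is not a diagnosis of the server: `effective_rate` is a made-up helper, and the per-worker rate of 5 records per second is an invented figure. Only the 10,000 records per 5-minute cycle comes from our observation.

```python
def effective_rate(queue_cap, queue_period_s, workers, per_worker_rate):
    """Records/s when a periodic queuer feeds a pool of workers (simplified)."""
    queue_rate = queue_cap / queue_period_s   # what the queuer can hand out
    worker_rate = workers * per_worker_rate   # what the workers could absorb
    return min(queue_rate, worker_rate)


# 10,000 records per 300-second cycle is the observed cap from this thread;
# 5 records/s per worker is a hypothetical figure for illustration only.
for workers in (4, 8, 16):
    print(workers, round(effective_rate(10_000, 300, workers, 5.0), 1))
```

In this model, once the workers can absorb more than the queuer hands out per cycle, adding workers no longer changes the result, which would match a throughput that stays flat as servers are added.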

Thank you

ayadav · July 1, 2024, 5:55am

Hey @andyasne

My apologies for the delayed response. It seems like you have taken appropriate steps as per Manish's suggestion for increasing the performance.

Just to confirm that the configurations were set successfully, may I ask for a couple of things?

After updating the configurations, what command did you use to redeploy? I just want to be sure that the celery processes have picked up the updated configurations.
I see that you shared the heartbeat logs earlier. Could you please share the celery logs below instead:

ucr_indicator_queue
celery_periodic

sirajhassan · July 1, 2024, 12:11pm

Hi @ayadav,

The cchq env update-config and cchq env update-supervisor-confs commands were used to apply the changes.

Here is the link to the log files.

ayadav · July 3, 2024, 9:00am

Thanks @sirajhassan for sharing the requested info. The commands that were run look fine.
However, the services need to be restarted for the new settings to take effect. May I suggest running the two commands below, if not already done?

cchq env service commcare restart
cchq env service celery restart

Please let us know if this fixes the issue.