Server resources required for Commcare HQ

I noticed this 6-year-old post regarding a server resource calculator for hosting CCHQ:

I realise it’s 6 years old, and wondered whether there is an updated version that takes into account the current architecture and tool set in CCHQ.

Also, if you were to stand up an instance to support 600–800 daily-use clients (21 days per month), how would you split the tool set across VM instances to provide some redundancy and maximize performance? What is the ideal way to approach it? Should I instead be looking at standing up independent microservice containers/VMs?


Hi Ed,

How the cluster needs to be split up will depend a bit on the usage pattern and complexity. There’s a resource model available on GitHub which you can use for sizing larger clusters, but I believe it presumes that you are isolating services onto their own VMs, and I think it’s generally used more at the “tens of thousands of users or more” scale, so the baseline resource sizing might be overly large. I think at a smaller scale the swings in utilization can be so big that it’s hard to do better than a heuristic guess based on which services are in use (e.g., if you are doing high-volume SMS, you will need more Celery workers).

In particular, for most environments the primary scale factor is form submissions. Do you know what your expected usage is there? 800 users submitting roughly 6 forms a day is about the same load as 200 users submitting 24 forms per day, so it’s the “forms per month” and “peak form submission / sync” factors that affect sizing. Whether UCR reporting is being used, and how complex the reports are, will also affect resource usage.
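For anyone following along, the forms-per-month arithmetic above can be sketched quickly. This is just an illustrative back-of-envelope helper (the function name and the 21-active-days default are taken from the numbers in this thread, not from any official CommCare HQ sizing tool):

```python
def monthly_forms(daily_users: int, forms_per_user_per_day: float,
                  active_days_per_month: int = 21) -> int:
    """Estimate total form submissions per month.

    Illustrative only: 21 active days/month matches the usage
    pattern described in the original question.
    """
    return int(daily_users * forms_per_user_per_day * active_days_per_month)

# 800 users at ~6 forms/day and 200 users at 24 forms/day
# produce the same monthly load:
print(monthly_forms(800, 6))   # 100800
print(monthly_forms(200, 24))  # 100800
```

The point being that two very different user counts can land on the same “forms per month” figure, which is why sizing from user count alone is misleading.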

We do generally recommend having a VM per service, although the overhead of that can be high for a small cluster. It’s especially helpful for services that want to saturate their resource usage, like Elasticsearch and Postgres. On the smaller clusters I’ve worked with, the first target for breaking out onto its own resource set has been Postgres, since having less contention on the core DB machine helps prevent cache churn, and the PG machines generally make good use of OS paging.

Last question: do you have monitoring set up? HQ supports metrics reporting and monitoring through both Datadog (for cloud deployments) and Prometheus (for on-prem), both of which provide excellent insight into resource utilization and saturation.



Great response, thank you @Clayton_Sims for taking the time. We would certainly implement monitoring to keep abreast of trends and bottlenecks.
I appreciate the thorough answer, I will give it some thought.

No worries, let me know if you have any other questions, and I definitely encourage you to let folks know what things look like once you’ve had a chance to experiment! It would be great for us to have access to more data about different cluster configurations.



I’ll echo what Clayton said - this is a question every org hosting CommCare has, and there isn’t an easy answer. Hearing from others’ experience is a decent way to get a jumping-off point. You’ll likely need to either be very judicious with your estimates or, ideally, be prepared to monitor resource usage and iterate on hardware as you gradually scale usage.

I’m sure others on this forum would be interested to hear your learnings.
