The Postgresql server's machine is busy during working hours(especially 9:AM to 10:30 AM UTC-3). it's consuming High IO

demis08 · June 12, 2019, 4:15pm

The Postgresql Database server is busy at from 9:AM- 10:30AM. it's consumming High IO and even unable to ssh to the machine.
i've checked some of the possible causes:

check any scheduled jobs.
may be it's peak hour,
check what processes are consuming io, when you see high io usage on datadog. use: sudo iotop
I 've seen a process called postgres: autovacuum~ process a lot of Input/output process

Is there a mechanism to schedule autovacuum process at night or any other way to stop running this process during working hours?

Jon_Emord · June 12, 2019, 5:30pm

Hi Demisew,

We currently use postgres's default of autovacuum = on in commcare-cloud. This means that postgres schedules its own vacuum processes without any input from the user.

PostgreSQL: Documentation: 9.6: Automatic Vacuuming details some configuration that can be used to change/tune the autovacuum process

You could also schedule vacuum's via a cron job with vacuumdb. Some more general information on vacuuming in postgres is documented here: PostgreSQL: Documentation: 9.6: Routine Vacuuming

demis08 · June 14, 2019, 3:42pm

thank you Jon, the documentation is really helpful

erobinson · July 10, 2019, 6:17am

This may not be relevant in this case, but we had a similar issue on an older server running PostgreSQL 9.x under Ubuntu 18.04. The issue related to hardware compatibility and we had to switch to CentOS which was certified for use with the branded HW - it was an older box. We spent many hours troubleshooting and ultimately came to the conclusion that the HW vendor just did not support Ubuntu 18.0x with drivers for the disk channel system which resulted in inefficient processing without HW interrupts.

Jon_Emord · July 10, 2019, 2:31pm

Hi Ed,

Do you happen to know how your team was able to finally come to that conclusion? Were there any system logs or stats that you found that pointed to the number of HW interrupts being unreasonably high?

erobinson · July 15, 2019, 8:01pm

It was some time ago, but in a nutshell, the server was not certified for use with Ubuntu and switching to Centos resolved the issue. The kernel driver assumption relates to prior experience I've had with poorly managed interrupts on older systems, particularly the disk channel. On the system in question, a 600Mb DB PostgreSQL DB restore was still not complete after 12 hours. After deploying CentOS 7, with the same PostgreSQL configuration, the restore took minutes.