Unable to upgrade CCHQ to a4b2a47 from b2b0383

Hi, I'm having difficulty getting a monolith server on b2b0383 (deployed for changelog 0080) to the latest verified production build (a4b2a47 as at the time of this post).

Deployment of b2b0383 goes fine, as does changelog 80's command:
commcare-cloud monolith django-manage --tmux copy_invitation_supply_point
the output for that command is:

Started at 2024-07-09 07:41:59
Processing [.........] 9/9 100% 0:00:00.028682 elapsed
Finished at 2024-07-09 07:41:59
Elapsed time: 0:00:00
Connection to 10.2.0.4 closed.

On first attempt at deploying a4b2a47 I receive the following:

(cchq) ccc@monolith:~/commcare-cloud$ cchq monolith deploy --commcare-rev a4b2a47
Whoa there bud! You're deploying non-default.
'commcare' repo: master != a4b2a47
ARE YOU DOING SOMETHING EXCEPTIONAL THAT WARRANTS THIS? [y/N]y
ansible 'django_manage[0]' -m shell -i /home/ccc/environments/monolith/inventory.ini -a 'sudo -iu cchq bash -c '"'"'git --git-dir=/home/cchq/www/monolith/current/.git rev-parse HEAD'"'"'' -u ansible '--ssh-common-args=-o UserKnownHostsFile=/home/ccc/environments/monolith/known_hosts' --diff
Diff generation skipped. Supply a Github token to see deploy diffs.
New version details:
    Branch deployed     : commcare: a4b2a47

Changelogs:
    There have been some changelogs since last deploy, you must make sure that these are applied on your environment before deploying to avoid getting your environment into a broken state.

https://commcare-cloud.readthedocs.io/en/latest/changelog/0081-delete-receiverwrapper-couch-db.html
https://commcare-cloud.readthedocs.io/en/latest/changelog/0080-copy-invitation-supply-point-fields-to-location.html


Here's the complete diff on github: https://github.com/dimagi/commcare-hq/compare/b2b0383598d89238466cfebc384b0c6e142961bc...a4b2a470622237221f2a4bbf7e776d96e0cfb228
Are you sure you want to deploy to monolith? [y/N]y
...
TASK [Run run_migrations] *************************************************************************************************************************************************************

TASK [deploy_hq : Migrate databases] **************************************************************************************************************************************************
failed: [10.2.0.4] (item=migrate_multi --noinput) => {"ansible_loop_var": "item", "changed": true, "cmd": ["./manage.py", "migrate_multi", "--noinput"], "delta": "0:00:58.053610", "end": "2024-07-09 05:14:28.507159", "item": "migrate_multi --noinput", "msg": "non-zero return code", "rc": 1, "start": "2024-07-09 05:13:30.453549", "stderr": 
...
state = self.apply_migration(", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/python_env/lib/python3.9/site-packages/django/db/migrations/executor.py\", line 252, in apply_migration", "    state = migration.apply(state, schema_editor)", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/python_env/lib/python3.9/site-packages/django/db/migrations/migration.py\", line 132, in apply", "    operation.database_forwards(", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/python_env/lib/python3.9/site-packages/django/db/migrations/operations/special.py\", line 193, in database_forwards", "    self.code(from_state.apps, schema_editor)", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/corehq/apps/es/migration_operations.py\", line 368, in run", "    return super().run(*args, **kwargs)", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/corehq/apps/es/migration_operations.py\", line 91, in run", "    manager.index_create(self.name, self.render_index_metadata(", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/corehq/apps/es/client.py\", line 219, in index_create", "    self._es.indices.create(index, metadata)", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/python_env/lib/python3.9/site-packages/elasticsearch5/client/utils.py\", line 73, in _wrapped", "    return func(*args, params=params, **kwargs)", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/python_env/lib/python3.9/site-packages/elasticsearch5/client/indices.py\", line 106, in create", "    return self.transport.perform_request('PUT', _make_path(index),", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/python_env/lib/python3.9/site-packages/elasticsearch5/transport.py\", line 312, in perform_request", "    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)", "  File \"/home/cchq/www/monolith/releases/2024-07-09_05.02/python_env/lib/python3.9/site-packages/elasticsearch5/connection/http_urllib3.py\", line 143, in perform_request", "    raise ConnectionTimeout('TIMEOUT', str(e), e)", "elasticsearch5.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='10.2.0.4', port=9200): Read timed out. (read timeout=30))"]}

Full error log here: TASK [deploy_hq : Migrate databases] ******************************************* - Pastebin.com

At this point, the es cluster health status is red. A restart of ES resolves that. I then attempt to resume the deploy:
cchq monolith deploy --resume=2024-07-09_05.02
This results in the following error: TASK [deploy_hq : Migrate databases] ******************************************* - Pastebin.com

From that, I gather that one of the migrations required an older version of the code base to run so I checked out 06f5905 and ran ./manage.py migrate_multi repeaters in a cchq python venv:

sudo su - cchq
cd www/monolith/current
git checkout 06f59059cef7321849e0ea9d8e15b7e824d3e26f
source python_env/bin/activate
./manage.py migrate_multi repeaters

I got this output:

The following databases will be migrated:
 * default
 * p1
 * p2
 * proxy
 * synclogs


The following databases will be skipped:
 * ucr

Found 0 RepeatRecord documents to migrate.

I then tried resuming the previous deploy with:
cchq monolith deploy --resume=2024-07-09_05.02
and got this error: failed: [10.2.0.4] (item=migrate_multi --noinput) => {"ansible_loop_var": "item" - Pastebin.com
That seems to indicate it can't continue because changelog 0080 has not been applied if I'm reading it correctly? I definitely ran 0080 and tried again to be sure and got the same output as before. I then tried resuming the failed deploy again but got the previous error about needing to do changelog 0080 first.
This part of the error seemed relevant:

raise FieldError(", "django.core.exceptions.FieldError: Cannot resolve keyword 'location' into field. 
Choices are: assigned_locations, asyncsignuprequest, custom_user_data, domain, email, email_status, invited_by, invited_on, is_accepted, primary_location, primary_location_id, profile, profile_id, program, role, supply_point, tableau_group_ids, tableau_role, uuid"],`

Any assistance would be greatly appreciated!

@aphulera

Hi @erobinson, can you tell us a little more about what happened before this? Changelog 0080 is from April, and I thought you'd been able to get on latest master more recently than that, as discussed in this thread:

What is the latest deployed commit? Is this the same server environment? I know there was some trouble there, so I wonder if the deploy wasn't recorded sucessfully, causing that changelog to be displayed even if you didn't need to run it.

These three things all refer to the same operation:

You should be able to see whether that's already been applied by running

commcare-cloud monolith django-manage --tmux showmigrations users

Hi Ethan, apologies for not being clearer. This is another server that we manage. It was a little behind for various reasons. The latest deployed commit is b2b0383 as per the opening post.

commcare-cloud monolith django-manage --tmux showmigrations users outputs the following:

 [X] 0017_merge_20200608_1401
 [X] 0018_auto_20200619_1626
 [X] 0019_editreportspermissions
 [X] 0020_user_staging_pk_to_bigint
 [X] 0021_invitation_email_status
 [X] 0021_add_view_apps_permission
 [X] 0022_merge_20200814_2045
 [X] 0023_hqapikey_role_id
 [X] 0024_add_login_as_all_users
 [X] 0025_hqapikey_domain
 [X] 0026_sql_role_permissions
 [X] 0027_role_permission_unique
 [X] 0028_remove_domainpermissionsmirror
 [X] 0028_migrate_roles_to_sql
 [X] 0029_merge_20210611_1436
 [X] 0029_userhistory
 [X] 0030_userhistory_user_upload_record
 [X] 0031_delete_domainpermissionsmirror
 [X] 0032_add_download_reports_permission
 [X] 0033_rename_sqluserrole
 [X] 0034_auto_20210813_1121
 [X] 0035_auto_20210826_1155
 [X] 0036_reset_user_history_records
 [X] 0037_add_edit_messaging_permission
 [X] 0038_auto_20210930_1235
 [X] 0038_add_ucr_permissions
 [X] 0039_merge_20211007_0556
 [X] 0040_add_repr_and_changed_by_repr
 [X] 0041_migrate_reprs
 [X] 0042_deactivatemobileworkertrigger
 [X] 0043_add_release_management_permission
 [X] 0044_userrole_is_commcare_user_default
 [X] 0045_add_view_tableau_permission
 [X] 0046_add_default_mobile_worker_role
 [X] 0047_rename_sqlpermission_permission
 [X] 0048_expiring_keys
 [X] 0049_add_data_dict_permissions
 [X] 0050_add_manage_attendance_tracking_permission
 [X] 0051_add_attendance_tracking_privilege
 [X] 0052_hqapikey_last_used
 [X] 0053_userreportingmetadatastaging_fcm_token
 [X] 0054_connectiduserlink
 [X] 0055_add_user_data
 [X] 0056_add_manage_domain_alerts_permission
 [X] 0057_populate_sql_user_data
 [X] 0058_createprofileanduserdata_copylocation

0059 is missing from the list and running:
commcare-cloud monolith django-manage --tmux copy_invitation_supply_point
Outputs the following:

Started at 2024-07-09 14:28:02
Processing [.........] 9/9 100% 0:00:00.030965 elapsed
Finished at 2024-07-09 14:28:02
Elapsed time: 0:00:00

commcare-cloud monolith django-manage --tmux showmigrations users right after the previous command outputs the same (0059 is not there).

Thanks!

Thank you, that makes it a bit clearer. Have you tried running the deploy again since? That management command you ran as per changelog 80 is supposed to make user.0059 get skipped, so I was initially perplexed by the output. I do see however that there's a one-hour cache on the migration status check, which might explain this, though it should be cleared automatically when the status is changed. Regardless, it should have lapsed by now, so a second run should see that migration be a noop.

If not, you might try running this in a django shell:

from corehq.apps.domain_migration_flags.api import get_migration_complete
get_migration_complete("*", "users_invitation_copy_supply_point_to_location_field")

That should return True, indicating that the migration is complete

1 Like

Hey Ethan, thanks for the quick response.
I'm getting False on the above unfortunately

Just to add, I did a quick ls of the current deployment users app directory and 0059 is not listed:

(cchq) ccc@monolith:~/commcare-cloud$ ls /home/cchq/www/monolith/current/corehq/apps/users/migrations/ -l
total 260
-rw-rw-r-- 1 cchq cchq 173 Jul 9 04:17 0001_add_location_permission.py
-rw-rw-r-- 1 cchq cchq 787 Jul 9 04:17 0002_domainrequest.py
-rw-rw-r-- 1 cchq cchq 425 Jul 9 04:17 0003_roles_permissions_update.py
-rw-rw-r-- 1 cchq cchq 1377 Jul 9 04:17 0004_rm_role_id_from_admins.py
-rw-rw-r-- 1 cchq cchq 1303 Jul 9 04:17 0005_add_user_reporting_metadata.py
-rw-rw-r-- 1 cchq cchq 1391 Jul 9 04:17 0006_auto_20191225_1207.py
-rw-rw-r-- 1 cchq cchq 478 Jul 9 04:17 0007_userreportingmetadatastaging_created_on.py
-rw-rw-r-- 1 cchq cchq 489 Jul 9 04:17 0008_auto_20200129_1852.py
-rw-rw-r-- 1 cchq cchq 1371 Jul 9 04:17 0009_sqlinvitation.py
-rw-rw-r-- 1 cchq cchq 189 Jul 9 04:17 0010_populate_usersinvitation.py
-rw-rw-r-- 1 cchq cchq 612 Jul 9 04:17 0011_sqlinvitation_uuid.py
-rw-rw-r-- 1 cchq cchq 351 Jul 9 04:17 0012_alter_usersinvitation.py
-rw-rw-r-- 1 cchq cchq 510 Jul 9 04:17 0013_remove_sqlinvitation_ids.py
-rw-rw-r-- 1 cchq cchq 473 Jul 9 04:17 0014_rename_usersinvitation.py
-rw-rw-r-- 1 cchq cchq 706 Jul 9 04:17 0015_domainpermissionsmirror.py
-rw-rw-r-- 1 cchq cchq 2066 Jul 9 04:17 0016_hqapikey.py
-rw-rw-r-- 1 cchq cchq 410 Jul 9 04:17 0016_webappspermissions.py
-rw-rw-r-- 1 cchq cchq 325 Jul 9 04:17 0017_merge_20200608_1401.py
-rw-rw-r-- 1 cchq cchq 523 Jul 9 04:17 0018_auto_20200619_1626.py
-rw-rw-r-- 1 cchq cchq 405 Jul 9 04:17 0019_editreportspermissions.py
-rw-rw-r-- 1 cchq cchq 430 Jul 9 04:17 0020_user_staging_pk_to_bigint.py
-rw-rw-r-- 1 cchq cchq 412 Jul 9 04:17 0021_add_view_apps_permission.py
-rw-rw-r-- 1 cchq cchq 412 Jul 9 04:17 0021_invitation_email_status.py
-rw-rw-r-- 1 cchq cchq 281 Jul 9 04:17 0022_merge_20200814_2045.py
-rw-rw-r-- 1 cchq cchq 411 Jul 9 04:17 0023_hqapikey_role_id.py
-rw-rw-r-- 1 cchq cchq 403 Jul 9 04:17 0024_add_login_as_all_users.py
-rw-rw-r-- 1 cchq cchq 414 Jul 9 04:17 0025_hqapikey_domain.py
-rw-rw-r-- 1 cchq cchq 4134 Jul 9 04:17 0026_sql_role_permissions.py
-rw-rw-r-- 1 cchq cchq 369 Jul 9 04:17 0027_role_permission_unique.py
-rw-rw-r-- 1 cchq cchq 446 Jul 9 04:17 0028_migrate_roles_to_sql.py
-rw-rw-r-- 1 cchq cchq 1621 Jul 9 04:17 0028_remove_domainpermissionsmirror.py
-rw-rw-r-- 1 cchq cchq 284 Jul 9 04:17 0029_merge_20210611_1436.py
-rw-rw-r-- 1 cchq cchq 1235 Jul 9 04:17 0029_userhistory.py
-rw-rw-r-- 1 cchq cchq 585 Jul 9 04:17 0030_userhistory_user_upload_record.py
-rw-rw-r-- 1 cchq cchq 323 Jul 9 04:17 0031_delete_domainpermissionsmirror.py
-rw-rw-r-- 1 cchq cchq 638 Jul 9 04:17 0032_add_download_reports_permission.py
-rw-rw-r-- 1 cchq cchq 782 Jul 9 04:17 0033_rename_sqluserrole.py
-rw-rw-r-- 1 cchq cchq 726 Jul 9 04:17 0034_auto_20210813_1121.py
-rw-rw-r-- 1 cchq cchq 1024 Jul 9 04:17 0035_auto_20210826_1155.py
-rw-rw-r-- 1 cchq cchq 542 Jul 9 04:17 0036_reset_user_history_records.py
-rw-rw-r-- 1 cchq cchq 1104 Jul 9 04:17 0037_add_edit_messaging_permission.py
-rw-rw-r-- 1 cchq cchq 1208 Jul 9 04:17 0038_add_ucr_permissions.py
-rw-rw-r-- 1 cchq cchq 510 Jul 9 04:17 0038_auto_20210930_1235.py
-rw-rw-r-- 1 cchq cchq 271 Jul 9 04:17 0039_merge_20211007_0556.py
-rw-rw-r-- 1 cchq cchq 581 Jul 9 04:17 0040_add_repr_and_changed_by_repr.py
-rw-rw-r-- 1 cchq cchq 443 Jul 9 04:17 0041_migrate_reprs.py
-rw-rw-r-- 1 cchq cchq 654 Jul 9 04:17 0042_deactivatemobileworkertrigger.py
-rw-rw-r-- 1 cchq cchq 426 Jul 9 04:17 0043_add_release_management_permission.py
-rw-rw-r-- 1 cchq cchq 432 Jul 9 04:17 0044_userrole_is_commcare_user_default.py
-rw-rw-r-- 1 cchq cchq 418 Jul 9 04:17 0045_add_view_tableau_permission.py
-rw-rw-r-- 1 cchq cchq 1156 Jul 9 04:17 0046_add_default_mobile_worker_role.py
-rw-rw-r-- 1 cchq cchq 352 Jul 9 04:17 0047_rename_sqlpermission_permission.py
-rw-rw-r-- 1 cchq cchq 700 Jul 9 04:17 0048_expiring_keys.py
-rw-rw-r-- 1 cchq cchq 511 Jul 9 04:17 0049_add_data_dict_permissions.py
-rw-rw-r-- 1 cchq cchq 473 Jul 9 04:17 0050_add_manage_attendance_tracking_permission.py
-rw-rw-r-- 1 cchq cchq 550 Jul 9 04:17 0051_add_attendance_tracking_privilege.py
-rw-rw-r-- 1 cchq cchq 415 Jul 9 04:17 0052_hqapikey_last_used.py
-rw-rw-r-- 1 cchq cchq 404 Jul 9 04:17 0053_userreportingmetadatastaging_fcm_token.py
-rw-rw-r-- 1 cchq cchq 987 Jul 9 04:17 0054_connectiduserlink.py
-rw-rw-r-- 1 cchq cchq 1446 Jul 9 04:17 0055_add_user_data.py
-rw-rw-r-- 1 cchq cchq 586 Jul 9 04:17 0056_add_manage_domain_alerts_permission.py
-rw-rw-r-- 1 cchq cchq 501 Jul 9 04:17 0057_populate_sql_user_data.py
-rw-rw-r-- 1 cchq cchq 1386 Jul 9 04:17 0058_createprofileanduserdata_copylocation.py
-rw-rw-r-- 1 cchq cchq 0 Jul 9 04:17 init.py
drwxrwxr-x 2 cchq cchq 4096 Jul 9 04:24 pycache

A git log in the current deployment shows:

commit b2b0383598d89238466cfebc384b0c6e142961bc (HEAD)
Author: Jonathan Tang Jonahtang738@gmail.com
Date: Wed Apr 3 16:59:31 2024 -0400

add custom user data that isn't part of the profile

This custom user data will be used to set up the web user

Just wanted to confirm I'm on the correct commit for changelog 80:
The following version of CommCare must be deployed before rolling out this change: [b2b03835](https://github.com/dimagi/commcare-hq/commit/b2b0383598d89238466cfebc384b0c6e142961bc)
That commit appears to for be for user migration 0058 (April 3), not 59. It looks like 59 comes with commit b822aaf

I assume I should deploy b822aaf (April 30) and retry changelog 80?
EDIT I see it had been renamed at that point and was actually introduced on 1 April with commit 2ee328c

I think I see what's happened here. The changelog references the wrong commits - the migration you run with

commcare-cloud monolith django-manage --tmux copy_invitation_supply_point

Is an older version that doesn't actually record completion. If you ran the migration on commit 612deda1af1f7729b6013765d99c88b0db9197aa or newer, it should work properly. However, since you've already run the migration, it'd probably be easier to just manually mark the migration complete. You can do that similar to how you checked whether it has been flagged as complete:

from corehq.apps.domain_migration_flags.api import set_migration_complete
set_migration_complete("*", "users_invitation_copy_supply_point_to_location_field")

Can you try running that and then try deploying again?

Just saw your edit - the migration command isn't actually what needs to be deployed first. Dimagi often follows a pattern described here for migrations. In brief, there are two components:

  1. A management command that performs the actual database operations.
  2. A django migration that runs the management command if necessary. This ensures the command is executed on all environments.

That pattern lets us run the management command outside of a deploy, which is useful if it will take a very long time, while (ideally) not requiring anything of self-hosters besides regular deploys.

In this case, the management command didn't properly flag that the operation was complete, so the django migration attempts to run it. However, this migration required a specific older commit to run, so this didn't work.

1 Like

Thanks Ethan, I've just run it. I'll take the server offline for users, take a snapshot and deploy shortly, appreciate your input!
Ed