Running commcare data export --since or automatic incremental

Jos_van_der_Ent · January 7, 2019, 1:40pm

Hi All,
I am currently migrating our database from Postgres to Sql server on Azure. I have the export tool scripted to run every hour with --since parameter and assumed the incremental filling was used (as stated in the documentation. Our base data is in the sql server, also data corrections (e.g. typing errors etc) are done in the database, so we need to be sure the corrections are not undone/overwritten by the export tool.
During the migration I became a bit confused on the auto increment option and --since parameter when running the data export tool.

From what I understood you can provide a since statement and any records from before that point won't be transferred (commcare_export_runs table).
Than there is also the auto increment, that the tool only transfers newer entries since last (successful) run.

Now when I ran the new updated version of export tool a warning message was shown:

2019-01-07 15:46:14,416 commcare_export.cli WARNING Checkpointing disabled when using '--since' or '--until'

Q1 Are the above statements correct?

Q2 also assuming one should choose between --since parameter or auto increment: if one chooses auto increment and changes are made to the excel reference file, the system will find a new md5 hash and so reload all data. Is that correct and could it be avoided?

How do other persons here run the export tool?
e.g. Running the script hourly and make the since statement move along with the current date (e.g. yesterday) and considering the second question how to avoid the tool to rerun everything once the md5 hash is changed e.g. due to updated Excel file?

-- update: was able to reproduce the error message, so added it to question above

Clayton_Sims · January 23, 2019, 9:03pm

Hi Jos,

In SQL Mode, the export tool will keep track of its progress on each export automatically in a DB table that will then be usable for incremental updating. The "--since" flag is really only useful for one-time exports, since it essentially overrides the cue for the incremental syncing.

Unfortunately, I don't know of an obvious way to prevent the MD5 change from re-spooling the export. Since the table's columns may have changed, and the export details may have changed, it's not clear how the tool would resolve any ambiguity between the two data models. I've recently helped some teams spin up the export tool with huge, huge datasets (millions of records) and we've been able to rebuild the database in less than a day, so it may not be necessary. One thing that we have found helpful when the tool is taking lots of time is to use a cloud setup to perform the initial "bootstrap" of data, then moving that bootstrapped database from the cloud down to a local server, which can then do incremental syncs afterwards.

-Clayton

Simon_Kelly · January 25, 2019, 10:05am

Hi Jos

There is actually a way to make the tool ignore changes to the query file. This is a fairly new feature that was not yet documented but I've now put up the docs for it: https://confluence.dimagi.com/display/commcarepublic/CommCare+Export+Advanced+Features#CommCareExportAdvancedFeatures-Userdefinedexportkeyforcheckpointing

Please let me know if there is anything unclear.

Regarding the lack of checkpointing when using --since, this has come up before but it's not clear exactly how the tool would know when to use the date from the command line vs the date in the checkpoint.

Jos_van_der_Ent · January 25, 2019, 11:47am

Hi Clatyon, Simon,
I understand in cases you want a complete copy of the Commcare data it works fine. In our case once inside the database we do our corrections in the database and do not want any tool to automatically override it.

@Simon_Kelly I think the new feature looks exactly what we need: Using the incremental adding and when a attribute/column is added the tool is allowed to add it for the new entries, but never change something from a certain earlier checkpoint.

Only from the documentation these things I do not fully understand yet:
I tried running the history command, but I got an error that the URL was missing an argument.:

commcare-export-utils: error: unrecognized arguments: URL

For the output argument I left the [SQL URL], am I supposed to replace that with something?

In case you want to run the tool always to only load new data, can I assume the key should be static since it replaces the md5 hash? Does it have any restrictions or can it be any string value?

Running forward on the questions above I started to ask myself the question how are case tables currently handled? is that based on 'server_date_modified' column and overwrites/updates all newer entries?

Simon_Kelly · January 28, 2019, 10:47am

Hi Jos

The tool is designed to make updates to the data as new data becomes available. The only way to enforce that no updates to existing rows are made is to prevent updates to the data in CommCare. For forms this probably isn't a big issues since forms are mostly write once however changes to forms will get pulled down by the tool e.g. archive / unarchive, edits.

Since cases get modified over time the export tool will continually pull down updates to cases (based on the 'server_date_modified' value) and the data in the export database will represent the 'snapshot' of the case at the time of export.

In terms of the history command, you should replace [SQL URL] with the connection URL to your database (same parameter you use when running the export tool, the value of the 'output' parameter).

Jos_van_der_Ent · January 28, 2019, 8:46pm

Hi Simon,
I had better look at your suggestions.
When replacing [SQL URL] for my connection string I got another error (only copied last error:

Listing checkpoints (most recent 10):
project: safisana
query filename: ProcessingPlant.xlsx
| Checkpoint Time | Batch end date | Export Complete | Project | Query Filename | Query MD5 | Key | CommCare HQ |
Traceback (most recent call last):
File "/home/safisana/.local/bin/commcare-export-utils", line 11, in
sys.exit(entry_point())
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils_cli.py", line 161, in entry_point
main(sys.argv[1:])
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils_cli.py", line 152, in main
exit(main_with_args(args))
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils_cli.py", line 157, in main_with_args
command().run(args)
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils_cli.py", line 59, in run
print_runs(runs)
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils.py", line 54, in print_runs
'rows': rows
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/writers.py", line 238, in write_table
self.output_stream.write('| %s |\n' % row_template.format(*['-' * width for width in col_widths]))
IndexError: tuple index out of range

I tried checking the specific utils_cli and utils from the repository to see if I can find the reason, but I am not skilled enough to resolve it. (this happend in (K)Ubuntu OS and Windows.

I also tried bypassing this step to direct set the custom key by running the last command in the documentation:

commcare-export-utils set-key --project X --query A.xlsx --output [SQL URL] --checkpoint-key my-key

I tried a self made hash and the last md5 hash generated by the export tool, but with both an error was thrown that I should use a value from history command:

usage: commcare-export-utils [-h] {history,set-checkpoint-key} ...
commcare-export-utils: error: argument command: invalid choice: 'set-key' (choose from 'history', 'set-checkpoint-key')

Could this error with the history command be due to the fact that the MD5 hash in the table 'commcare-export-runs' is not similar to my current xlsx file (made updates since running tool without --since statement?

Jos_van_der_Ent · February 1, 2019, 9:00am

Sorry for the double post:
I did another upgrade of all packages and now I got the history command to work.

Than when trying to convert to the 'key filling'
@Simon_Kelly It appear the wiki page still has an error:

$ commcare-export-utils set-key --project X --query A.xlsx --output [SQL URL] --checkpoint-key my-key

With this command I get an error that 'set-key' is supposed to the be 'set-checkpoint-key'.
When using the 'set-checkpoint-key' and use the last MD5 as key I get the following message:

No checkpoint found with args matching those provided.

So it looks like I am not providing the right argument here. Any suggestions?

Simon_Kelly · February 5, 2019, 9:15am

Hi Jos, apologies for the slow response and for getting the command name wrong. You are correct that it should be set-checkpoint-key.

The arguments that you need to provide are the same ones you normally use when running the commcare-export command:

--project: this should be the name of the project on CommCare HQ
--query: the path to the query file you are using to run the export
--output: the DB url you are using to export the data to
--checkpoint-key: any key you like that is going to be unique to this export

You should be able to use the history command to list the checkpoints using the same arguments (excluding the checkpoint-key arg).

Jos_van_der_Ent · February 5, 2019, 2:27pm

HI Simon,
I am actually able to run the history command, but than when I add the ' --checkpoint-key 20190205' (and replace history with ‘set-checkpoint-key’. I still get the following error:

No checkpoint found with args matching those provided.

I tried running it with 'test' query file, than all seems to work, as written in the documentation (and rows are added to the commcare_export_runs table)

Could it be that the md5 hash of my current query file is different from the one which is the the commcare_export_runs table?

Simon_Kelly · February 25, 2019, 3:00pm

Hi Jos

It is possible that the MD5 of the current query file is different but then when you run the history command with that query file it shouldn't list any checkpoints. If it's listing checkpoints with you run history but still giving you the error when you run set-checkpoint-key then that's likely a bug.

I did just test this again locally and was able to check the history and set the key successfully:

$ commcare-export-utils history --project demo --query queries/demo.xlsx  --output postgresql://user:pass@localhost/export

| Checkpoint Time            | Batch end date             | Export Complete | Project | Query Filename     | Query MD5                        | Key | CommCare HQ |
| -------------------------- | -------------------------- | --------------- | ------- | ------------------ | -------------------------------- | --- | ----------- |
| 2019-02-25T12:34:15.454245 | 2019-02-25T12:34:15.434912 | True            | demo    | queries/demo.xlsx  | 5d137f728c8ddd13c75966d9ee56a171 |     | prod        |

$commcare-export-utils set-checkpoint-key --project demo --query queries/demo.xlsx  --output postgresql://user:pass@localhost/export --checkpoint-key demo_export_1

| Checkpoint Time            | Batch end date             | Export Complete | Project | Query Filename     | Query MD5                        | Key | CommCare HQ |
| -------------------------- | -------------------------- | --------------- | ------- | ------------------ | -------------------------------- | --- | ----------- |
| 2019-02-25T12:34:15.454245 | 2019-02-25T12:34:15.434912 | True            | demo    | queries/demo.xlsx  | 5d137f728c8ddd13c75966d9ee56a171 |     | prod        |

        Do you want to set the key for this checkpoint to 'demo_export_1'? [y/N]
        y

Updated checkpoint:

| Checkpoint Time            | Batch end date             | Export Complete | Project | Query Filename     | Query MD5                        | Key           | CommCare HQ |
| -------------------------- | -------------------------- | --------------- | ------- | ------------------ | -------------------------------- | ------------- | ----------- |
| 2019-02-25T12:34:15.454245 | 2019-02-25T12:34:15.434912 | True            | demo    | queries/demo.xlsx  | 5d137f728c8ddd13c75966d9ee56a171 | demo_export_1 | prod        |

Jos_van_der_Ent · February 26, 2019, 3:29pm

I think it indeed was caused by the different md5 hash or so. I just made a backup of the database and ran the export tool without --since command (overwriting all my corrections) than placed back my backup except for the commcare_export_runs table and it is runs as discussed in the wiki pages.

Now the tool is writing only incremental data it also made the tool run much faster than before (ca 1 o2 min). Thanks for the feedback!

Simon_Kelly · February 26, 2019, 6:21pm

Great! I'm glad you were able to get it working.