I am currently migrating our database from Postgres to SQL Server on Azure. I have the export tool scripted to run every hour with the --since parameter and assumed incremental filling was used (as stated in the documentation). Our base data is in the SQL Server database, and data corrections (e.g. fixing typing errors) are also done there, so we need to be sure the corrections are not undone/overwritten by the export tool.
During the migration I became a bit confused about the auto-increment option and the --since parameter when running the data export tool.
From what I understood, you can provide a --since value and any records from before that point won't be transferred (commcare_export_runs table).
Then there is also auto-increment, where the tool only transfers entries newer than the last (successful) run.
Now when I ran the newly updated version of the export tool, a warning message was shown:
2019-01-07 15:46:14,416 commcare_export.cli WARNING Checkpointing disabled when using '--since' or '--until'
Q1: Are the above statements correct?
Q2: Also, assuming one should choose between the --since parameter and auto-increment: if one chooses auto-increment and changes are made to the Excel reference file, the tool will find a new MD5 hash and so reload all data. Is that correct, and could it be avoided?
How do other people here run the export tool?
E.g. running the script hourly and making the --since value move along with the current date (e.g. yesterday)? And, considering the second question, how do you avoid the tool rerunning everything once the MD5 hash changes, e.g. due to an updated Excel file?
-- Update: I was able to reproduce the error message, so I added it to the question above.
In SQL mode, the export tool will keep track of its progress on each export automatically in a DB table that will then be usable for incremental updating. The '--since' flag is really only useful for one-time exports, since it essentially overrides the cue for the incremental syncing.
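To make that checkpoint-driven behaviour concrete, here is a hedged sketch in plain sqlite3 (invented table and column names, not the tool's real schema): each run records its end time in a checkpoint table, and the next run only fetches rows modified after the last recorded checkpoint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkpoints (run_time TEXT)")
conn.execute("CREATE TABLE source_rows (id INTEGER, server_date_modified TEXT)")
conn.executemany(
    "INSERT INTO source_rows VALUES (?, ?)",
    [(1, "2019-01-01"), (2, "2019-01-05"), (3, "2019-01-08")],
)

def incremental_fetch(conn):
    # Use the last successful checkpoint as the lower bound, like the tool's
    # automatic incremental cue; fall back to a full fetch on the first run.
    last = conn.execute("SELECT MAX(run_time) FROM checkpoints").fetchone()[0]
    if last is None:
        rows = conn.execute("SELECT id FROM source_rows").fetchall()
    else:
        rows = conn.execute(
            "SELECT id FROM source_rows WHERE server_date_modified > ?", (last,)
        ).fetchall()
    conn.execute("INSERT INTO checkpoints VALUES (date('now'))")
    return [r[0] for r in rows]

print(incremental_fetch(conn))  # first run fetches everything: [1, 2, 3]
conn.execute("DELETE FROM checkpoints")
conn.execute("INSERT INTO checkpoints VALUES ('2019-01-06')")
print(incremental_fetch(conn))  # only rows after the checkpoint: [3]
```

Passing an explicit --since would be the equivalent of hard-coding the lower bound, which is why it conflicts with (and disables) the automatic checkpointing.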
Unfortunately, I don't know of an obvious way to prevent the MD5 change from re-spooling the export. Since the table's columns may have changed, and the export details may have changed, it's not clear how the tool would resolve any ambiguity between the two data models. I've recently helped some teams spin up the export tool with huge datasets (millions of records) and we've been able to rebuild the database in less than a day, so avoiding the reload may not be necessary. One thing we have found helpful when the tool takes a long time is to use a cloud setup to perform the initial "bootstrap" of the data, then move that bootstrapped database from the cloud down to a local server, which can do incremental syncs afterwards.
Regarding the lack of checkpointing when using --since: this has come up before, but it's not clear exactly how the tool would know when to use the date from the command line vs. the date in the checkpoint.
Hi Clayton, Simon,
I understand that in cases where you want a complete copy of the CommCare data it works fine. In our case, once the data is inside the database we make our corrections there and do not want any tool to automatically override them.
@Simon_Kelly I think the new feature looks like exactly what we need: using incremental adding, where the tool is allowed to add a new attribute/column for new entries, but never changes anything from before a certain earlier checkpoint.
There are just a few things from the documentation that I do not fully understand yet:
I tried running the history command, but I got an error that the URL was missing an argument.
For the output argument I left [SQL URL] as-is; am I supposed to replace that with something?
In case you want to run the tool to always load only new data, can I assume the key should be static, since it replaces the MD5 hash? Does it have any restrictions, or can it be any string value?
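To illustrate what I understand the key to do, here is a hedged sketch with invented names (not the tool's actual code): if checkpoints are looked up by an explicit static key instead of the query file's MD5 hash, the lookup still matches after the Excel file is edited, whereas the hash-based lookup does not.

```python
import hashlib

checkpoints = {}  # lookup table: identifier -> last run time

def checkpoint_id(query_bytes, key=None):
    # With an explicit key the identifier is stable; without one it is the
    # MD5 of the query-file contents, which changes on every edit.
    return key if key is not None else hashlib.md5(query_bytes).hexdigest()

# Record a checkpoint under a static key for version 1 of the query file.
checkpoints[checkpoint_id(b"query file v1", key="plant-export")] = "2019-01-06"

# After editing the query file, the keyed lookup still finds the checkpoint...
print(checkpoint_id(b"query file v2", key="plant-export") in checkpoints)  # True
# ...whereas the hash-based lookup would miss and trigger a full reload.
print(checkpoint_id(b"query file v2") in checkpoints)  # False
```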
Following on from the questions above, I started to ask myself how case tables are currently handled. Is that based on the 'server_date_modified' column, overwriting/updating all newer entries?
The tool is designed to make updates to the data as new data becomes available. The only way to enforce that no updates to existing rows are made is to prevent updates to the data in CommCare. For forms this probably isn't a big issue, since forms are mostly write-once; however, changes to forms (e.g. archive/unarchive, edits) will still get pulled down by the tool.
Since cases get modified over time the export tool will continually pull down updates to cases (based on the ‘server_date_modified’ value) and the data in the export database will represent the ‘snapshot’ of the case at the time of export.
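As a hedged illustration of that snapshot behaviour (invented schema, not the tool's actual code): each incoming case row replaces the stored row wholesale, so a manual correction made directly in the export database is lost the next time that case is modified in CommCare and re-exported.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (case_id TEXT PRIMARY KEY,"
    " name TEXT, server_date_modified TEXT)"
)

def apply_export_row(conn, case_id, name, modified):
    # INSERT OR REPLACE mirrors the "snapshot at time of export" behaviour:
    # the incoming row wins, including over manual corrections.
    conn.execute(
        "INSERT OR REPLACE INTO cases VALUES (?, ?, ?)",
        (case_id, name, modified),
    )

apply_export_row(conn, "c1", "Jon", "2019-01-01")
# A data correction made directly in the database...
conn.execute("UPDATE cases SET name = 'John' WHERE case_id = 'c1'")
# ...is overwritten once the case is touched again in CommCare and re-exported.
apply_export_row(conn, "c1", "Jon Jr", "2019-01-08")
print(conn.execute("SELECT name FROM cases WHERE case_id = 'c1'").fetchone()[0])  # Jon Jr
```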
In terms of the history command, you should replace [SQL URL] with the connection URL to your database (the same parameter you use when running the export tool: the value of the 'output' parameter).
I will have a better look at your suggestions.
When I replaced [SQL URL] with my connection string I got another error (I only copied the last error):
Listing checkpoints (most recent 10):
query filename: ProcessingPlant.xlsx
| Checkpoint Time | Batch end date | Export Complete | Project | Query Filename | Query MD5 | Key | CommCare HQ |
Traceback (most recent call last):
File "/home/safisana/.local/bin/commcare-export-utils", line 11, in
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils_cli.py", line 161, in entry_point
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils_cli.py", line 152, in main
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils_cli.py", line 157, in main_with_args
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils_cli.py", line 59, in run
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/utils.py", line 54, in print_runs
File "/home/safisana/.local/lib/python3.6/site-packages/commcare_export/writers.py", line 238, in write_table
self.output_stream.write('| %s |\n' % row_template.format(*['-' * width for width in col_widths]))
IndexError: tuple index out of range
I tried checking the specific utils_cli and utils files from the repository to see if I could find the reason, but I am not skilled enough to resolve it. (This happened on both (K)Ubuntu and Windows.)
I also tried bypassing this step and directly setting the custom key by running the last command in the documentation:
Could this error with the history command be due to the fact that the MD5 hash in the 'commcare_export_runs' table does not match my current xlsx file (I made updates to it since running the tool without the --since statement)?
It is possible that the MD5 of the current query file is different, but then when you run the history command with that query file it shouldn't list any checkpoints. If it lists checkpoints when you run history but still gives you the error when you run set-checkpoint-key, then that's likely a bug.
I did just test this again locally and was able to check the history and set the key successfully:
I think it was indeed caused by the different MD5 hash or something similar. I made a backup of the database and ran the export tool without the --since flag (overwriting all my corrections), then restored my backup except for the commcare_export_runs table, and now it runs as discussed in the wiki pages.
Now that the tool is writing only incremental data, it also runs much faster than before (ca. 1 or 2 min). Thanks for the feedback!