Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to doit #5

Merged
merged 62 commits into from
Jan 13, 2025
Merged

Switch to doit #5

merged 62 commits into from
Jan 13, 2025

Conversation

crdanielbusch
Copy link
Contributor

@crdanielbusch crdanielbusch commented Dec 2, 2024

Description

Changes as discussed in #3

the make targets don't work for me because of locking issues

The set up now is pydoit -> datalad.api -> download / read script. The files are now unlocked before the command is executed.

Will this fail is you run it when not connected to the internet? I usually don't link read to download so data is not downloaded again if I have to read it again because of a processing code change

I don't remember linking the two commands, but it doesn't matter anyway, because we are using pydoit instead of make now.

This make target says "nothing to be done" after changing the data reading code. This should be changed, so we can re-read data after making changes to the code.

Re-reading should now be possible.

More general comment: I think it would e good to have an option to re-read older data as well,e.g. if we read additional data or variables.

Should be possible now. The config what data to read is in definitions.py. Maybe it should live somwhere else?

For me download_all_domains tries to read the data into primap2 format. This is not what I would intuitively expect it to do (also it fails)

I think something went wrong with the Makefile and how the commands are linked.

Category is to be changed, after the category tree is done?

Checklist

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Changelog item added to changelog/

Notes on implementation

  • Would be nice to test the doit command, but it seems rather complex to test it together with datalad.

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python3 scripts/read_data_set.py --save_path temp_test --run_id 2024",
 "dsid": "934d913e-8268-4342-aa0c-3702ee516d10",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "downloaded_data/pre_post_agricultural_production/2024-11-14"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python3 scripts/read_data_set.py --save_path extracted_data --run_id 2024",
 "dsid": "934d913e-8268-4342-aa0c-3702ee516d10",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "extracted_data"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
# ("land_use_fires", "2024-11-14"),
# ("land_use_forests", "2024-11-14"),
("pre_post_agricultural_production", "2024-11-14"),
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the configuration for the data reading step. That way we can choose which domain and release to read.

@crdanielbusch crdanielbusch mentioned this pull request Dec 10, 2024
3 tasks
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python3 scripts/download_all_domains.py",
 "dsid": "934d913e-8268-4342-aa0c-3702ee516d10",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "downloaded_data"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python3 scripts/download_all_domains.py",
 "dsid": "934d913e-8268-4342-aa0c-3702ee516d10",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "downloaded_data"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python3 scripts/download_all_domains.py",
 "dsid": "934d913e-8268-4342-aa0c-3702ee516d10",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "downloaded_data"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
# Conflicts:
#	src/faostat_data_primap/download.py
if not save_path.is_symlink():
print(f"Skipping download of {save_path} because it already exists.")
return False
os.remove(save_path)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to solve the problem.

  • Check if the file to download already exists (will return True for symlinks as well)
  • Check if it's a symlink
    • if no, we have the file on disc (no need to download again)
    • if yes, it's not available on disc and datalad cannot unlock the file. In that case we need to delete the symlink before downloading

@crdanielbusch crdanielbusch merged commit 239167b into read-dataset Jan 13, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant