Revamp download to local dir process #2223
Conversation
Awesome! Super nice side effect of having `resume_downloads` on by default.
I played with it locally and we already discussed potential improvements.
The rest LGTM!
Co-authored-by: Lysandre Debut <[email protected]>
Looks great! (from a distance :))
Co-authored-by: Pedro Cuenca <[email protected]>
Addressed all comments and fixed the CI (failures were not only due to this PR but I fixed them here anyway). Let's get this merged! 🎉
Implements #1738 (and especially #1738 (comment)) 🙈
What does this PR do?

When downloading to a local dir (`local_dir`):
- We do not use the cache but rely on a `.huggingface/` folder inside the local dir instead.
- A `.gitignore` prevents the `.huggingface/` folder from being committed.
- If `hf_transfer` is enabled, we do not resume downloads (not supported). One can use `force_download` to force a download from scratch.

How it works?
When downloading a file `file.txt` to the local dir `data/`:
- If `data/file.txt` exists, `data/.huggingface/download/file.txt.metadata` exists, and `data/file.txt` has been modified before the metadata file was saved (metadata contains a timestamp), then we know the file's `commit_hash` and `etag`. Otherwise we consider that we don't have any info on the local file.
- If `revision == metadata.commit_hash`, then the file is valid => return.
- If `remote etag == metadata.etag` => update local metadata => return.
- If the remote etag is a sha256 and we don't have local metadata => we hash the local file. If `sha256 == remote etag` => it's a valid LFS file => return.
- If `force_download=True` is passed, all of the above is skipped => we download the file no matter what.
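The validation steps above can be sketched roughly in Python. This is a simplified illustration, not the library's actual implementation: `LocalFileMetadata` and `is_local_file_valid` are hypothetical names, and the sha256 heuristic is approximated by the etag length.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalFileMetadata:
    """Hypothetical stand-in for the content of a .metadata file."""
    commit_hash: str
    etag: str
    timestamp: float  # when the metadata file was saved


def is_local_file_valid(
    file_path: str,
    metadata: Optional[LocalFileMetadata],
    revision: str,
    remote_etag: str,
) -> bool:
    """Decide whether an already-downloaded file can be reused."""
    if metadata is not None:
        # Metadata only counts if the file was not modified after it was saved.
        if os.path.getmtime(file_path) > metadata.timestamp:
            metadata = None
    if metadata is not None:
        if revision == metadata.commit_hash:
            return True  # exact commit match => file is valid
        if remote_etag == metadata.etag:
            return True  # same content on the remote => valid
    elif len(remote_etag) == 64:  # looks like a sha256 => LFS file
        # No usable metadata: hash the local file and compare.
        sha256 = hashlib.sha256()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                sha256.update(chunk)
        return sha256.hexdigest() == remote_etag
    return False
```

If none of the checks pass (and `force_download=True` skips them all), the file is downloaded from scratch.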
What to review?

This is a large PR (+1,265 −760) as it touches the download logic in depth. However, a lot of the changes consist of moving parts of the code into private helpers to avoid duplicating the logic between `_hf_hub_download_local_dir` and `_hf_hub_download_cache_dir`.

Important changes are:
- `_local_folder.py` => handles metadata in the local folder (i.e. inside `.huggingface/`)
- `file_download.py` => where everything happens. Best to read the file instead of the raw changes. The most important part is `_hf_hub_download_local_dir`, while `hf_hub_download` and `_hf_hub_download_cache_dir` are iso-feature compared to before.
- `test_file_download.py` => all the new test cases in `HfHubDownloadToLocalDir`
Doc changes:
- `cli.md`, `download.md`, `environment_variables.md`
- `snapshot_download.py` => only some docs + a few tweaks, no real update
- `hf_api.py` => only some docs + a few tweaks, no real update

Less important:
- `.huggingface/` folder handling (`_commit_api.py` + `test_commit_api.py` + `test_utils_paths.py`)
- `test_cli.py` => not relevant
- `command/download.py` => deprecated `--local-dir-use-symlinks` in the CLI
- `constants.py` / `hub_mixin.py` / `keras_mixin.py` => some deprecations

Example
Download `README.md` and `model.safetensors` from the `gpt2` repo into the `./data/gpt2` folder:

Resulting tree:
```
# tree -alh data/
[4.0K]  data/
└── [4.0K]  gpt2
    ├── [4.0K]  .huggingface
    │   ├── [4.0K]  download
    │   │   ├── [   0]  model.safetensors.lock
    │   │   ├── [ 182]  model.safetensors.metadata
    │   │   ├── [   0]  README.md.lock
    │   │   └── [ 158]  README.md.metadata
    │   └── [   1]  .gitignore
    ├── [523M]  model.safetensors
    └── [7.9K]  README.md

3 directories, 7 files
```
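The exact snippet that produced this tree is not preserved above; a minimal sketch of an equivalent download using the public `hf_hub_download` API (assuming a `huggingface_hub` version that includes this PR) could look like:

```python
# Sketch: download README.md and model.safetensors from the "gpt2" repo
# into ./data/gpt2; bookkeeping lands in ./data/gpt2/.huggingface/.
from huggingface_hub import hf_hub_download


def fetch_gpt2_files(local_dir: str = "data/gpt2") -> None:
    for filename in ("README.md", "model.safetensors"):
        hf_hub_download(repo_id="gpt2", filename=filename, local_dir=local_dir)


if __name__ == "__main__":
    fetch_gpt2_files()
```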
How to try it?

TODO:
- a guide "how `.huggingface/` works?" (similar to the "Cache" guide?)
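Until such a guide exists, the `.huggingface/` layout can be inspected directly. A small sketch (the helper name is made up; paths follow the tree shown above):

```python
# Sketch: list the per-file metadata stored under <local_dir>/.huggingface/download/.
from pathlib import Path


def list_download_metadata(local_dir: str) -> list:
    """Return the names of *.metadata files tracked for this local dir."""
    download_dir = Path(local_dir) / ".huggingface" / "download"
    if not download_dir.is_dir():
        return []
    return sorted(p.name for p in download_dir.glob("*.metadata"))
```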