Revamp download to local dir process #2223

Merged
merged 43 commits into from
Apr 29, 2024
Changes from 28 commits
1004434
still an early draft
Wauplin Apr 12, 2024
5a8605e
this is better
Wauplin Apr 12, 2024
68a6cf1
fix
Wauplin Apr 12, 2024
9b25f38
Merge branch 'main' intto 1738-revampt-download-local-dir
Wauplin Apr 24, 2024
8e903f8
revampt/refactor download process
Wauplin Apr 24, 2024
5f610ee
resume download by default + do not upload .huggingface folder
Wauplin Apr 24, 2024
5a9762f
compute sha256 if necessary
Wauplin Apr 24, 2024
283977d
fix hash
Wauplin Apr 24, 2024
e909022
add tests + fix some stuff
Wauplin Apr 24, 2024
39cfef4
fix snapshot download tests
Wauplin Apr 24, 2024
dbece97
fix test
Wauplin Apr 24, 2024
0206964
lots of docs
Wauplin Apr 24, 2024
82b46b3
add secu
Wauplin Apr 24, 2024
3300b28
as constant
Wauplin Apr 24, 2024
c606a94
dix
Wauplin Apr 24, 2024
95171ef
fix tests
Wauplin Apr 24, 2024
7180746
remove unused code
Wauplin Apr 24, 2024
4e664d4
don't use jsons
Wauplin Apr 24, 2024
7bb263e
style
Wauplin Apr 24, 2024
3595042
Apply suggestions from code review
Wauplin Apr 25, 2024
3401880
Apply suggestions from code review
Wauplin Apr 25, 2024
9210648
Warn more about resume_download
Wauplin Apr 25, 2024
fb477e5
fix test
Wauplin Apr 25, 2024
0eacbc9
Add tests specific to .huggingface folder
Wauplin Apr 25, 2024
1a4320a
remove advice to use hf_transfer when downloading from cli
Wauplin Apr 25, 2024
8c9dc8b
fix torhc test
Wauplin Apr 25, 2024
6260a17
more test fix
Wauplin Apr 25, 2024
c788e2d
Merge branch 'main' into 1738-revampt-download-local-dir
Wauplin Apr 25, 2024
3a45f4b
feedback
Wauplin Apr 25, 2024
4f6f531
suggested changes
Wauplin Apr 26, 2024
41c8ae3
more robust
Wauplin Apr 26, 2024
84f55ff
Apply suggestions from code review
Wauplin Apr 29, 2024
f5d1faa
comment
Wauplin Apr 29, 2024
edc9790
commen
Wauplin Apr 29, 2024
a768426
Merge branch 'main' into 1738-revampt-download-local-dir
Wauplin Apr 29, 2024
d0ea3ea
robust tests
Wauplin Apr 29, 2024
dffa539
fix CI
Wauplin Apr 29, 2024
d414825
ez
Wauplin Apr 29, 2024
9e6d569
more ribust?
Wauplin Apr 29, 2024
e6fe766
allow for 1s diff
Wauplin Apr 29, 2024
28991c9
don't raise on unlink
Wauplin Apr 29, 2024
a0b61a1
style
Wauplin Apr 29, 2024
fccabe0
robustenss
Wauplin Apr 29, 2024
2 changes: 1 addition & 1 deletion Makefile
@@ -14,8 +14,8 @@ quality:
	mypy src

style:
	ruff check --fix $(check_dirs) # linter
	ruff format $(check_dirs) # formatter
	ruff check --fix $(check_dirs) # linter
	python utils/check_contrib_list.py --update
	python utils/check_static_imports.py --update
	python utils/generate_async_inference_client.py --update
12 changes: 7 additions & 5 deletions docs/source/en/guides/cli.md
@@ -224,18 +224,20 @@ The examples above show how to download from the latest commit on the main branc

### Download to a local folder

The recommended (and default) way to download files from the Hub is to use the cache-system. However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow closer to what git commands offer. You can do that using the `--local_dir` option. The file is downloaded to a tmp file and then moved to the local dir to avoid having partially downloaded files in the local folder.
The recommended (and default) way to download files from the Hub is to use the cache-system. However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow closer to what git commands offer. You can do that using the `--local_dir` option.

<Tip warning={true}>
Note that a `.huggingface/` folder will be created at the root of your local directory, containing metadata about the downloaded files. This prevents re-downloading files if you re-run the command. While this mechanism is not as robust as the main cache-system, it's optimized for regularly pulling the latest version of a repository.

Downloading to a local directory comes with some downsides. Please check out the limitations in the [Download](./download#download-files-to-local-folder) guide before using `--local-dir`.
<Tip>

For more details on how downloading to a local file works, check out the [download](./download.md#download-files-to-a-local-folder) guide.

</Tip>

```bash
>>> huggingface-cli download adept/fuyu-8b model-00001-of-00002.safetensors --local-dir .
>>> huggingface-cli download adept/fuyu-8b model-00001-of-00002.safetensors --local-dir fuyu
...
./model-00001-of-00002.safetensors
fuyu/model-00001-of-00002.safetensors
```

### Specify cache directory
55 changes: 19 additions & 36 deletions docs/source/en/guides/download.md
@@ -126,42 +126,25 @@ files except `vocab.json`.
>>> snapshot_download(repo_id="gpt2", allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json")
```

## Download file(s) to local folder

The recommended (and default) way to download files from the Hub is to use the [cache-system](./manage-cache).
You can define your cache location by setting `cache_dir` parameter (both in [`hf_hub_download`] and [`snapshot_download`]).

However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow
closer to what `git` commands offer. You can do that using the `local_dir` and `local_dir_use_symlinks` parameters:
- `local_dir` must be a path to a folder on your system. The downloaded files will keep the same file structure as in the
repo. For example if `filename="data/train.csv"` and `local_dir="path/to/folder"`, then the returned filepath will be
`"path/to/folder/data/train.csv"`.
- `local_dir_use_symlinks` defines how the file must be saved in your local folder.
- The default behavior (`"auto"`) is to duplicate small files (<5MB) and use symlinks for bigger files. Symlinks allow
to optimize both bandwidth and disk usage. However manually editing a symlinked file might corrupt the cache, hence
the duplication for small files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD`
environment variable.
- If `local_dir_use_symlinks=True` is set, all files are symlinked for an optimal disk space optimization. This is
for example useful when downloading a huge dataset with thousands of small files.
- Finally, if you don't want symlinks at all you can disable them (`local_dir_use_symlinks=False`). The cache directory
will still be used to check whether the file is already cached or not. If already cached, the file is **duplicated**
from the cache (i.e. saves bandwidth but increases disk usage). If the file is not already cached, it will be
downloaded and moved directly to the local dir. This means that if you need to reuse it somewhere else later, it
will be **re-downloaded**.

Here is a table that summarizes the different options to help you choose the parameters that best suit your use case.

<!-- Generated with https://www.tablesgenerator.com/markdown_tables -->
| Parameters | File already cached | Returned path | Can read path? | Can save to path? | Optimized bandwidth | Optimized disk usage |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| `local_dir=None` | | symlink in cache | ✅ | ❌<br>_(save would corrupt the cache)_ | ✅ | ✅ |
| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks="auto"` | | file or symlink in folder | ✅ | ✅ _(for small files)_ <br> ⚠️ _(for big files do not resolve path before saving)_ | ✅ | ✅ |
| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=True` | | symlink in folder | ✅ | ⚠️<br>_(do not resolve path before saving)_ | ✅ | ✅ |
| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | No | file in folder | ✅ | ✅ | ❌<br>_(if re-run, file is re-downloaded)_ | ⚠️<br>(multiple copies if ran in multiple folders) |
| `local_dir="path/to/folder"`<br>`local_dir_use_symlinks=False` | Yes | file in folder | ✅ | ✅ | ⚠️<br>_(file has to be cached first)_ | ❌<br>_(file is duplicated)_ |

**Note:** if you are on a Windows machine, you need to enable developer mode or run `huggingface_hub` as admin to enable
symlinks. Check out the [cache limitations](../guides/manage-cache#limitations) section for more details.
## Download file(s) to a local folder

By default, we recommend using the [cache system](./manage-cache) to download files from the Hub. You can specify a custom cache location using the `cache_dir` parameter in [`hf_hub_download`] and [`snapshot_download`], or by setting the [`HF_HOME`](../package_reference/environment_variables#hf_home) environment variable.

However, if you need to download files to a specific folder, you can pass a `local_dir` parameter to the download function. This is useful to get a workflow closer to what `git` commands offer. The downloaded files will maintain their original file structure within the specified folder. For example, if `filename="data/train.csv"` and `local_dir="path/to/folder"`, the resulting filepath will be `"path/to/folder/data/train.csv"`.
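As a rough illustration of the path mapping described above, here is a minimal sketch. `expected_local_path` is a hypothetical helper written for this example, not part of `huggingface_hub`; it only mirrors the rule that the repo file structure is preserved under `local_dir`. With the library installed, the corresponding download would be a call such as `hf_hub_download(repo_id=..., filename="data/train.csv", local_dir="path/to/folder")`.

```python
from pathlib import Path, PurePosixPath


def expected_local_path(local_dir: str, filename: str) -> Path:
    """Hypothetical helper: where a repo file would land under `local_dir`."""
    # Filenames in a Hub repo use "/" separators, whatever the local OS is,
    # so split them as POSIX paths before joining with the local directory.
    return Path(local_dir).joinpath(*PurePosixPath(filename).parts)


print(expected_local_path("path/to/folder", "data/train.csv").as_posix())
# path/to/folder/data/train.csv
```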

Note that a `.huggingface/` folder will be created at the root of your local directory, containing metadata about the downloaded files. This prevents re-downloading files if you re-run your script. While this mechanism is not as robust as the main cache-system, it's optimized for regularly pulling the latest version of a repository.

<Tip>

After completing the download, you can safely remove the `.huggingface/` folder if you no longer need it. However, be aware that re-running your script without this folder may result in longer recovery times, as metadata will be lost. Rest assured that your local data will remain intact and unaffected.

</Tip>

<Tip>

Don't worry about the `.huggingface/` folder when committing changes to the Hub! This folder is automatically ignored by both `git` and [`upload_folder`].

</Tip>
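To make the "automatically ignored" behavior more concrete, the sketch below filters a list of repo-relative paths against ignore patterns using the standard-library `fnmatch` module. The pattern list and the `filter_uploadable` function are assumptions made for this example; they are not the library's actual defaults or API.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Hypothetical patterns for illustration only; not the library's real defaults.
EXAMPLE_IGNORE_PATTERNS = [
    ".huggingface",
    ".huggingface/*",
    "*/.huggingface",
    "*/.huggingface/*",
]


def filter_uploadable(
    paths: Iterable[str], ignore_patterns: Iterable[str] = EXAMPLE_IGNORE_PATTERNS
) -> List[str]:
    """Keep only the paths that match none of the ignore patterns."""
    patterns = list(ignore_patterns)
    # fnmatchcase is case-sensitive on every platform; "*" also matches "/".
    return [p for p in paths if not any(fnmatchcase(p, pat) for pat in patterns)]


files = ["README.md", ".huggingface/download/README.md.metadata", "data/train.csv"]
print(filter_uploadable(files))
# ['README.md', 'data/train.csv']
```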

## Download from the CLI

2 changes: 1 addition & 1 deletion docs/source/en/guides/integrations.md
@@ -98,7 +98,7 @@ common to offer parameters like:
- `token`: to download from a private repo
- `revision`: to download from a specific branch
- `cache_dir`: to cache files in a specific directory
- `force_download`/`resume_download`/`local_files_only`: to reuse the cache or not
- `force_download`/`local_files_only`: to reuse the cache or not
- `proxies`: configure HTTP session

When pushing models, similar parameters are supported:
5 changes: 1 addition & 4 deletions docs/source/en/package_reference/environment_variables.md
@@ -67,10 +67,7 @@ For more details, see [logging reference](../package_reference/utilities#hugging

### HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD

Integer value to define under which size a file is considered as "small". When downloading files to a local directory,
small files will be duplicated to ease user experience while bigger files are symlinked to save disk usage.

For more details, see the [download guide](../guides/download#download-files-to-local-folder).
This environment variable has been deprecated and is now ignored by `huggingface_hub`. Downloading files to the local dir does not rely on symlinks anymore.

### HF_HUB_ETAG_TIMEOUT

12 changes: 7 additions & 5 deletions src/huggingface_hub/_commit_api.py
@@ -19,6 +19,7 @@
from .file_download import hf_hub_url
from .lfs import UploadInfo, lfs_upload, post_lfs_batch_info
from .utils import (
    FORBIDDEN_FOLDERS,
    EntryNotFoundError,
    chunk_iterable,
    get_session,
@@ -254,11 +255,12 @@ def _validate_path_in_repo(path_in_repo: str) -> str:
        raise ValueError(f"Invalid `path_in_repo` in CommitOperation: '{path_in_repo}'")
    if path_in_repo.startswith("./"):
        path_in_repo = path_in_repo[2:]
    if any(part == ".git" for part in path_in_repo.split("/")):
        raise ValueError(
            "Invalid `path_in_repo` in CommitOperation: cannot update files under a '.git/' folder (path:"
            f" '{path_in_repo}')."
        )
    for forbidden in FORBIDDEN_FOLDERS:
        if any(part == forbidden for part in path_in_repo.split("/")):
            raise ValueError(
                f"Invalid `path_in_repo` in CommitOperation: cannot update files under a '{forbidden}/' folder (path:"
                f" '{path_in_repo}')."
            )
    return path_in_repo
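
The generalized check in this hunk can be tried in isolation with the sketch below. It is a simplified stand-in, not the library function: `validate_path_in_repo` and `EXAMPLE_FORBIDDEN_FOLDERS` are names made up for this example, and the folder values are an assumption based on the PR description (the real `FORBIDDEN_FOLDERS` constant is imported from `.utils` and its contents are not shown in this diff).

```python
from typing import Iterable

# Assumed values for illustration; the real constant lives in huggingface_hub.utils.
EXAMPLE_FORBIDDEN_FOLDERS = (".git", ".huggingface")


def validate_path_in_repo(
    path_in_repo: str, forbidden_folders: Iterable[str] = EXAMPLE_FORBIDDEN_FOLDERS
) -> str:
    """Normalize a repo path and reject it if any component is a forbidden folder."""
    if path_in_repo.startswith("./"):
        path_in_repo = path_in_repo[2:]
    parts = path_in_repo.split("/")
    for forbidden in forbidden_folders:
        # Compare whole path components, so "my.git/file" is still allowed.
        if forbidden in parts:
            raise ValueError(
                f"cannot update files under a '{forbidden}/' folder (path: '{path_in_repo}')"
            )
    return path_in_repo


print(validate_path_in_repo("./data/train.csv"))
# data/train.csv
```

Matching on whole components rather than substrings keeps names such as `my.git/notes.txt` valid while still rejecting `models/.git/config`.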


4 changes: 2 additions & 2 deletions src/huggingface_hub/_commit_scheduler.py
@@ -9,7 +9,7 @@
from threading import Lock, Thread
from typing import Dict, List, Optional, Union

from .hf_api import IGNORE_GIT_FOLDER_PATTERNS, CommitInfo, CommitOperationAdd, HfApi
from .hf_api import DEFAULT_IGNORE_PATTERNS, CommitInfo, CommitOperationAdd, HfApi
from .utils import filter_repo_objects


@@ -107,7 +107,7 @@ def __init__(
            ignore_patterns = []
        elif isinstance(ignore_patterns, str):
            ignore_patterns = [ignore_patterns]
        self.ignore_patterns = ignore_patterns + IGNORE_GIT_FOLDER_PATTERNS
        self.ignore_patterns = ignore_patterns + DEFAULT_IGNORE_PATTERNS

        if self.folder_path.is_file():
            raise ValueError(f"'folder_path' must be a directory, not a file: '{self.folder_path}'.")
224 changes: 224 additions & 0 deletions src/huggingface_hub/_local_folder.py
@@ -0,0 +1,224 @@
# coding=utf-8
# Copyright 2024-present, the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains utilities to handle the `../.huggingface` folder in local directories.

First discussed in https://github.com/huggingface/huggingface_hub/issues/1738 to store
download metadata when downloading files from the hub to a local directory (without
using the cache).

./.huggingface folder structure:
[4.0K]  data
├── [4.0K]  .huggingface
│   └── [4.0K]  download
│       ├── [  16]  file.parquet.metadata
│       ├── [  16]  file.txt.metadata
│       └── [4.0K]  folder
│           └── [  16]  file.parquet.metadata
│
├── [6.5G]  file.parquet
├── [1.5K]  file.txt
└── [4.0K]  folder
    └── [  16]  file.parquet


Metadata file structure:
```
# file.txt.metadata
11c5a3d5811f50298f278a704980280950aedb10
a16a55fda99d2f2e7b69cce5cf93ff4ad3049930
1712656091.123

# file.parquet.metadata
11c5a3d5811f50298f278a704980280950aedb10
7c5d3f4b8b76583b422fcb9189ad6c89d5d97a094541ce8932dce3ecabde1421
1712656091.123
}
```
"""

import logging
import os
import time
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path
from typing import Optional

from .utils import WeakFileLock


logger = logging.getLogger(__name__)


@dataclass
class LocalDownloadFilePaths:
    """
    Paths to the files related to a download process in a local dir.

    Returned by `get_local_download_paths`.

    Attributes:
        file_path (`Path`):
            Path where the file will be saved.
        lock_path (`Path`):
            Path to the lock file used to ensure atomicity when reading/writing metadata.
        metadata_path (`Path`):
            Path to the metadata file.
    """

    file_path: Path
    lock_path: Path
    metadata_path: Path

    def incomplete_path(self, etag: str) -> Path:
        """Return the path where a file will be temporarily downloaded before being moved to `file_path`."""
        return self.metadata_path.with_suffix(f".{etag}.incomplete")


@dataclass
class LocalDownloadFileMetadata:
    """
    Metadata about a file in the local directory related to a download process.

    Attributes:
        filename (`str`):
            Path of the file in the repo.
        commit_hash (`str`):
            Commit hash of the file in the repo.
        etag (`str`):
            ETag of the file in the repo. Used to check if the file has changed.
            For LFS files, this is the sha256 of the file. For regular files, it corresponds to the git hash.
        timestamp (`int`):
            Unix timestamp of when the metadata was saved i.e. when the metadata was accurate.
    """

    filename: str
    commit_hash: str
    etag: str
    timestamp: float


@lru_cache(maxsize=128)  # ensure singleton
def get_local_download_paths(local_dir: Path, filename: str) -> LocalDownloadFilePaths:
    """Compute paths to the files related to a download process.

    Folders containing the paths are all guaranteed to exist.

    Args:
        local_dir (`Path`):
            Path to the local directory in which files are downloaded.
        filename (`str`):
            Path of the file in the repo.

    Return:
        [`LocalDownloadFilePaths`]: the paths to the files (file_path, lock_path, metadata_path, incomplete_path).
    """
    # filename is the path in the Hub repository (separated by '/')
    # make sure to have a cross platform transcription
    sanitized_filename = os.path.join(*filename.split("/"))
    if os.name == "nt":
        if sanitized_filename.startswith("..\\") or "\\..\\" in sanitized_filename:
            raise ValueError(
                f"Invalid filename: cannot handle filename '{sanitized_filename}' on Windows. Please ask the repository"
                " owner to rename this file."
            )
    file_path = local_dir / sanitized_filename
    metadata_path = _huggingface_dir(local_dir) / "download" / f"{sanitized_filename}.metadata"
    lock_path = metadata_path.with_suffix(".lock")

    file_path.parent.mkdir(parents=True, exist_ok=True)
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    return LocalDownloadFilePaths(file_path=file_path, lock_path=lock_path, metadata_path=metadata_path)


def read_download_metadata(local_dir: Path, filename: str) -> Optional[LocalDownloadFileMetadata]:
    """Read metadata about a file in the local directory related to a download process.

    Args:
        local_dir (`Path`):
            Path to the local directory in which files are downloaded.
        filename (`str`):
            Path of the file in the repo.

    Return:
        `[LocalDownloadFileMetadata]` or `None`: the metadata if it exists, `None` otherwise.
    """
    paths = get_local_download_paths(local_dir, filename)
    # file_path = local_file_path(local_dir, filename)
    # lock_path, metadata_path = _download_metadata_file_path(local_dir, filename)
    with WeakFileLock(paths.lock_path):
        if paths.metadata_path.exists():
            try:
                with paths.metadata_path.open() as f:
                    commit_hash = f.readline().strip()
                    etag = f.readline().strip()
                    timestamp = float(f.readline().strip())
                    metadata = LocalDownloadFileMetadata(
                        filename=filename,
                        commit_hash=commit_hash,
                        etag=etag,
                        timestamp=timestamp,
                    )
            except Exception as e:
                # remove the metadata file if it is corrupted / not the right format
                logger.warning(
                    f"Invalid metadata file {paths.metadata_path}: {e}. Removing it from disk and continue."
                )
                try:
                    paths.metadata_path.unlink()
                except Exception as e:
                    logger.warning(f"Could not remove corrupted metadata file {paths.metadata_path}: {e}")

            try:
                # check if the file exists and hasn't been modified since the metadata was saved
                stat = paths.file_path.stat()
                if stat.st_mtime <= metadata.timestamp:
                    return metadata
                logger.info(f"Ignored metadata for '{filename}' (outdated). Will re-compute hash.")
            except FileNotFoundError:
                # file does not exist => metadata is outdated
                return None
    return None


def write_download_metadata(local_dir: Path, filename: str, commit_hash: str, etag: str) -> None:
    """Write metadata about a file in the local directory related to a download process.

    Args:
        local_dir (`Path`):
            Path to the local directory in which files are downloaded.
    """
    paths = get_local_download_paths(local_dir, filename)
    with WeakFileLock(paths.lock_path):
        with paths.metadata_path.open("w") as f:
            f.write(f"{commit_hash}\n{etag}\n{time.time()}\n")


@lru_cache()
def _huggingface_dir(local_dir: Path) -> Path:
    """Return the path to the `.huggingface` directory in a local directory."""
    # Wrap in lru_cache to avoid overwriting the .gitignore file if called multiple times
    path = local_dir / ".huggingface"
    path.mkdir(exist_ok=True, parents=True)

    # Create a .gitignore file in the .huggingface directory if it doesn't exist
    # Should be thread-safe enough like this.
    gitignore = path / ".gitignore"
    gitignore_lock = path / ".gitignore.lock"
    if not gitignore.exists():
        with WeakFileLock(gitignore_lock):
            gitignore.write_text("*")
        gitignore_lock.unlink(missing_ok=True)
    return path
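
The three-line metadata format and the freshness check used in this file can be exercised without the Hub using the standalone sketch below. It is a simplified approximation under stated assumptions: `write_metadata` and `read_metadata` are hypothetical names, file locking (`WeakFileLock`) and logging are left out, results are returned as a plain tuple, and unparseable metadata simply yields `None`.

```python
import time
from pathlib import Path
from typing import Optional, Tuple


def write_metadata(metadata_path: Path, commit_hash: str, etag: str) -> None:
    """Write commit hash, etag and the current Unix timestamp, one per line."""
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    metadata_path.write_text(f"{commit_hash}\n{etag}\n{time.time()}\n")


def read_metadata(metadata_path: Path, file_path: Path) -> Optional[Tuple[str, str, float]]:
    """Return (commit_hash, etag, timestamp), or None if missing, corrupted or outdated."""
    try:
        lines = metadata_path.read_text().splitlines()
        commit_hash = lines[0].strip()
        etag = lines[1].strip()
        timestamp = float(lines[2].strip())
    except (FileNotFoundError, IndexError, ValueError):
        # No metadata file, or not in the expected three-line format.
        return None
    try:
        # Trust the metadata only if the file was not modified after it was saved.
        if file_path.stat().st_mtime <= timestamp:
            return commit_hash, etag, timestamp
    except FileNotFoundError:
        # The downloaded file is gone, so the metadata no longer describes anything.
        pass
    return None
```

For example, calling `write_metadata(meta, commit, etag)` right after saving a file and then `read_metadata(meta, file)` returns the stored values; once the file is modified later (its `st_mtime` moves past the stored timestamp), the same call returns `None`, which signals that the hash has to be recomputed.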