add disable_compose config (#46)
* add disable_compose config

* update readme
jdnurme authored Jul 10, 2024
1 parent 9383a14 commit 91881cc
Showing 3 changed files with 16 additions and 3 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -325,20 +325,21 @@ To optimize the download performance of small files, the Accelerated Dataloader
gcloud storage rm --recursive gs://<my-bucket>/dataflux-composed-objects/
```

You can also turn off this behavior by setting the “max_composite_object_size” parameter to 0 when constructing the dataset. Example:

You can turn off this behavior by setting the "disable_compose" parameter to True, or by setting the "max_composite_object_size" parameter to 0 when constructing the dataset. Example:
```python
dataset = dataflux_mapstyle_dataset.DataFluxMapStyleDataset(
project_name=PROJECT_NAME,
bucket_name=BUCKET_NAME,
config=dataflux_mapstyle_dataset.Config(
prefix=PREFIX,
max_composite_object_size=0,
disable_compose=True,
),
)
```

Note that turning off this behavior will cause the training loop to take significantly longer to complete when working with small files.
Note that turning off this behavior may cause the training loop to take significantly longer to complete when working with small files. However, composed download hits QPS and throughput limits at a lower scale than downloading files directly, so you should disable this behavior when running at high multi-node scales where project QPS or throughput limits can be reached even without composed download.
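The same flag applies to the iterable-style dataset. A minimal sketch, assuming the `DataFluxIterableDataset` constructor takes the same `project_name`, `bucket_name`, and `config` arguments as the map-style example above:
```python
# Sketch: disabling composed download for the iterable-style dataset.
# Constructor arguments are assumed to mirror DataFluxMapStyleDataset above.
iterable_dataset = dataflux_iterable_dataset.DataFluxIterableDataset(
    project_name=PROJECT_NAME,
    bucket_name=BUCKET_NAME,
    config=dataflux_iterable_dataset.Config(
        prefix=PREFIX,
        disable_compose=True,
    ),
)
```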


### Soft Delete
To avoid storage charges for retaining the temporary composite objects, consider disabling the [Soft Delete](https://cloud.google.com/storage/docs/soft-delete) retention duration on the bucket.
6 changes: 6 additions & 0 deletions dataflux_pytorch/dataflux_iterable_dataset.py
@@ -44,6 +44,9 @@ class Config:
max_listing_retries: An integer indicating the maximum number of retries
to attempt in case of any Python multiprocessing errors during
GCS objects listing. Default to 3.
disable_compose: A boolean flag indicating whether composed download should be disabled.
Compose should be disabled for highly scaled implementations. Default to False.
"""

def __init__(
@@ -53,12 +56,15 @@ def __init__(
num_processes: int = os.cpu_count(),
prefix: str = None,
max_listing_retries: int = 3,
disable_compose: bool = False,
):
self.sort_listing_results = sort_listing_results
self.max_composite_object_size = max_composite_object_size
self.num_processes = num_processes
self.prefix = prefix
self.max_listing_retries = max_listing_retries
if disable_compose:
self.max_composite_object_size = 0


class DataFluxIterableDataset(data.IterableDataset):
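Based on the constructor shown above, the new flag simply forces the composite object size to 0. A minimal sketch, assuming the collapsed constructor arguments all keep their defaults:
```python
# Sketch: disable_compose=True zeroes out max_composite_object_size,
# which is what turns composed download off.
cfg = dataflux_iterable_dataset.Config(
    prefix="training-data/",  # hypothetical prefix, for illustration only
    disable_compose=True,
)
assert cfg.max_composite_object_size == 0
```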
6 changes: 6 additions & 0 deletions dataflux_pytorch/dataflux_mapstyle_dataset.py
@@ -43,6 +43,9 @@ class Config:
max_listing_retries: An integer indicating the maximum number of retries
to attempt in case of any Python multiprocessing errors during
GCS objects listing. Default to 3.
disable_compose: A boolean flag indicating whether composed download should be disabled.
Compose should be disabled for highly scaled implementations. Default to False.
"""

def __init__(
@@ -53,13 +56,16 @@ def __init__(
prefix: str = None,
max_listing_retries: int = 3,
threads_per_process: int = 1,
disable_compose: bool = False,
):
self.sort_listing_results = sort_listing_results
self.max_composite_object_size = max_composite_object_size
self.num_processes = num_processes
self.prefix = prefix
self.max_listing_retries = max_listing_retries
self.threads_per_process = threads_per_process
if disable_compose:
self.max_composite_object_size = 0


class DataFluxMapStyleDataset(data.Dataset):
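As in the iterable dataset, the two ways of disabling composed download described in the README produce the same configuration. A minimal sketch based on the constructor above, assuming the remaining arguments keep their defaults:
```python
# Sketch: both forms yield a config with composed download turned off.
via_flag = dataflux_mapstyle_dataset.Config(disable_compose=True)
via_size = dataflux_mapstyle_dataset.Config(max_composite_object_size=0)
assert via_flag.max_composite_object_size == via_size.max_composite_object_size == 0
```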
