Multi-worker support in Pytorch Dataset #147

eddyxu · 2022-09-08T18:04:01Z

Closes #145

eddyxu · 2022-09-08T19:06:43Z

g5.4xlarge
Ubuntu 22.04
Number of workers: 8
Epoch: 10
Pytorch: EfficientNet B0

changhiskhan

Can we make it so that max_rows_per_file and max_rows_per_group are passed in explicitly via command-line options in parse_pet.py? The default behavior should be the same as before, so analytics performance stays the same if we need to re-run benchmark numbers.
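A minimal sketch of what such options could look like, assuming `argparse` (the actual CLI library used by parse_pet.py is not shown in this thread). The flag names and the helper functions `build_parser` and `writer_kwargs` are hypothetical. Leaving both flags unset produces no overrides, so the writer keeps whatever defaults it used before:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI with optional overrides for file and row-group sizes."""
    parser = argparse.ArgumentParser(description="Hypothetical parse_pet.py options")
    parser.add_argument(
        "--max-rows-per-file",
        type=int,
        default=None,  # None means "do not override the writer's default"
        help="Maximum rows per output file; omit to keep the previous behavior.",
    )
    parser.add_argument(
        "--max-rows-per-group",
        type=int,
        default=None,  # None means "do not override the writer's default"
        help="Maximum rows per row group; omit to keep the previous behavior.",
    )
    return parser


def writer_kwargs(args: argparse.Namespace) -> dict:
    """Collect only the limits that were given explicitly on the command line."""
    kwargs = {}
    if args.max_rows_per_file is not None:
        kwargs["max_rows_per_file"] = args.max_rows_per_file
    if args.max_rows_per_group is not None:
        kwargs["max_rows_per_group"] = args.max_rows_per_group
    return kwargs
```

The resulting dictionary could then be unpacked into the dataset writer call, so a benchmark run without flags is unchanged while a training-oriented run can pass, for example, `--max-rows-per-file 256 --max-rows-per-group 128`.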

changhiskhan · 2022-09-09T19:58:14Z

python/benchmarks/parse_pet.py

+                partitioning=["split"],
+                existing_data_behavior="overwrite_or_ignore",
+                max_rows_per_group=128,
+                max_rows_per_file=256, # Create enough files for parallism


parallelism

changhiskhan · 2022-09-09T20:01:24Z

python/lance/__init__.py



 def dataset(
    uri: str,
-) -> ds.Dataset:
+) -> ds.FileSystemDataset:


why not leave it more generic? do we use FileSystemDataset-specific APIs?

FileSystemDataset has the dataset.files attribute.

changhiskhan · 2022-09-09T20:05:48Z

python/lance/pytorch/data.py

+        self._files = dataset(self.root).files
+        worker_info = torch.utils.data.get_worker_info()
+        if worker_info:
+            # Split the work using at the files level for now.


So just to check my understanding: this is what's forcing us to split into many smaller files right now, right? Theoretically, if we had some num_rows-based parallelism, we could have both good analytics performance and good training scan performance?
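To make the two strategies in this thread concrete, here is a rough sketch, not the PR's actual implementation. `shard_files` and `shard_row_range` are hypothetical helpers. The first assigns whole files to workers round-robin, which is why a dataset with fewer files than workers leaves some workers idle. The second illustrates the row-based alternative raised in the question above, where each worker takes a contiguous row range regardless of file count.

```python
from typing import List, Sequence, Tuple


def shard_files(files: Sequence[str], worker_id: int, num_workers: int) -> List[str]:
    """File-level split: worker i takes files i, i + n, i + 2n, ..."""
    return list(files[worker_id::num_workers])


def shard_row_range(num_rows: int, worker_id: int, num_workers: int) -> Tuple[int, int]:
    """Row-level split: return the half-open [start, stop) range for one worker.

    Rows are divided as evenly as possible; the first `num_rows % num_workers`
    workers each receive one extra row.
    """
    base, remainder = divmod(num_rows, num_workers)
    start = worker_id * base + min(worker_id, remainder)
    stop = start + base + (1 if worker_id < remainder else 0)
    return start, stop


# Inside an IterableDataset.__iter__, the worker identity would come from
# torch.utils.data.get_worker_info(), which returns None in the main process
# and otherwise an object exposing `id` and `num_workers`:
#
#     info = torch.utils.data.get_worker_info()
#     my_files = files if info is None else shard_files(files, info.id, info.num_workers)
```

With file-level sharding, two files and eight workers means six workers receive an empty list, which is the pressure toward many small files. A row-range split avoids that, at the cost of requiring the reader to seek to an arbitrary row offset.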

eddyxu added 3 commits September 8, 2022 10:58

divide works by files

2b0e15a

comments

afdfe8d

fix black

516e99e

eddyxu self-assigned this Sep 8, 2022

create smaller pet files for parallism

3dcdf45

eddyxu requested a review from changhiskhan September 8, 2022 19:04

eddyxu marked this pull request as ready for review September 8, 2022 19:04

eddyxu added the enhancement, python, and PyTorch labels Sep 8, 2022

isort

15423db

changhiskhan approved these changes Sep 9, 2022

pass max row group size and max file size via cli

b697571

eddyxu force-pushed the lei/pytorch_multiproc branch from 0be314b to b697571 Compare September 9, 2022 20:33

eddyxu merged commit 73855d1 into main Sep 9, 2022

eddyxu deleted the lei/pytorch_multiproc branch September 9, 2022 21:22
