
Multi-worker support in Pytorch Dataset #147

Merged — 6 commits merged into main on Sep 9, 2022
Conversation

eddyxu (Contributor) commented Sep 8, 2022

Closes #145

@eddyxu eddyxu self-assigned this Sep 8, 2022
@eddyxu eddyxu requested a review from changhiskhan September 8, 2022 19:04
@eddyxu eddyxu marked this pull request as ready for review September 8, 2022 19:04
eddyxu (Contributor, Author) commented Sep 8, 2022

[Screenshot 2022-09-08 at 12:06:17 PM]

Instance: g5.4xlarge
OS: Ubuntu 22.04
Number of workers: 8
Epochs: 10
Model: EfficientNet B0 (PyTorch)

eddyxu added the enhancement (New feature or request), python, and PyTorch (PyTorch support) labels on Sep 8, 2022
changhiskhan (Contributor) left a comment:

Can we make it so that max_rows_per_file and max_rows_per_group are passed in explicitly via command-line options in parse_pet.py? The default behavior should be the same as before, so analytics performance stays the same if we need to re-run the benchmark numbers.
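A hypothetical sketch of what the reviewer suggests. The flag names and the wiring into `parse_pet.py` are assumptions, not the actual script's interface; only options the user sets get forwarded, so the previous analytics-oriented defaults are preserved otherwise:

```python
import argparse

# Hypothetical CLI for parse_pet.py; flag names are assumptions.
parser = argparse.ArgumentParser(description="Convert the pet dataset")
parser.add_argument("--max-rows-per-file", type=int, default=None)
parser.add_argument("--max-rows-per-group", type=int, default=None)

# Parse a sample command line (a real script would call parser.parse_args()).
args = parser.parse_args(["--max-rows-per-file", "256", "--max-rows-per-group", "128"])

# Forward only the options the user actually set, so the default
# write behavior (and analytics performance) is unchanged otherwise.
write_kwargs = {}
if args.max_rows_per_file is not None:
    write_kwargs["max_rows_per_file"] = args.max_rows_per_file
if args.max_rows_per_group is not None:
    write_kwargs["max_rows_per_group"] = args.max_rows_per_group

# ds.write_dataset(table, uri, partitioning=["split"], **write_kwargs)
print(write_kwargs)
```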

partitioning=["split"],
existing_data_behavior="overwrite_or_ignore",
max_rows_per_group=128,
max_rows_per_file=256, # Create enough files for parallism
changhiskhan (Contributor) commented on the max_rows_per_file=256 line:

Typo in the comment: "parallism" → "parallelism"



 def dataset(
     uri: str,
-) -> ds.Dataset:
+) -> ds.FileSystemDataset:
changhiskhan (Contributor) commented:

Why not leave it more generic? Do we use FileSystemDataset-specific APIs?

eddyxu (Contributor, Author) replied:

FileSystemDataset has the dataset.files attribute.

self._files = dataset(self.root).files
worker_info = torch.utils.data.get_worker_info()
if worker_info:
    # Split the work at the file level for now.
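The file-level split shown above can be sketched as a small helper. The helper name and the dataset wiring are illustrative, not the actual lance implementation; only `torch.utils.data.get_worker_info()` is the real API:

```python
import math

def shard_files(files, worker_id, num_workers):
    """Return the contiguous slice of `files` one DataLoader worker reads."""
    per_worker = math.ceil(len(files) / num_workers)
    start = worker_id * per_worker
    return files[start:start + per_worker]

# Inside an IterableDataset.__iter__, the split would be driven by
# torch.utils.data.get_worker_info(), which is None in the main process:
#
#     worker_info = torch.utils.data.get_worker_info()
#     files = self._files
#     if worker_info:
#         files = shard_files(files, worker_info.id, worker_info.num_workers)
```

With 6 files and 4 workers, for example, workers 0–2 each get 2 files and worker 3 gets none; every file is read by exactly one worker.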
changhiskhan (Contributor) commented:

Just to check my understanding: this is what's forcing us to split into many smaller files right now, right? Theoretically, if we had some num_rows-based parallelism, we could have both good analytics performance and good training scan performance?

eddyxu (Contributor, Author) replied:

Yes.
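The num_rows-based alternative discussed here would assign each worker a contiguous range of rows rather than whole files, decoupling training parallelism from file size. A hypothetical sketch of the range arithmetic (not part of this PR):

```python
def row_range_for_worker(num_rows, worker_id, num_workers):
    """Contiguous [start, end) row range for one worker.

    The ranges across all workers partition the rows exactly, with
    sizes differing by at most one row.
    """
    base, extra = divmod(num_rows, num_workers)
    start = worker_id * base + min(worker_id, extra)
    end = start + base + (1 if worker_id < extra else 0)
    return start, end
```

For 10 rows and 3 workers this yields the ranges (0, 4), (4, 7), and (7, 10), so no worker is idle even when the dataset is stored in a single large file.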

eddyxu force-pushed the lei/pytorch_multiproc branch from 0be314b to b697571 on September 9, 2022
eddyxu merged commit 73855d1 into main on Sep 9, 2022
eddyxu deleted the lei/pytorch_multiproc branch on September 9, 2022
Successfully merging this pull request may close: Pytorch DataLoader multi-process support (#145)