
Add support for hf datasets reader #490

Closed · wants to merge 23 commits

Conversation

@msaroufim (Member) commented Jun 1, 2022:

In working on #422, I realized we needed an easier way to run benchmarks on larger datasets that may not be available in the domain libraries. This PR is a prerequisite for any sort of scaling benchmarks and is generally a useful reader for the community.

Usage

```python
>>> from torchdata.datapipes.iter import HuggingFaceHubReaderIterDataPipe
>>> huggingface_reader_dp = HuggingFaceHubReaderIterDataPipe("lhoestq/demo1", revision="main")
>>> elem = next(iter(huggingface_reader_dp))
>>> elem["package_name"]
'com.mantz_it.rfanalyzer'
```

The Hugging Face Hub (https://huggingface.co/docs/datasets/how_to) has about 10,000 datasets we can leverage out of the box; in particular, mc4 is one we need to prove out large-scale benchmarks for text.
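As a sketch of what that could look like (hypothetical usage; this assumes the reader forwards `name`, `split`, and `streaming` through to `datasets.load_dataset`, and uses the `mc4` dataset's `"en"` config):

```python
>>> from torchdata.datapipes.iter import HuggingFaceHubReaderIterDataPipe
>>> # Stream mc4 instead of downloading it, so the first element arrives quickly
>>> mc4_dp = HuggingFaceHubReaderIterDataPipe("mc4", name="en", split="train", streaming=True)
>>> sample = next(iter(mc4_dp))
>>> sample["text"][:40]  # first characters of the first streamed document
```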

See the test and docstring for usage instructions.

Changes

  • Added a new HuggingFaceHubReaderIterDataPipe so we can load a large number of datasets for performance benchmarks
  • Added a test which is skipped if the `datasets` library is not installed (see the skip sketch below)
  • pytest passes
  • Got rid of StreamWrapper
  • Is there any documentation update I should make?
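The skip guard mentioned above follows the usual optional-dependency pattern, along these lines (a sketch; the actual test name and assertions in the PR may differ):

```python
import unittest

try:
    import datasets
except ImportError:
    datasets = None


@unittest.skipIf(datasets is None, "Requires the `datasets` package")
class TestHuggingFaceHubReader(unittest.TestCase):
    def test_returns_first_element(self) -> None:
        from torchdata.datapipes.iter import HuggingFaceHubReaderIterDataPipe

        # Values taken from the usage example above
        dp = HuggingFaceHubReaderIterDataPipe("lhoestq/demo1", revision="main")
        elem = next(iter(dp))
        self.assertEqual(elem["package_name"], "com.mantz_it.rfanalyzer")
```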

Useful when we start doing large scale benchmarks
@facebook-github-bot added the CLA Signed label on Jun 1, 2022.
@ejguan (Contributor) left a comment:

Could you please also add `datasets` to https://github.com/pytorch/data/blob/main/test/requirements.txt to make sure the test case is actually run?

```python
try:
    import datasets
except ImportError:
    datasets = None


def _get_response_from_huggingface_hub(dataset, split, revision, streaming, data_files) -> Tuple[Any, StreamWrapper]:
```
Contributor:
The types of the function arguments are missing; let's update them. Also, I'm curious why we chose these five arguments, since the original HF load_dataset API has many more.
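For illustration, annotating the shown signature might look like this (a sketch; the `Optional` defaults are assumptions, and the return type matches the snippet above from before `StreamWrapper` was removed):

```python
from typing import Any, Dict, List, Optional, Tuple, Union

from torch.utils.data.datapipes.utils.common import StreamWrapper


def _get_response_from_huggingface_hub(
    dataset: str,
    split: Optional[str] = None,
    revision: Optional[str] = None,
    streaming: bool = True,
    data_files: Optional[Union[str, List[str], Dict[str, str]]] = None,
) -> Tuple[Any, StreamWrapper]:
    ...
```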

@msaroufim (Member, Author) commented Jun 2, 2022:
These seemed the most useful to me; I can add **kwargs so we can support all parameters. I will add types next.

@msaroufim (Member, Author) commented:
OK, I don't think we need **kwargs; I added all the parameters I could find listed in the main documentation: https://huggingface.co/docs/datasets/loading
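Concretely, the explicit-parameter version would forward each documented loading argument one-to-one, roughly like this (a sketch based on the `datasets.load_dataset` docs; the exact set of forwarded arguments is an assumption):

```python
import datasets


def _get_response_from_huggingface_hub(
    dataset, name=None, data_dir=None, data_files=None,
    split=None, revision=None, streaming=True, use_auth_token=None,
):
    # Each documented load_dataset argument is passed through explicitly
    return datasets.load_dataset(
        path=dataset,
        name=name,
        data_dir=data_dir,
        data_files=data_files,
        split=split,
        revision=revision,
        streaming=streaming,
        use_auth_token=use_auth_token,
    )
```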

@ninginthecloud (Contributor) commented:
@msaroufim Thank you so much for working on this PR! As you mentioned, there are many existing datasets on HF; this pipe will be quite handy for other researchers as well~

@msaroufim (Member, Author) commented:

> @msaroufim Thank you so much for working on this PR! As you mentioned, there are many existing datasets on HF; this pipe will be quite handy for other researchers as well~

Can you please take one last look? Not sure why I can't click re-request review on your name in the GitHub UI.

@ninginthecloud (Contributor) left a comment:
@msaroufim Hi Mark, the PR looks good to me now~ There's one minor type fix; see the comment for details.

@facebook-github-bot (Contributor) commented:
@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@NivekT (Contributor) left a comment:
Overall, LGTM! Can you check the CI's linting error?
Also, is this serializable? Can you add it to test_serialization? I think it should work.
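A minimal serializability check is a pickle round-trip, something like this (hypothetical test body; the PR's actual entry goes through test_serialization's shared harness):

```python
import pickle

from torchdata.datapipes.iter import HuggingFaceHubReaderIterDataPipe

dp = HuggingFaceHubReaderIterDataPipe("lhoestq/demo1", revision="main")
# Should rebuild the pipe from its constructor arguments without iterating it
restored = pickle.loads(pickle.dumps(dp))
assert type(restored) is type(dp)
```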

@facebook-github-bot (Contributor) commented:
@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@VitalyFedyunin (Contributor) left a comment:
LGTM, but please drop `__len__` and make it unavailable (see the `filter` pipe as an example).
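Concretely, the `filter` pipe's convention is to raise rather than return a length, roughly like this (a sketch of that pattern):

```python
def __len__(self) -> int:
    # A streaming source has no knowable length, so refuse rather than guess
    raise TypeError(f"{type(self).__name__} instance doesn't have valid length")
```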

@msaroufim (Member, Author) commented:
Another TODO to self: test_serialization.py is failing because it tries to import `datasets` but can't find it. Need to fix this.

@NivekT (Contributor) commented Jun 6, 2022:

> Another TODO to self: test_serialization.py is failing because it tries to import `datasets` but can't find it. Need to fix this.

Let's add an entry here and we should be good to go:

```python
def _filter_by_module_availability(datapipes):
    filter_set = set()
    if fsspec is None:
        filter_set.update([iterdp.FSSpecFileLister, iterdp.FSSpecFileOpener, iterdp.FSSpecSaver])
    if iopath is None:
        filter_set.update([iterdp.IoPathFileLister, iterdp.IoPathFileOpener, iterdp.IoPathSaver])
    if rarfile is None:
        filter_set.update([iterdp.RarArchiveLoader])
    if torcharrow is None or not DILL_AVAILABLE:
        filter_set.update([iterdp.DataFrameMaker, iterdp.ParquetDataFrameLoader])
    return [dp for dp in datapipes if dp[0] not in filter_set]
```
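The new entry would follow the same pattern, added before the `return` (assuming the pipe is exposed on `iterdp` as `HuggingFaceHubReader`):

```python
    if datasets is None:
        filter_set.update([iterdp.HuggingFaceHubReader])
```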

@facebook-github-bot (Contributor) commented:
@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@NivekT (Contributor) commented Jun 7, 2022:
@ejguan Should we cherry-pick this or leave it for the next release?

@ejguan (Contributor) commented Jun 7, 2022:

> @ejguan Should we cherry-pick this or leave it for the next release?

I am fine with cherry-picking it to the release.

NivekT pushed a commit that referenced this pull request Jun 7, 2022
Summary:
In working on #422 I realized we needed an easier way to run benchmarks on larger datasets that may not be available on domain libraries. This PR is a prerequisite to any sort of scaling benchmarks and is generally a useful reader for the community.

https://huggingface.co/docs/datasets/how_to has about 10,000 datasets we can leverage out of the box - in particular mc4 is one we need to prove out large scale benchmarks for text

See test and docstring for usage instructions

### Changes

- Added a new `HuggingFaceHubReaderIterDataPipe` so we can load a large number of datasets for performance benchmarks
- Added a test which is skipped if `datasets` library does not exist
- pytest passes
- Got rid of `StreamWrapper`
- Is there any documentation update I should make?

Pull Request resolved: #490

Reviewed By: NivekT, ninginthecloud

Differential Revision: D36910175

Pulled By: msaroufim

fbshipit-source-id: 3ce2d5bc0ad46b626baa87b59930a3c6f5361425
NivekT added a commit that referenced this pull request Jun 7, 2022
Co-authored-by: Mark Saroufim <[email protected]>
@andrewkho deleted the msaroufim-patch-2 branch on September 26, 2024.