Add support for hf datasets reader #490
Conversation
Useful when we start doing large scale benchmarks
Could you please also add `datasets` to https://github.com/pytorch/data/blob/main/test/requirements.txt to make sure the test case actually runs in CI?
```python
try:
    import datasets
except ImportError:
    datasets = None


def _get_response_from_huggingface_hub(dataset, split, revision, streaming, data_files) -> Tuple[Any, StreamWrapper]:
```
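The guarded import above is the usual optional-dependency pattern: import the package if available, otherwise record `None` and fail later with a clear message. A minimal, self-contained sketch of the same idea, using a stdlib module and a deliberately nonexistent package as stand-ins for `datasets`:

```python
import importlib


def optional_import(name):
    """Return the module if importable, else None (same guard as above)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


json_mod = optional_import("json")            # stdlib module: present
missing = optional_import("no_such_pkg_xyz")  # absent package: None
```

Keeping the sentinel as `None` lets the DataPipe be defined unconditionally while the actual error is raised only when someone tries to use it without the dependency installed.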
The types of the function arguments are missing; let's add them.
Also, I'm curious why we chose these 5 arguments, since the original HF `load_dataset` API has a lot more.
These seemed the most useful to me. I can add `**kwargs` so we can support all parameters, and I will add types next.
OK, I don't think we need kwargs; I added all the parameters I could find listed in the main documentation: https://huggingface.co/docs/datasets/loading
There are 16. @ninginthecloud, do you think those 5 are sufficient?
https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset
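One way to keep a handful of named, typed arguments while still accepting the rest of `load_dataset`'s parameters is a `**kwargs` pass-through. A hypothetical sketch of that signature style (the function name and types are illustrative, not the PR's final code):

```python
from typing import Any, Dict, List, Optional, Union


def build_load_kwargs(
    dataset: str,
    split: Optional[str] = None,
    revision: Optional[str] = None,
    streaming: bool = False,
    data_files: Optional[Union[str, List[str], Dict[str, str]]] = None,
    **extra: Any,
) -> Dict[str, Any]:
    # Collect the five named arguments plus any extras into one dict,
    # ready to forward as datasets.load_dataset(dataset, **kwargs).
    kwargs: Dict[str, Any] = {
        "split": split,
        "revision": revision,
        "streaming": streaming,
        "data_files": data_files,
    }
    kwargs.update(extra)
    return kwargs


# Extra keyword arguments (e.g. a config name) ride along untouched.
kw = build_load_kwargs("mc4", split="train", streaming=True, name="en")
```

This keeps the common arguments discoverable in the signature while any of the remaining upstream parameters can still be passed through.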
@msaroufim Thank you so much for working on this PR! As you mentioned, there are many existing datasets on HF; this pipe will be quite handy for other researchers as well~
Can you please take one last look? Not sure why I can't click re-request review on your name in the GitHub UI.
@msaroufim Hi Mark, the PR looks good to me now~ There's one minor type fix; see the comment for details.
@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Overall, LGTM! Can you check the CI's linting error?
Also, is this serializable? Can you add it to `test_serialization`? I think it should work.
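For context, "serializable" here means the DataPipe must survive a `pickle` round trip, which is essentially what the serialization test checks. A minimal stand-in sketch (the class below is hypothetical, not the PR's actual DataPipe):

```python
import pickle


class HubReaderStub:
    # Picklable because all state is plain data; holding an open stream
    # or a live generator as an attribute would break pickling.
    def __init__(self, dataset, split=None, streaming=True):
        self.dataset = dataset
        self.split = split
        self.streaming = streaming

    def __eq__(self, other):
        return vars(self) == vars(other)


dp = HubReaderStub("mc4", split="train")
restored = pickle.loads(pickle.dumps(dp))
```

As long as the reader only stores its constructor arguments and defers the actual `load_dataset` call to iteration time, the round trip should succeed.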
LGTM, but please drop `length` and make it unavailable (see `filter` as an example).
Another TODO to self is
Let's add an entry here and we should be good to go: `data/test/test_serialization.py`, lines 75 to 85 at 7b5e7d9.
@ejguan Should we cherry-pick this or leave it for the next release?
I am fine with cherry-picking it to release.
Summary: In working on #422 I realized we needed an easier way to run benchmarks on larger datasets that may not be available in domain libraries. This PR is a prerequisite to any sort of scaling benchmarks and is generally a useful reader for the community. https://huggingface.co/docs/datasets/how_to has about 10,000 datasets we can leverage out of the box; in particular, mc4 is one we need to prove out large-scale benchmarks for text. See the test and docstring for usage instructions.

### Changes

- Added a new `HuggingFaceHubReaderIterDataPipe` so we can load a large number of datasets for performance benchmarks
- Added a test which is skipped if the `datasets` library does not exist
- pytest passes
- Got rid of `StreamWrapper`
- Is there any documentation update I should make?

Pull Request resolved: #490
Reviewed By: NivekT, ninginthecloud
Differential Revision: D36910175
Pulled By: msaroufim
fbshipit-source-id: 3ce2d5bc0ad46b626baa87b59930a3c6f5361425
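Since mc4-scale datasets are far too large to materialize, a streaming benchmark typically pulls only a bounded sample from the example iterator. A small sketch of that consumption pattern, with a generator standing in for a streamed HF dataset:

```python
from itertools import islice


def take(stream, n):
    # Pull the first n examples from a (possibly unbounded) example stream
    # without ever holding the full dataset in memory.
    return list(islice(stream, n))


# Stand-in for a streamed dataset: each example is a dict of fields.
fake_stream = ({"text": f"doc {i}"} for i in range(10_000))
sample = take(fake_stream, 3)
```

The same `take`-style truncation is how a benchmark can time throughput over, say, the first N examples of a streamed dataset regardless of its total size.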