
Add support for hf datasets reader #490

Closed · wants to merge 23 commits

Conversation

@msaroufim (Member) commented Jun 1, 2022:

In working on #422, I realized we needed an easier way to run benchmarks on larger datasets that may not be available in the domain libraries. This PR is a prerequisite for any sort of scaling benchmarks and is generally a useful reader for the community.

Usage

```python
>>> from torchdata.datapipes.iter import HuggingFaceHubReaderIterDataPipe
>>> huggingface_reader_dp = HuggingFaceHubReaderIterDataPipe("lhoestq/demo1", revision="main")
>>> elem = next(iter(huggingface_reader_dp))
>>> elem["package_name"]
'com.mantz_it.rfanalyzer'
```

The Hugging Face Hub (https://huggingface.co/docs/datasets/how_to) has about 10,000 datasets we can leverage out of the box; in particular, mc4 is one we need to prove out large-scale benchmarks for text.
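As a sketch of what that could look like (hypothetical usage; this assumes the reader forwards `name`, `split`, and `streaming` through to `datasets.load_dataset`, and uses the `mc4` dataset's `"en"` config):

```python
>>> from torchdata.datapipes.iter import HuggingFaceHubReaderIterDataPipe
>>> # Stream mc4 instead of downloading it, so the first element arrives quickly
>>> mc4_dp = HuggingFaceHubReaderIterDataPipe("mc4", name="en", split="train", streaming=True)
>>> sample = next(iter(mc4_dp))
>>> sample["text"][:40]  # first characters of the first streamed document
```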

See the test and docstring for usage instructions.

Changes

  • Added a new HuggingFaceHubReaderIterDataPipe so we can load a large number of datasets for performance benchmarks
  • Added a test which is skipped if the `datasets` library is not installed (see the skip sketch below)
  • pytest passes
  • Got rid of StreamWrapper
  • Is there any documentation update I should make?
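The skip guard mentioned above follows the usual optional-dependency pattern, along these lines (a sketch; the actual test name and assertions in the PR may differ):

```python
import unittest

try:
    import datasets
except ImportError:
    datasets = None


@unittest.skipIf(datasets is None, "Requires the `datasets` package")
class TestHuggingFaceHubReader(unittest.TestCase):
    def test_returns_first_element(self) -> None:
        from torchdata.datapipes.iter import HuggingFaceHubReaderIterDataPipe

        # Values taken from the usage example above
        dp = HuggingFaceHubReaderIterDataPipe("lhoestq/demo1", revision="main")
        elem = next(iter(dp))
        self.assertEqual(elem["package_name"], "com.mantz_it.rfanalyzer")
```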

Useful when we start doing large scale benchmarks
@facebook-github-bot added the CLA Signed label on Jun 1, 2022.
@ejguan (Contributor) left a comment:

Could you please also add `datasets` to https://github.com/pytorch/data/blob/main/test/requirements.txt to make sure the test case is actually run?

```python
try:
    import datasets
except ImportError:
    datasets = None


def _get_response_from_huggingface_hub(dataset, split, revision, streaming, data_files) -> Tuple[Any, StreamWrapper]:
```
Contributor:
The types of the function arguments are missing; let's update them. Also, I'm curious why we chose these five arguments, since the original HF load_dataset API has many more.
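For illustration, annotating the shown signature might look like this (a sketch; the `Optional` defaults are assumptions, and the return type matches the snippet above from before `StreamWrapper` was removed):

```python
from typing import Any, Dict, List, Optional, Tuple, Union

from torch.utils.data.datapipes.utils.common import StreamWrapper


def _get_response_from_huggingface_hub(
    dataset: str,
    split: Optional[str] = None,
    revision: Optional[str] = None,
    streaming: bool = True,
    data_files: Optional[Union[str, List[str], Dict[str, str]]] = None,
) -> Tuple[Any, StreamWrapper]:
    ...
```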

@msaroufim (Member, Author) commented Jun 2, 2022:
These seemed the most useful to me; I can add **kwargs so we can support all parameters. I will add types next.

@msaroufim (Member, Author) commented:
OK, I don't think we need **kwargs; I added all the parameters I could find listed in the main documentation: https://huggingface.co/docs/datasets/loading
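Concretely, the explicit-parameter version would forward each documented loading argument one-to-one, roughly like this (a sketch based on the `datasets.load_dataset` docs; the exact set of forwarded arguments is an assumption):

```python
import datasets


def _get_response_from_huggingface_hub(
    dataset, name=None, data_dir=None, data_files=None,
    split=None, revision=None, streaming=True, use_auth_token=None,
):
    # Each documented load_dataset argument is passed through explicitly
    return datasets.load_dataset(
        path=dataset,
        name=name,
        data_dir=data_dir,
        data_files=data_files,
        split=split,
        revision=revision,
        streaming=streaming,
        use_auth_token=use_auth_token,
    )
```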

@ninginthecloud (Contributor) commented:
@msaroufim Thank you so much for working on this PR! As you mentioned, there are many existing datasets on HF; this pipe will be quite handy for other researchers as well~

@msaroufim (Member, Author) commented:

> @msaroufim Thank you so much for working on this PR! As you mentioned, there are many existing datasets on HF; this pipe will be quite handy for other researchers as well~

Can you please take one last look? Not sure why I can't click re-request review on your name in the GitHub UI.

@ninginthecloud (Contributor) left a comment:
@msaroufim Hi Mark, the PR looks good to me now~ There's one minor type fix; see the comment for details.

@facebook-github-bot (Contributor) commented:
@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@NivekT (Contributor) left a comment:
Overall, LGTM! Can you check the CI's linting error?
Also, is this serializable? Can you add it to test_serialization? I think it should work.
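A minimal serializability check is a pickle round-trip, something like this (hypothetical test body; the PR's actual entry goes through test_serialization's shared harness):

```python
import pickle

from torchdata.datapipes.iter import HuggingFaceHubReaderIterDataPipe

dp = HuggingFaceHubReaderIterDataPipe("lhoestq/demo1", revision="main")
# Should rebuild the pipe from its constructor arguments without iterating it
restored = pickle.loads(pickle.dumps(dp))
assert type(restored) is type(dp)
```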

@facebook-github-bot (Contributor) commented:
@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@VitalyFedyunin (Contributor) left a comment:
LGTM, but please drop `__len__` and make it unavailable (see the `filter` pipe as an example).
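Concretely, the `filter` pipe's convention is to raise rather than return a length, roughly like this (a sketch of that pattern):

```python
def __len__(self) -> int:
    # A streaming source has no knowable length, so refuse rather than guess
    raise TypeError(f"{type(self).__name__} instance doesn't have valid length")
```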

@msaroufim (Member, Author) commented:
Another TODO to self: test_serialization.py is failing because it tries to import `datasets` but can't find it. Need to fix this.

@NivekT (Contributor) commented Jun 6, 2022:

> Another TODO to self: test_serialization.py is failing because it tries to import `datasets` but can't find it. Need to fix this.

Let's add an entry here and we should be good to go:

```python
def _filter_by_module_availability(datapipes):
    filter_set = set()
    if fsspec is None:
        filter_set.update([iterdp.FSSpecFileLister, iterdp.FSSpecFileOpener, iterdp.FSSpecSaver])
    if iopath is None:
        filter_set.update([iterdp.IoPathFileLister, iterdp.IoPathFileOpener, iterdp.IoPathSaver])
    if rarfile is None:
        filter_set.update([iterdp.RarArchiveLoader])
    if torcharrow is None or not DILL_AVAILABLE:
        filter_set.update([iterdp.DataFrameMaker, iterdp.ParquetDataFrameLoader])
    return [dp for dp in datapipes if dp[0] not in filter_set]
```
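The new entry would follow the same pattern, added before the `return` (assuming the pipe is exposed on `iterdp` as `HuggingFaceHubReader`):

```python
    if datasets is None:
        filter_set.update([iterdp.HuggingFaceHubReader])
```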

@facebook-github-bot (Contributor) commented:
@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@NivekT (Contributor) commented Jun 7, 2022:
@ejguan Should we cherry-pick this or leave it for the next release?

@ejguan (Contributor) commented Jun 7, 2022:

> @ejguan Should we cherry-pick this or leave it for the next release?

I am fine with cherry-picking it to the release.

NivekT pushed a commit that referenced this pull request Jun 7, 2022
Summary:
In working on #422 I realized we needed an easier way to run benchmarks on larger datasets that may not be available on domain libraries. This PR is a prerequisite to any sort of scaling benchmarks and is generally a useful reader for the community.

https://huggingface.co/docs/datasets/how_to has about 10,000 datasets we can leverage out of the box - in particular mc4 is one we need to prove out large scale benchmarks for text

See test and docstring for usage instructions

### Changes

- Added a new `HuggingFaceHubReaderIterDataPipe` so we can load a large number of datasets for performance benchmarks
- Added a test which is skipped if `datasets` library does not exist
- pytest passes
- Got rid of `StreamWrapper`
- Is there any documentation update I should make?

Pull Request resolved: #490

Reviewed By: NivekT, ninginthecloud

Differential Revision: D36910175

Pulled By: msaroufim

fbshipit-source-id: 3ce2d5bc0ad46b626baa87b59930a3c6f5361425
NivekT added a commit that referenced this pull request Jun 7, 2022
Co-authored-by: Mark Saroufim <[email protected]>
@andrewkho deleted the msaroufim-patch-2 branch on September 26, 2024.