Create endpoint /dataset-info #670
Conversation
It's only the start, but I'm interested in your opinion on https://github.com/huggingface/datasets-server/pull/670/files#diff-0f066cc0774e19de939dd3c15c9b224c193fe83b71468cdb33315fce49a45ddfR27-R34, @huggingface/datasets
Codecov Report — Base: 90.67% // Head: 91.07% // Increases project coverage by +0.40%

@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
+ Coverage   90.67%   91.07%   +0.40%
==========================================
  Files          38       27      -11
  Lines        2648     1849     -799
==========================================
- Hits         2401     1684     -717
+ Misses        247      165      -82
See internal discussion at https://huggingface.slack.com/archives/C0311GZ7R6K/p1672052009820369
Should we use Dask (https://docs.dask.org/en/stable/generated/dask.dataframe.read_parquet.html) to read the parquet file, or is it better to use pyarrow directly?
Awesome!!!
I like your idea of using the metadata in the Parquet footer.
Additionally, it contains the features info as extra metadata ;)
>>> json.loads(metadata.metadata[b"huggingface"])
{'info': {'features': {'sentence1': {'dtype': 'string', '_type': 'Value'},
'sentence2': {'dtype': 'string', '_type': 'Value'},
'idx': {'dtype': 'int32', '_type': 'Value'},
'label': {'names': ['entailment', 'not_entailment'],
'_type': 'ClassLabel'}}}}
REVISION = "refs/convert/parquet"
fs = hffs.HfFileSystem(self.dataset, repo_type="dataset", revision=REVISION)
metadata = pq.read_metadata(f"{self.config}/{self.filename}", filesystem=fs)
# ^ are we streaming to only read the metadata in the footer, or is the whole parquet file read?
are we streaming to only read the metadata in the footer, or is the whole parquet file read?
I think pq.read_metadata only reads the metadata. However, if no filesystem is passed, it only accepts a string pointing to a local path (not a remote URL) or a file-like object.
Our datasets streaming mode calls fsspec under the hood, analogously to:
url = "https://huggingface.co/datasets/super_glue/resolve/refs%2Fconvert%2Fparquet/axb/super_glue-test.parquet"
with fsspec.open(url) as f:
metadata = pq.read_metadata(f)
it reads the remote parquet files (from the Hub) with hffs and pyarrow.
using libcommon 0.6.2, we implement get_new_splits to be able to create the children jobs. Also: ensure the type of the config and split (str) in /splits
and also update the libraries to fix vulnerabilities (torch, gitpython)
8e878c1 to a6820f0
use only one definition. Also: remove the ignored vulnerabilities, since the dependencies have been updated
we will add "stats" with more details in the parquet worker. BREAKING CHANGE: 🧨 change the /splits response (num_bytes and num_examples are removed)
It's not very efficient, but we stay in the same architecture model. So: we first get the list of parquet files and the dataset-info for each config, then we copy each part to its own response
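A minimal sketch of that copy step, assuming per-config upstream responses (the shapes and the split_by_config name are illustrative, not the actual worker API):

```python
# Hypothetical aggregation: given the list of parquet files and the
# dataset-info for each config, copy each config's part into its own response.
def split_by_config(parquet_files, dataset_info):
    responses = {}
    for config, info in dataset_info.items():
        responses[config] = {
            "parquet_files": [f for f in parquet_files if f["config"] == config],
            "dataset_info": info,
        }
    return responses

files = [{"config": "axb", "filename": "super_glue-test.parquet"}]
infos = {"axb": {"features": {}}}
responses = split_by_config(files, infos)
print(responses["axb"]["parquet_files"][0]["filename"])
# → super_glue-test.parquet
```

The duplication is the inefficiency mentioned above: each config's data appears both in the aggregate and in its own response.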
it does not make sense to have them in libcommon, since we will come back to only one generic "worker"
we don't check for its value anyway
To test only one subpath, e.g.: TEST_PATH=tests/test_one.py make test