Create endpoint /dataset-info #670
Conversation
It's only the start, but I'm interested in your opinion on https://github.com/huggingface/datasets-server/pull/670/files#diff-0f066cc0774e19de939dd3c15c9b224c193fe83b71468cdb33315fce49a45ddfR27-R34, @huggingface/datasets
Codecov Report — Base: 90.67% // Head: 91.07% // Increases project coverage by +0.40%

@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
+ Coverage   90.67%   91.07%   +0.40%
==========================================
  Files          38       27      -11
  Lines        2648     1849     -799
==========================================
- Hits         2401     1684     -717
+ Misses        247      165      -82
See internal discussion at https://huggingface.slack.com/archives/C0311GZ7R6K/p1672052009820369
Should we use Dask (https://docs.dask.org/en/stable/generated/dask.dataframe.read_parquet.html) to read the parquet file, or is it better to use pyarrow directly?
Awesome!!!
I like your idea of using the metadata in the Parquet footer.
Additionally, it contains the features info as extra metadata ;)
>>> json.loads(metadata.metadata[b"huggingface"])
{'info': {'features': {'sentence1': {'dtype': 'string', '_type': 'Value'},
'sentence2': {'dtype': 'string', '_type': 'Value'},
'idx': {'dtype': 'int32', '_type': 'Value'},
'label': {'names': ['entailment', 'not_entailment'],
'_type': 'ClassLabel'}}}}
REVISION = "refs/convert/parquet"
fs = hffs.HfFileSystem(self.dataset, repo_type="dataset", revision=REVISION)
metadata = pq.read_metadata(f"{self.config}/{self.filename}", filesystem=fs)
# ^ are we streaming to only read the metadata in the footer, or is the whole parquet file read?
are we streaming to only read the metadata in the footer, or is the whole parquet file read?
I think pq.read_metadata only reads the metadata. However, if no filesystem is passed, it only accepts a string pointing to a local path (not a remote URL) or a file-like object.
Our datasets streaming mode calls fsspec under the hood, analogously to:
url = "https://huggingface.co/datasets/super_glue/resolve/refs%2Fconvert%2Fparquet/axb/super_glue-test.parquet"
with fsspec.open(url) as f:
metadata = pq.read_metadata(f)
it reads the remote parquet files (from the Hub) with hffs and pyarrow.
using libcommon 0.6.2, we implement get_new_splits to be able to create the children jobs. Also: ensure the type of the config and split (str) in /splits
and also update the libraries to fix vulnerabilities (torch, gitpython)
8e878c1 to a6820f0
use only one definition. Also: remove the ignored vulnerabilities, since the dependencies have been updated
we will add "stats" with more details in the parquet worker. BREAKING CHANGE: 🧨 change the /splits response (num_bytes and num_examples are removed)
It's not very efficient, but we stay in the same architecture model. So: we first get the list of parquet files and the dataset-info for each config, then we copy each part to its own response
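A minimal sketch of that copy step, assuming per-config upstream responses (the shapes and the split_by_config name are illustrative, not the actual worker API):

```python
# Hypothetical aggregation: given the list of parquet files and the
# dataset-info for each config, copy each config's part into its own response.
def split_by_config(parquet_files, dataset_info):
    responses = {}
    for config, info in dataset_info.items():
        responses[config] = {
            "parquet_files": [f for f in parquet_files if f["config"] == config],
            "dataset_info": info,
        }
    return responses

files = [{"config": "axb", "filename": "super_glue-test.parquet"}]
infos = {"axb": {"features": {}}}
responses = split_by_config(files, infos)
print(responses["axb"]["parquet_files"][0]["filename"])
# → super_glue-test.parquet
```

The duplication is the inefficiency mentioned above: each config's data appears both in the aggregate and in its own response.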
it does not make sense to have them in libcommon, since we will come back to only one generic "worker"
we don't check for its value anyway
To test only one subpath, e.g.: TEST_PATH=tests/test_one.py make test