Implement generic processing steps #650

Merged: 69 commits into main on Nov 28, 2022
Conversation

@severo (Collaborator) commented on Nov 28, 2022

Generic implementation of a processing graph

Remove explicit mentions of /splits and /first-rows from the code, and move them to the "processing graph":

{
  "/splits": {"input_type": "dataset", "required_by_dataset_viewer": true},
  "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": true}
}

This JSON (see libcommon.config) defines the processing steps (here, /splits and /first-rows) and their dependency relationships (here, /first-rows depends on /splits). It also defines whether a processing step is required by the Hub dataset viewer (used to fill /valid and /is-valid).
A processing step is identified by its endpoint (/splits, /first-rows), from which the result of the step can be downloaded. The endpoint value is also used as the cache key and the job type.
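To make the dependency relationship concrete, here is a minimal sketch (not the actual libcommon code) of how such a graph can be traversed in dependency order, using Python's standard-library `graphlib`. The `processing_order` helper is an illustrative name:

```python
from graphlib import TopologicalSorter

# Mirror of the processing graph above, as a Python dict.
PROCESSING_GRAPH = {
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
    "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": True},
}

def processing_order(graph: dict) -> list:
    """Return the steps in an order that respects the 'requires' edges."""
    ts = TopologicalSorter()
    for step, spec in graph.items():
        deps = [spec["requires"]] if spec.get("requires") else []
        ts.add(step, *deps)
    return list(ts.static_order())

print(processing_order(PROCESSING_GRAPH))  # ['/splits', '/first-rows']
```

Because the graph is plain data, a scheduler can walk it generically: a step is only processed once the steps it requires have produced their results.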

After this change, adding a new processing step should consist of:

  • creating a new worker in the workers/ directory
  • updating the processing graph
  • updating the CI, tests, docs and deployment (docker-compose files, e2e tests, openapi, Helm chart)
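For example, registering a hypothetical new "/parquet" step (the name is an illustration, not part of this PR) would amount to one more entry in the graph; the services and existing workers need no code changes:

```python
# Hypothetical example: a new "/parquet" step that depends on "/splits".
PROCESSING_GRAPH = {
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
    "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": True},
    "/parquet": {"input_type": "dataset", "requires": "/splits", "required_by_dataset_viewer": False},
}

# Steps the Hub dataset viewer needs (used to fill /valid and /is-valid):
viewer_steps = [s for s, spec in PROCESSING_GRAPH.items() if spec["required_by_dataset_viewer"]]
print(viewer_steps)
```

Since the new step is not marked as required by the dataset viewer, /valid and /is-valid are unaffected by it.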

This also means that the services (API, admin) no longer contain any code that directly mentions /splits or /first-rows, and the splits worker contains no direct reference to /first-rows.

Other changes

  • code: the libcache and libqueue libraries have been merged into libcommon
  • the code that checks whether a dataset is supported (it exists, is not private, and access can be obtained programmatically if it is gated) has been factorized; it now runs before every processing step, and even before accepting to create a new job (through the webhook or through the /admin/force-refresh endpoint).
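    The support check can be sketched as a pure function over the dataset's Hub metadata. The field names below (`private`, `gated`) are illustrative stand-ins, not the actual libcommon schema:

    ```python
    def is_supported(info: dict | None, can_access_gated: bool) -> bool:
        """Sketch of the factorized support check: a dataset is processed only
        if it exists, is public, and (if gated) access can be obtained
        programmatically. `info` stands in for the Hub's dataset metadata."""
        if info is None:                 # dataset does not exist
            return False
        if info.get("private"):          # private datasets are not supported
            return False
        if info.get("gated") and not can_access_gated:
            return False                 # gated, and no programmatic access
        return True
    ```

    Running this single check both at job-creation time and before each step keeps unsupported datasets out of the queue entirely.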
  • add a new endpoint, /admin/cancel-jobs, which replaces the last admin scripts: sending a POST request is easier than calling a remote script.
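    A client call could look like the following sketch; the admin service address, the auth scheme, and the exact route shape are assumptions for illustration:

    ```python
    from urllib.request import Request

    def cancel_jobs_request(admin_url: str, job_type: str, token: str) -> Request:
        """Build a POST request to cancel all jobs of a given type.
        Hypothetical client code: route and auth scheme are assumptions."""
        url = f"{admin_url}/cancel-jobs/{job_type.lstrip('/')}"
        return Request(url, method="POST",
                       headers={"Authorization": f"Bearer {token}"})

    req = cancel_jobs_request("http://localhost:8081", "/first-rows", "token123")
    print(req.get_method(), req.full_url)
    ```

    The request would then be sent with `urllib.request.urlopen(req)` (or any HTTP client), with no remote script involved.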
  • simplify the code of the workers by factorizing some of it into libcommon:
    • the code that tests whether a job should be skipped, based on the versions of the git repository and the worker
    • the logic that catches errors and writes to the cache
      This way, the code of each worker now only contains what is specific to that worker.
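    The skip check can be sketched as follows; the cache-entry field names are illustrative, not the actual libcommon schema:

    ```python
    def should_skip_job(cached_entry: dict | None,
                        git_revision: str,
                        worker_version: str) -> bool:
        """Sketch of the factorized skip logic: reuse a cached result only if
        it was computed from the same dataset revision by the same worker
        version. Field names are hypothetical."""
        if cached_entry is None:  # nothing cached: the job must run
            return False
        return (cached_entry.get("dataset_git_revision") == git_revision
                and cached_entry.get("worker_version") == worker_version)
    ```

    With this check in libcommon, a worker's own module reduces to the step-specific computation.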

Breaking changes

  • env vars QUEUE_MAX_LOAD_PCT, QUEUE_MAX_MEMORY_PCT and QUEUE_SLEEP_SECONDS are renamed to WORKER_MAX_LOAD_PCT, WORKER_MAX_MEMORY_PCT and WORKER_SLEEP_SECONDS.

Notes from the commits:

  • the CI will not pass because the CI token is not allowed to push to refs/convert/parquet (it should be in the "datasets-maintainers" org), and refs/convert/parquet does not exist and cannot be created for now (we don't use it, and it's private for now)
  • associate each parquet file with a split and a config (based on path parsing)
  • gated datasets with extra fields are not supported; note also that only one token is used now
  • fix the tests, and disable gated+private for now
  • rename functions to be more accurate
  • replace the last scripts with the /cancel-jobs/xxx endpoints
@severo severo mentioned this pull request Nov 28, 2022
@severo severo merged commit 8e5f876 into main Nov 28, 2022
@severo severo deleted the implement-generic-processing-steps branch November 28, 2022 21:47