Implement generic processing steps #650

Merged: 69 commits into main on Nov 28, 2022
Conversation

@severo (Collaborator) commented on Nov 28, 2022

Generic implementation of a processing graph

Remove explicit mentions of /splits and /first-rows from the code, and move them to the "processing graph":

{
  "/splits": {"input_type": "dataset", "required_by_dataset_viewer": true},
  "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": true}
}

This JSON (see libcommon.config) defines the processing steps (here, /splits and /first-rows) and their dependency relationships (here, /first-rows depends on /splits). It also defines whether a processing step is required by the Hub dataset viewer (used to fill /valid and /is-valid).
A processing step is identified by its endpoint (/splits, /first-rows), from which the result of the step can be downloaded. The endpoint value is also used as the cache key and the job type.
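To make the dependency relationship concrete, here is a minimal sketch (not the actual libcommon code) of how such a graph can be traversed in dependency order, using Python's standard-library `graphlib`. The `processing_order` helper is an illustrative name:

```python
from graphlib import TopologicalSorter

# Mirror of the processing graph above, as a Python dict.
PROCESSING_GRAPH = {
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
    "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": True},
}

def processing_order(graph: dict) -> list:
    """Return the steps in an order that respects the 'requires' edges."""
    ts = TopologicalSorter()
    for step, spec in graph.items():
        deps = [spec["requires"]] if spec.get("requires") else []
        ts.add(step, *deps)
    return list(ts.static_order())

print(processing_order(PROCESSING_GRAPH))  # ['/splits', '/first-rows']
```

Because the graph is plain data, a scheduler can walk it generically: a step is only processed once the steps it requires have produced their results.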

After this change, adding a new processing step should consist of:

  • creating a new worker in the workers/ directory
  • updating the processing graph
  • updating the CI, tests, docs and deployment (docker-compose files, e2e tests, openapi, Helm chart)
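For example, registering a hypothetical new "/parquet" step (the name is an illustration, not part of this PR) would amount to one more entry in the graph; the services and existing workers need no code changes:

```python
# Hypothetical example: a new "/parquet" step that depends on "/splits".
PROCESSING_GRAPH = {
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
    "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": True},
    "/parquet": {"input_type": "dataset", "requires": "/splits", "required_by_dataset_viewer": False},
}

# Steps the Hub dataset viewer needs (used to fill /valid and /is-valid):
viewer_steps = [s for s, spec in PROCESSING_GRAPH.items() if spec["required_by_dataset_viewer"]]
print(viewer_steps)
```

Since the new step is not marked as required by the dataset viewer, /valid and /is-valid are unaffected by it.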

This also means that the services (API, admin) no longer contain any code that directly mentions /splits or /first-rows, and the splits worker contains no direct reference to /first-rows.

Other changes

  • code: the libcache and libqueue libraries have been merged into libcommon
  • the code that checks whether a dataset is supported (it exists, is not private, and access can be obtained programmatically if it is gated) has been factorized; it now runs before every processing step, and even before accepting to create a new job (through the webhook or through the /admin/force-refresh endpoint).
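    The support check can be sketched as a pure function over the dataset's Hub metadata. The field names below (`private`, `gated`) are illustrative stand-ins, not the actual libcommon schema:

    ```python
    def is_supported(info: dict | None, can_access_gated: bool) -> bool:
        """Sketch of the factorized support check: a dataset is processed only
        if it exists, is public, and (if gated) access can be obtained
        programmatically. `info` stands in for the Hub's dataset metadata."""
        if info is None:                 # dataset does not exist
            return False
        if info.get("private"):          # private datasets are not supported
            return False
        if info.get("gated") and not can_access_gated:
            return False                 # gated, and no programmatic access
        return True
    ```

    Running this single check both at job-creation time and before each step keeps unsupported datasets out of the queue entirely.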
  • add a new endpoint, /admin/cancel-jobs, which replaces the last admin scripts: sending a POST request is easier than calling a remote script.
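    A client call could look like the following sketch; the admin service address, the auth scheme, and the exact route shape are assumptions for illustration:

    ```python
    from urllib.request import Request

    def cancel_jobs_request(admin_url: str, job_type: str, token: str) -> Request:
        """Build a POST request to cancel all jobs of a given type.
        Hypothetical client code: route and auth scheme are assumptions."""
        url = f"{admin_url}/cancel-jobs/{job_type.lstrip('/')}"
        return Request(url, method="POST",
                       headers={"Authorization": f"Bearer {token}"})

    req = cancel_jobs_request("http://localhost:8081", "/first-rows", "token123")
    print(req.get_method(), req.full_url)
    ```

    The request would then be sent with `urllib.request.urlopen(req)` (or any HTTP client), with no remote script involved.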
  • simplify the code of the workers by factorizing some of it into libcommon:
    • the code that tests whether a job should be skipped, based on the versions of the git repository and the worker
    • the logic that catches errors and writes to the cache
      This way, the code of each worker now only contains what is specific to that worker.
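    The skip check can be sketched as follows; the cache-entry field names are illustrative, not the actual libcommon schema:

    ```python
    def should_skip_job(cached_entry: dict | None,
                        git_revision: str,
                        worker_version: str) -> bool:
        """Sketch of the factorized skip logic: reuse a cached result only if
        it was computed from the same dataset revision by the same worker
        version. Field names are hypothetical."""
        if cached_entry is None:  # nothing cached: the job must run
            return False
        return (cached_entry.get("dataset_git_revision") == git_revision
                and cached_entry.get("worker_version") == worker_version)
    ```

    With this check in libcommon, a worker's own module reduces to the step-specific computation.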

Breaking changes

  • env vars QUEUE_MAX_LOAD_PCT, QUEUE_MAX_MEMORY_PCT and QUEUE_SLEEP_SECONDS are renamed to WORKER_MAX_LOAD_PCT, WORKER_MAX_MEMORY_PCT and WORKER_SLEEP_SECONDS.

Notes from the commits:

  • the CI will not pass because the CI token is not allowed to push to refs/convert/parquet (it should be in the "datasets-maintainers" org), and refs/convert/parquet does not exist and cannot be created for now (we don't use it, and it's private for now)
  • associate each parquet file with a split and a config (based on path parsing)
  • gated datasets with extra fields are not supported; note also that only one token is used now
  • fix the tests, and disable gated+private for now
  • rename functions to be more accurate
  • replace the last scripts with the /cancel-jobs/xxx endpoints
@severo severo mentioned this pull request Nov 28, 2022
@severo severo merged commit 8e5f876 into main Nov 28, 2022
@severo severo deleted the implement-generic-processing-steps branch November 28, 2022 21:47