-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement generic processing steps #650
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Note that it will not pass the CI because - the CI token is not allowed to push to refs/convert/parquet (should be in the "datasets-maintainers" org) - the refs/convert/parquet does not exist and cannot be created for now
we don't use it, and it's private for now
associate each parquet file with a split and a config (based on path parsing)
Gated datasets with extra fields are not supported. Note also that only one token is used now.
Also fix the tests, and disable gated+private for now
also: rename functions to be more accurate
and replace the last scripts with the /cancel-jobs/xxx endpoints.
also: replace Dict with Mapping
some tests have been moved (commented yet) to e2e, since it becomes hard to simulate all the Hub endpoints -> better to test the scenari against the real Hub instead
since it's not the scope of this PR
Closed
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Generic implementation of a processing graph
Remove explicit mentions to /splits or /first-rows from code, and move them to the "processing graph":
This JSON (see libcommon.config) defines the processing steps (here /splits and /first-rows) and their dependency relationship (here /first-rows depends on /splits). It also defines if a processing step is required by the Hub dataset viewer (used to fill /valid and /is-valid).
A processing step is defined by the endpoint (/splits, /first-rows), where the result of the processing step can be downloaded. The endpoint value is also used as the cache key and the job type.
After this change, adding a new processing step should consist in:
workers/
directoryThis also means that the services (API, admin) don't contain any code mentioning directly splits or first-rows. And the splits worker does not contain direct reference to first-rows.
Other changes
This way, the code for every worker now only contains what is specific to that worker.
Breaking changes
QUEUE_MAX_LOAD_PCT
,QUEUE_MAX_MEMORY_PCT
andQUEUE_SLEEP_SECONDS
are renamed asWORKER_MAX_LOAD_PCT
,WORKER_MAX_MEMORY_PCT
andWORKER_SLEEP_SECONDS
.