Root-task withholding without co-assignment #6631

fjetter · 2022-06-24T16:19:15Z

We made an early attempt at root-task withholding to address the problem of root-task overproduction. Below are a couple of links with additional information (non-exhaustive).

We started experimenting with withholding worker assignment for root tasks, i.e. delaying worker assignment on the scheduler side; see #6560.

Early prototypes show very promising results and should reduce our cluster memory footprint. A prototype is available at #6614 (and should be ready for curious users to try).

Given that the current co-assignment logic has some significant shortcomings (e.g. #6597) and that withholding root tasks appears to be sufficient to control our memory footprint (some experimentation with configuration is still required), we should get the root-task withholding logic into a production-ready, i.e. mergeable, state and get rid of the current co-assignment logic.
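To make the idea concrete, here is a minimal toy sketch of scheduler-side withholding. It is not the code from #6614; `Worker`, `ToyScheduler`, and all method names are hypothetical and exist only for illustration. The assumption is that a root task stays in a scheduler-side queue until some worker has a free thread, so no worker ever holds more root tasks than it can run at once.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    # Hypothetical stand-in for a worker; not the real scheduler state class.
    name: str
    nthreads: int
    processing: set = field(default_factory=set)

    def has_free_slot(self) -> bool:
        return len(self.processing) < self.nthreads


class ToyScheduler:
    """Toy scheduler that withholds root tasks until a thread is free."""

    def __init__(self, workers):
        self.workers = workers
        self.queued = deque()  # root tasks not yet assigned to any worker

    def submit_root_tasks(self, keys):
        self.queued.extend(keys)
        self._fill_free_slots()

    def task_finished(self, worker, key):
        worker.processing.discard(key)
        # A slot opened up, so hand out the next withheld task (if any).
        self._fill_free_slots()

    def _fill_free_slots(self):
        for worker in self.workers:
            while self.queued and worker.has_free_slot():
                worker.processing.add(self.queued.popleft())


workers = [Worker("a", nthreads=2), Worker("b", nthreads=1)]
sched = ToyScheduler(workers)
sched.submit_root_tasks([f"root-{i}" for i in range(10)])

# Only three tasks are assigned (one per thread); seven stay on the scheduler.
assigned_before = {w.name: sorted(w.processing) for w in workers}
queued_before = len(sched.queued)

# Finishing a task frees a slot, which pulls exactly one task from the queue.
sched.task_finished(workers[0], "root-0")
queued_after = len(sched.queued)
```

Because assignment happens only when a slot opens, the scheduler can still choose where each remaining root task goes at the last moment, rather than committing all of them to workers up front.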

This should be verified by thorough performance benchmarks. See coiled/benchmarks#191 for work on automated benchmarks.

Once this is solid, we may consider adding a more robust co-assignment logic in a follow-up step, if necessary.

AC

The prototype PR is merged and the new assignment logic is hidden behind a feature toggle
The feature toggle is disabled by default
There is a CI job with an experimental flag running on Ubuntu on a single Python version that has this feature toggle enabled. All failing tests are specifically marked and are allowed to be skipped on this job.
A follow-up ticket with an overview of all skipped tests is created
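As an illustration of the criteria above, the following sketch shows one way a default-off toggle and specifically marked tests could fit together. It uses only the standard library. The environment variable name, `withholding_enabled`, and `skip_with_withholding` are hypothetical; the actual toggle and test markers are not specified in this issue.

```python
import os
import unittest

# Hypothetical toggle name; the real configuration key is not stated here.
TOGGLE_ENV = "EXPERIMENTAL_ROOT_TASK_WITHHOLDING"


def withholding_enabled(environ=None):
    """Return True only if the toggle is explicitly switched on (default: off)."""
    environ = os.environ if environ is None else environ
    return environ.get(TOGGLE_ENV, "0") == "1"


def skip_with_withholding(reason):
    """Mark a test as known to fail when the toggle is enabled."""
    return unittest.skipIf(
        withholding_enabled(),
        f"known failure with root-task withholding: {reason}",
    )


class TestScheduling(unittest.TestCase):
    @skip_with_withholding("relies on the current co-assignment logic")
    def test_coassignment_layout(self):
        self.assertTrue(True)

    def test_unaffected_by_toggle(self):
        self.assertEqual(1 + 1, 2)
```

Giving every skip an explicit reason makes the skipped tests easy to search for, which helps when compiling the overview for the follow-up ticket.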

gjoseph92 · 2022-06-24T16:50:06Z

#6614 currently implements this behind a feature flag. When the feature flag is turned off (current default), scheduling logic stays as-is, not only keeping co-assignment, but even fixing #6597.

For this ticket, is root task withholding by default the goal, or do we just want to get it in behind a feature flag?

I imagine performance benchmarks will be an important part of answering this question, as well as community input. But there's also the question of getting the entire test suite to pass under a new scheduling approach, and whether that's in scope or should be a follow-up task.

gjoseph92 · 2022-08-31T15:03:06Z

Reopening, since these still need to happen:

There is a CI job with an experimental flag running on Ubuntu on a single Python version that has this feature toggle enabled. All failing tests are specifically marked and are allowed to be skipped on this job: CI job running tests with queuing on #6989
A follow-up ticket with an overview of all skipped tests is created: Tests skipped with queuing active #6998

fjetter added the enhancement, performance, and scheduling labels Jun 24, 2022

fjetter mentioned this issue Jun 24, 2022

Design and prototype for root-ish task deprioritization by withholding tasks on the scheduler #6560

Closed

hayesgb assigned gjoseph92 Aug 5, 2022

gjoseph92 mentioned this issue Aug 17, 2022

Withhold root tasks [no co assignment] #6614

Merged

2 tasks

crusaderky mentioned this issue Aug 24, 2022

Document Scheduler and Worker state machine #6948

Merged

fjetter closed this as completed in #6614 Aug 31, 2022

gjoseph92 reopened this Aug 31, 2022

gjoseph92 mentioned this issue Sep 2, 2022

CI job running tests with queuing on #6989

Merged

2 tasks

crusaderky closed this as completed in #6989 Sep 5, 2022
