You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We had an early attempt to experiment with root-task withholding to address the problem of root-task-overproduction. Below a couple of links with additional information (non-exhaustive)
We started an experimentation trying to withhold worker assignment for root tasks, i.e. delay worker assignment scheduler side, see #6560
Early prototypes show very promising results that should improve our cluster memory footprint. A prototype is available at #6614 (and should be ready to try for curious users)
Given that the current co-assignment logic has some significant shortcomings (e.g. #6597) and the withholding of root-tasks appears to be sufficient to control our memory footprint (some experimentation on configuration is still required) we should get the root-task withhold logic in a production ready, i.e. merge-able state and get rid of the current co-assignment logic.
This should be verified by thorough performance benchmark results, for this, see coiled/benchmarks#191 for work on automated benchmarks.
Once this is solid, we may consider adding a more robust co-assignment logic in a follow up step, if necessary.
AC
The prototype PR is merged and the new assignment logic is hidden behind a feature toggle
The feature toggle is disabled by default
There is a CI job with an experimental flag running on ubuntu on a single python version that has this feature toggle enabled. All failing tests are specifically marked and are allowed to be skipped on this job.
A follow up ticket with an overview of all skipped tests is created
The text was updated successfully, but these errors were encountered:
#6614 currently implements this behind a feature flag. When the feature flag is turned off (current default), scheduling logic stays as-is, not only keeping co-assignment, but even fixing #6597.
For this ticket, is root task withholding by default the goal, or do we just want to get it in behind a feature flag?
I imagine performance benchmarks will be an important part of answering this question, as well community input. But there's a also the question getting the entire test suite to pass under a new scheduling approach, and whether that's in scope or should be a follow-up task.
There is a CI job with an experimental flag running on ubuntu on a single python version that has this feature toggle enabled. All failing tests are specifically marked and are allowed to be skipped on this job. CI job running tests with queuing on #6989
We had an early attempt to experiment with root-task withholding to address the problem of root-task-overproduction. Below a couple of links with additional information (non-exhaustive)
We started an experimentation trying to withhold worker assignment for root tasks, i.e. delay worker assignment scheduler side, see #6560
Early prototypes show very promising results that should improve our cluster memory footprint. A prototype is available at #6614 (and should be ready to try for curious users)
Given that the current co-assignment logic has some significant shortcomings (e.g. #6597) and the withholding of root-tasks appears to be sufficient to control our memory footprint (some experimentation on configuration is still required) we should get the root-task withhold logic in a production ready, i.e. merge-able state and get rid of the current co-assignment logic.
This should be verified by thorough performance benchmark results, for this, see coiled/benchmarks#191 for work on automated benchmarks.
Once this is solid, we may consider adding a more robust co-assignment logic in a follow up step, if necessary.
AC
The text was updated successfully, but these errors were encountered: