spread scheduling algorithm #7810
Conversation
Looks great. We'll need docs and a changelog entry as well.
AFAICT this can be set via `default_scheduler_config`, but I thought I'd double check, as it seems likely operators will want to bake this setting in from the beginning for bare metal clusters.
Yes! The parsing is tested.
Force-pushed from d73f869 to 9962b9f.
Is there documentation for this feature? I'd like to use spread over binpack but haven't been able to find how to implement it yet.
You can set it in an existing cluster using the API - https://www.nomadproject.io/api-docs/operator/#update-scheduler-configuration. For a new cluster, you can pre-configure the defaults here: https://www.nomadproject.io/docs/configuration/server/#configuring-scheduler-config
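For the new-cluster case, here is a minimal sketch of the server stanza based on the configuration docs linked above; the `bootstrap_expect` value is illustrative:

```hcl
server {
  enabled          = true
  bootstrap_expect = 3

  # Only applied when the cluster is first bootstrapped;
  # later changes must go through the operator API.
  default_scheduler_config {
    scheduler_algorithm = "spread" # defaults to "binpack"
  }
}
```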
General question:
If one of the two allocations were moved at that point, everything would fit. Are there any plans for future versions to improve scheduling so that already-running allocations can be redistributed to make room for new allocations as jobs come and go?
This PR introduces a spread scheduling algorithm as an alternative to binpacking.
In mostly-static clusters, some cluster operators prefer to spread the load rather than binpack it. When spreading, clients share the workload more evenly. Spreading also handles failures a bit more gracefully: if an allocation misbehaves and becomes a noisy neighbor by saturating IO or network (currently not isolated by Nomad), it affects fewer allocations. Also, if a client becomes unhealthy, the bound on the number of affected allocations is tighter.
One significant downside of spreading is the management of large allocations in fragmented clusters. Consider a simple case: 2 nodes with 4GB each, each running an allocation using 2GB; a new job requiring 3GB of RAM will fail to schedule despite the cluster having 4GB of free RAM. Though possible under the binpacking algorithm, this is more likely under spread. Operators can mitigate it by ensuring that jobs require a small fraction of each node's resources, separating large jobs into their own datacenter/region, and/or ensuring that jobs' CPU/RAM requirements are similar.
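To make the fragmentation example concrete, here is a hypothetical job whose single task requests 3GB; the job name, driver, and image are illustrative:

```hcl
# With two 4GB nodes each already running a 2GB allocation, no single
# node has 3GB free, so this task cannot be placed even though the
# cluster has 4GB free in aggregate.
job "large-app" {
  datacenters = ["dc1"]

  group "app" {
    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.17"
      }

      resources {
        memory = 3072 # MB
      }
    }
  }
}
```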
Implementation Details
This implementation adds a `SchedulerAlgorithm` option to the operator `SchedulerConfiguration` knob. The option applies at the cluster/region level, so it is controlled by cluster operators rather than job submitters. Given that the advantages (and disadvantages) affect overall cluster performance and management rather than individual jobs, it makes sense for operators to control it.
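As a sketch, updating the option on a running cluster goes through the scheduler configuration endpoint linked above, e.g. `curl -X PUT -d @payload.json http://127.0.0.1:4646/v1/operator/scheduler/configuration`, with a payload along these lines (the `PreemptionConfig` values are illustrative; the update replaces the whole configuration, so include your existing preemption settings):

```json
{
  "SchedulerAlgorithm": "spread",
  "PreemptionConfig": {
    "SystemSchedulerEnabled": true,
    "BatchSchedulerEnabled": false,
    "ServiceSchedulerEnabled": false
  }
}
```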
Future iterations may allow scheduler config options per datacenter or another layer of pooling. Depending on demand, we can introduce a per-DC config override, both for this scheduling algorithm and for preemption.

@angrycub wrote most of the implementation; I mostly made cosmetic changes.