spread scheduling algorithm #7810

Merged: notnoop merged 4 commits into master from spread-configuration on May 1, 2020

Conversation

notnoop
Contributor

@notnoop notnoop commented Apr 27, 2020

This PR introduces a spread scheduling algorithm as an alternative to binpacking.

In mostly-static clusters, some cluster operators prefer to spread the load across clients rather than binpack it. When spreading, clients share the workload more evenly. Spreading also handles failures a bit more gracefully: if an allocation misbehaves and becomes a noisy neighbor by saturating IO or network (neither is currently isolated by Nomad), it affects fewer allocations. Likewise, if a client becomes unhealthy, the bound on the number of affected allocations is tighter.

One significant downside of spreading is the management of large allocations in fragmented clusters. Consider a simple case: two nodes with 4GB each, each running an allocation that uses 2GB. A new job requiring 3GB of RAM will fail to schedule despite the cluster having 4GB of free RAM. Though this is possible with the binpacking algorithm too, it is more likely under spread conditions. Operators can mitigate this by ensuring that each job requires only a small fraction of a node's resources, by separating large jobs into their own datacenter/region, and/or by ensuring that job CPU/RAM requirements are similar.
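The two-node case above can be reproduced with a toy placement loop. This is only a minimal sketch under simplifying assumptions: the `node` type, the `place` and `simulate` helpers, and the "least free memory" / "most free memory" rules are hypothetical stand-ins, not Nomad's actual scoring code.

```go
package main

import "fmt"

// node is a hypothetical, simplified client with a fixed memory capacity.
type node struct {
	name   string
	capMB  int
	usedMB int
}

// free returns the memory still available on the node.
func (n *node) free() int { return n.capMB - n.usedMB }

// place puts an allocation of sizeMB on one feasible node and reports whether
// placement succeeded. In this sketch "binpack" prefers the feasible node with
// the least free memory and "spread" prefers the one with the most.
func place(nodes []*node, sizeMB int, algorithm string) bool {
	var best *node
	for _, n := range nodes {
		if n.free() < sizeMB {
			continue // allocation does not fit on this node
		}
		if best == nil {
			best = n
			continue
		}
		switch algorithm {
		case "binpack":
			if n.free() < best.free() {
				best = n
			}
		case "spread":
			if n.free() > best.free() {
				best = n
			}
		}
	}
	if best == nil {
		return false
	}
	best.usedMB += sizeMB
	return true
}

// simulate places two 2GB allocations on two 4GB nodes, then tries a 3GB job.
func simulate(algorithm string) bool {
	nodes := []*node{{name: "a", capMB: 4096}, {name: "b", capMB: 4096}}
	place(nodes, 2048, algorithm)
	place(nodes, 2048, algorithm)
	return place(nodes, 3072, algorithm)
}

func main() {
	for _, alg := range []string{"binpack", "spread"} {
		fmt.Printf("%s: 3GB job placed = %v\n", alg, simulate(alg))
	}
}
```

Under these assumptions the binpack run stacks both 2GB allocations on one node and leaves the other empty, so the 3GB job fits; the spread run leaves 2GB free on each node, so the same job cannot be placed even though 4GB is free in total.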

Implementation Details

This implementation adds a SchedulerAlgorithm option to the operator SchedulerConfiguration knob.
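As a rough illustration of how an operator might flip this option over HTTP, the sketch below builds (but does not send) an update request. The `schedulerConfig` struct and `newSchedulerRequest` helper are hypothetical, and the endpoint path, the agent address, and the `"binpack"`/`"spread"` values are assumptions based on the operator API documentation rather than on this diff. Other scheduler settings, such as preemption, are left out here, so check the API docs before sending a partial body.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// schedulerConfig is a local, illustrative stand-in for the request body.
// Only the SchedulerAlgorithm field name comes from this PR.
type schedulerConfig struct {
	SchedulerAlgorithm string
}

// newSchedulerRequest builds, without sending, a PUT request that would set
// the scheduler algorithm. The endpoint path is an assumption.
func newSchedulerRequest(addr, algorithm string) (*http.Request, error) {
	if algorithm != "binpack" && algorithm != "spread" {
		return nil, fmt.Errorf("unknown scheduler algorithm %q", algorithm)
	}
	body, err := json.Marshal(schedulerConfig{SchedulerAlgorithm: algorithm})
	if err != nil {
		return nil, err
	}
	url := addr + "/v1/operator/scheduler/configuration"
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// The address below is a placeholder for a local agent.
	req, err := newSchedulerRequest("http://127.0.0.1:4646", "spread")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Validating the algorithm name on the client side is a design choice of this sketch; the server is still the authority on which values are accepted.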

The flag applies at the cluster/region level, so it is controlled by cluster operators rather than job submitters. Because the advantages (and disadvantages) affect overall cluster performance and management rather than individual jobs, operators are the ones who should control it.

Future iterations may allow scheduler config options per datacenter or at another layer of pooling. If demand arises, we can introduce a per-DC config override for both this scheduling algorithm and preemption.

@angrycub wrote most of the implementation. I mostly made cosmetic changes.

Member

@schmichael schmichael left a comment

Looks great. We'll need docs and a changelog entry as well.

AFAICT this can be set via default_scheduler_config, but I thought I'd double check as it seems likely operators will want to bake this setting in from the beginning for bare metal clusters.

@notnoop
Contributor Author

notnoop commented Apr 30, 2020

> AFAICT this can be set via default_scheduler_config, but I thought I'd double check as it seems likely operators will want to bake this setting in from the beginning for bare metal clusters.

Yes! The parsing is tested by the change to command/agent/testdata/basic.hcl, and from there it follows the same path as before: the server commits the default scheduler config to Raft when it becomes leader.

@notnoop notnoop force-pushed the spread-configuration branch from d73f869 to 9962b9f Compare May 1, 2020 17:14
@notnoop notnoop merged commit f5775de into master May 1, 2020
@notnoop notnoop deleted the spread-configuration branch May 1, 2020 17:15
notnoop pushed a commit that referenced this pull request May 1, 2020
notnoop pushed a commit that referenced this pull request May 3, 2020
@spuder
Contributor

spuder commented May 17, 2020

Is there documentation for this feature? I'd like to use spread over binpack but haven't been able to find how to implement it yet.

@angrycub
Contributor

You can set it in an existing cluster using the API - https://www.nomadproject.io/api-docs/operator/#update-scheduler-configuration. For a new cluster, you can pre-configure the defaults here https://www.nomadproject.io/docs/configuration/server/#configuring-scheduler-config

@schmichael
Member

@spuder: I also created #7999 to improve our documentation. The code change is disproportionately small relative to the impact it has on scheduling! We should guide users appropriately.

@Legogris

General question:

> One significant downside of spreading is the management of large allocations in fragmented clusters. Consider a simple case: two nodes with 4GB each, each running an allocation that uses 2GB. A new job requiring 3GB of RAM will fail to schedule despite the cluster having 4GB of free RAM.

If one of the two jobs were reallocated at that point, all of the jobs would fit. Are there any plans for future versions to let the scheduler redistribute already-running jobs so that allocations fit better as jobs come and go?

@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 30, 2022