cgroups: consider pre-setting cpuset in docker driver #12374
For better visibility, quoted from #12274:

> Our trading infrastructure reads the CPUs in the cgroup on startup and indexes into the cpuset to pin threads to specific cores. If the cgroup gets modified after we've done this thread <-> core pinning, I'm not entirely sure what happens, but it's bad: either threads will still be pinned to CPUs outside of the allocated cgroup, or it just breaks altogether.
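For context, a minimal Go sketch of the kind of startup pinning described above. This is an illustration, not the poster's actual code: the cgroup path and the choice of the first listed core are assumptions (a real service would resolve its own cgroup from /proc/self/cgroup, and under cgroups v1 the file lives under /sys/fs/cgroup/cpuset/).

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	// Read the cpuset assigned to this task's cgroup (path is an assumption).
	raw, err := os.ReadFile("/sys/fs/cgroup/cpuset.cpus")
	if err != nil {
		panic(err)
	}

	// Take the first core listed, e.g. "3" from "3,5,7" or "3-6".
	fields := strings.FieldsFunc(strings.TrimSpace(string(raw)), func(r rune) bool {
		return r == ',' || r == '-'
	})
	if len(fields) == 0 {
		panic("empty cpuset")
	}
	core, err := strconv.Atoi(fields[0])
	if err != nil {
		panic(err)
	}

	// Pin the calling OS thread to that core. If the cgroup's cpuset changes
	// afterwards, this affinity can point at cores outside the cgroup, which
	// is exactly the breakage described above.
	runtime.LockOSThread()
	var set unix.CPUSet
	set.Set(core)
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		panic(err)
	}
	fmt.Println("pinned to core", core)
}
```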
@aholyoake-bc thanks for pointing out your use case; that is a case where there is a detectable change in behavior. With the cgroups v1 controllers you could read just the reserved cgroup's cpuset, but with v2 there is only one cpuset, containing both reserved and shared cores. The immediate workaround would be to either avoid using a system with only cgroup v2 support (the v2 behavior is disabled on v1 and hybrid systems) or avoid upgrading Nomad. We could also consider adding an environment variable that exposes just the reserved cores.

While working on cgroups v2 support I mapped out the plumbing necessary for proper NUMA-aware scheduling. An important part of that will be enabling strict CPU core pinning based on system topology (i.e. memory & PCIe device associativity). I think that's going to be the long-term solution for use cases like yours.
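As a hedged illustration of the "avoid a system with only cgroup v2 support" workaround, the sketch below checks whether /sys/fs/cgroup is mounted as the unified (v2-only) hierarchy; on a v1 or hybrid host that mount point is a tmpfs with per-controller subdirectories instead. This is not Nomad's own detection code, just one way to tell the configurations apart.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var fs unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &fs); err != nil {
		panic(err)
	}
	// CGROUP2_SUPER_MAGIC means the whole hierarchy is unified (v2 only),
	// which is the configuration where reserved and shared cores end up in
	// a single cpuset. A tmpfs here indicates a v1 or hybrid layout.
	if fs.Type == unix.CGROUP2_SUPER_MAGIC {
		fmt.Println("cgroup v2 (unified) only")
	} else {
		fmt.Println("cgroup v1 or hybrid")
	}
}
```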
FWIW, we would be very supportive of having this information exposed in an environment variable, as it would make the migration to having Nomad manage the cores significantly easier. I might be pushing my luck here, but exposing it as something like … would be even dreamier.
Out of interest, what are your current ideas for the definition of the NUMA-aware scheduling (basically, what it looks like in the job file)?
This is off the top of my head and 1000% subject to change, but something like:

```hcl
job "models" {
  datacenters = ["dc1"]

  group "build-models" {
    task "model1" {
      driver = "exec"

      config {
        command = "/bin/builder"
      }

      resources {
        cores  = 8
        memory = 8192

        numa {
          mode    = "<disable|preferred|required>"
          devices = ["nvidia/gpu"]
        }

        device "nvidia/gpu" {
          # ...
        }

        device "other" {
          # ...
        }
      }
    }
  }
}
```

Where `mode` selects how strictly NUMA placement is applied (`disable`, `preferred`, or `required`) and `devices` lists the devices that should be co-located with the allocated cores. As an aside, I think the Client configuration could also take this into account. I realize this only scratches the surface of NUMA topology complexity, so by all means I'm open to feedback!
This PR injects the `NOMAD_CPU_CORES` environment variable into tasks that have been allocated reserved CPU cores. The value uses normal cpuset notation, as found in cpuset.cpus cgroup interface files. Note this value is not necessarily the same as the content of the actual cpuset.cpus interface file, which will also include shared CPU cores when using cgroups v2. This variable is a workaround for users who used to be able to read the reserved cgroup's cpuset file, but lose the information about distinct reserved cores when using cgroups v2. Side discussion in: #12374
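A minimal sketch of how a task might consume the new variable. The parser simply expands the standard cpuset notation described above; the example value in the comment is illustrative, not taken from the thread.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// expandCpuset expands cpuset notation such as "0-3,8" into individual core IDs.
func expandCpuset(s string) ([]int, error) {
	var cores []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		lo, hi, isRange := strings.Cut(part, "-")
		if !isRange {
			hi = lo
		}
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		end, err := strconv.Atoi(hi)
		if err != nil {
			return nil, err
		}
		for c := start; c <= end; c++ {
			cores = append(cores, c)
		}
	}
	return cores, nil
}

func main() {
	// Contains only the task's reserved cores, e.g. "4-7"; unlike cpuset.cpus
	// under cgroups v2, it never includes the shared pool.
	reserved, err := expandCpuset(os.Getenv("NOMAD_CPU_CORES"))
	if err != nil {
		panic(err)
	}
	fmt.Println("reserved cores:", reserved)
}
```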
#12496 sets a `NOMAD_CPU_CORES` environment variable containing the task's reserved cores.

I'm going to go ahead and close this ticket, because there's no way to achieve the exactly correct behavior of setting the Docker task config's cpuset to reserved + shared before launching the task: that value is dynamic and unknowable until the task is actually launching. If we were to do something different, it would be trading one incorrect behavior for another, and in that context it seems sensible to just leave the existing behavior as-is.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
With the cgroups v2 implementation, we could potentially set the cpuset value in the `TaskConfig` to the `reserved` cores. The tradeoff is I don't think we have access to the `shared` cores at that point, meaning the task will be limited to only its reserved cores at launch time.