migrate cpuset reserved partition when upgrading to 1.7+ #19847
Comments
If I create the directory on the host myself and then restart the client, all is fine and I see files appear in the "reserve" directory.
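For reference, the manual workaround described above might look like this on a cgroup v1 host. This is a hedged sketch, not part of Nomad: the helper name is illustrative, and the real target path would be /sys/fs/cgroup/cpuset (requiring root), so the demo uses a scratch directory instead.

```shell
#!/bin/sh
# Sketch of the manual workaround: pre-create the partition directory that
# Nomad 1.7+ expects, then restart the client. The cgroup root is taken as
# a parameter so the demo below can use a scratch directory; on a real
# host it would be /sys/fs/cgroup/cpuset.
precreate_reserve() {
  cgroup_root="$1"
  mkdir -p "${cgroup_root}/nomad/reserve"
}

# Demo against a temporary directory standing in for the real cgroup mount.
demo_root="$(mktemp -d)"
precreate_reserve "${demo_root}"
ls "${demo_root}/nomad"
# On a real host, the client restart would follow, e.g.: systemctl restart nomad
```

On a real node the mkdir and the client restart both need root, and the directory must exist before the client starts so the agent finds the expected partition.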
|
It might be useful, but I'm facing the same issue on |
It happens to us as well, migrating from 1.6.1 -> 1.7.3. On our current 1.6.1 servers, the nomad cpuset subsystem cgroup reservation is created in:
When upgrading to 1.7.3, we get the same reported error for the allocations:
If we restart the nodes, the cpuset subsystem reservation directory is created in the expected |
Hi everyone 👋 I'm still trying to reproduce this issue, but in the meantime, would you be able to check your Nomad client logs for a message such as Thanks! |
Hi @lgfa29
I checked the logs after the upgrade and I couldn't find
|
Thanks for the extra info @cesan3! Yeah, I just noticed that there are several paths where an error can happen, each with a different error message. Unfortunately there's not much that we can do in this case, as there are multiple reasons why those path creations may fail. But the agent shouldn't start in a state where it can't run tasks, so I opened #19915 to handle this. |
So, quick question @lgfa29: do we have another ticket to fix the original problem regarding the migration path from Nomad 1.6.1 -> 1.7.x? Now, with the fix, my migration stops earlier, when the Nomad agent starts:
Are there any plans to fix the migration? |
Could you check which process is keeping that path busy using something like I will reopen this issue until we better understand the problem. |
Hey @lgfa29, I presume that the 2 running allocations are keeping it busy?
Maybe some cgroup active children? I checked
and
But the only way of fixing it this time was rebooting the server. |
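As a reference point for the "what is keeping it busy" checks discussed above: on cgroup v1, each cgroup directory lists its member PIDs in cgroup.procs, and a cgroup cannot be removed while any of those lists is non-empty. The helper below is a hypothetical sketch; on a real host the root would be /sys/fs/cgroup/cpuset/nomad, so the demo uses a scratch tree instead.

```shell
#!/bin/sh
# List every PID still attached anywhere under a cpuset hierarchy.
# On cgroup v1, member PIDs live in each directory's cgroup.procs file;
# a cgroup directory cannot be rmdir'd while these lists are non-empty.
list_cgroup_pids() {
  find "$1" -name cgroup.procs -exec cat {} + 2>/dev/null | sort -un
}

# Demo against a scratch tree standing in for /sys/fs/cgroup/cpuset/nomad,
# with fake cgroup.procs files in place of the kernel-managed ones.
root="$(mktemp -d)"
mkdir -p "${root}/share/abc123"
printf '4321\n' > "${root}/share/abc123/cgroup.procs"
printf '1234\n' > "${root}/cgroup.procs"
list_cgroup_pids "${root}"
```

If this prints any PIDs on a real host, those processes (e.g. still-running allocation tasks) are what keeps the hierarchy busy until they exit or are moved out.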
As part of the upgrade from 1.6.1 -> 1.7.3, we now create the /sys/fs/cgroup/cpuset/nomad/reserve directory ahead of the client restart, which resolved the issue on the majority of nodes. However, some nodes then exhibited a similar issue related to the /sys/fs/cgroup/cpuset/nomad/share directory not being present; it seems this was /sys/fs/cgroup/cpuset/nomad/shared in 1.6.1. Creating this directory ahead of the client restart also helps, as does a full system reboot.
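The pre-create step described above can be sketched as follows, covering both renamed partitions (reserved -> reserve, shared -> share). The helper name is illustrative; on a real host you would point it at /sys/fs/cgroup/cpuset and run it as root before restarting the client, so the demo below uses a scratch directory laid out like a 1.6.x hierarchy.

```shell
#!/bin/sh
# Ensure the 1.7.x partition directories (reserve, share) exist alongside
# the 1.6.x ones (reserved, shared) before the client restart.
ensure_17_partitions() {
  root="$1"   # cpuset cgroup root; /sys/fs/cgroup/cpuset on a real host
  for d in reserve share; do
    mkdir -p "${root}/nomad/${d}"
  done
}

# Demo against a scratch directory mimicking the 1.6.x layout.
demo_root="$(mktemp -d)"
mkdir -p "${demo_root}/nomad/reserved" "${demo_root}/nomad/shared"
ensure_17_partitions "${demo_root}"
ls "${demo_root}/nomad"
```

Running this ahead of the restart on each node matches the workaround above: the old 1.6.x directories are left in place and only the new names are added.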
Is there a reason these directories seem to have changed from reserved and shared in 1.6.1 to reserve and share in 1.7.3? |
Same issue while upgrading from 1.6.6 -> 1.7.7. The issue does not always reproduce. |
Unfortunately, in our case, to migrate from 1.6.x to 1.7.x, we had to automate the creation of the expected directories and the nomad cgroup controller removal using But once you're on 1.7.x, you can upgrade normally. |
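A sketch of that automated cleanup, under the same assumptions as above (helper names are illustrative, not from Nomad): cgroup directories are removed with rmdir, deepest first, and removal fails while any child cgroup still has attached processes, which is why draining allocations or rebooting first matters. The demo targets a scratch tree; on a real host the root would be /sys/fs/cgroup/cpuset/nomad and would need root privileges.

```shell
#!/bin/sh
# Remove an old nomad cpuset cgroup tree, deepest directories first.
# On a real cgroup v1 host, rmdir on a cgroup directory fails while any
# cgroup in the tree still has attached PIDs.
remove_cgroup_tree() {
  find "$1" -depth -type d -exec rmdir {} \;
}

# Demo against a scratch tree standing in for the real hierarchy.
demo_root="$(mktemp -d)"
tree="${demo_root}/nomad"
mkdir -p "${tree}/reserved" "${tree}/shared"
remove_cgroup_tree "${tree}"
[ -d "${tree}" ] && echo "still present" || echo "removed"
```

Combined with pre-creating the 1.7.x directory names, this mirrors the migration automation described above: tear down the stale 1.6.x tree, then let the 1.7.x client recreate what it needs.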
Doing a little bit of issue cleanup. There's a workaround for the original issue here, but the upgrade path is still not very nice. I'm going to re-title this and mark it for roadmapping. The underlying issue is that in 1.7.x and beyond the name of the partition is |
Nomad version
$ nomad version
Nomad v1.7.3
BuildDate 2024-01-15T16:55:40Z
Revision 60ee328
Operating system and Environment details
CentOS 7
Issue
When upgrading clients from 1.6.1 -> 1.7.3, we are getting the below error:
The file, as per the error, doesn't exist, but it does exist at:
/sys/fs/cgroup/cpuset/nomad/reserved/cpuset.cpus
Nomad Client logs (if appropriate)
Reproduction steps
This happened on a client which we updated from 1.6.1 -> 1.7.3; the servers had previously been updated to 1.7.3 with no issues.
Expected Result
Job runs as expected
Actual Result
Job fails to run on clients updated to 1.7.3.
Job file (if appropriate)