runc systemd cgroup driver logic is wrong #3780
Comments
Thoughts on fixing this mess.

Issue C (integration test failure on CentOS 9). Issue C can be mitigated by using random container names, or by setting random cgroup paths. But this will just make the CI green, not fix any real bugs. So, we need to fix issues A and B instead.

Issue B (ignoring "unit exists" error from startUnit). I am not quite sure how to address Issue B, as I don't understand why "unit exists" is ignored. The code to do that was added by #1124, which says it's done for slices (I guess this is when kubelet uses libcontainer to create slices). Side note: the logic introduced was not working and was fixed a few times later; see e.g. #1683, #1754, #1772. I think we should NOT ignore the "unit exists" error in case there's a PID to be added, since that means the PID is not placed into a proper unit and cgroup. This will also retain backward compatibility, keeping the use case from #1124 working (see the sketch after this comment). Solving this without breaking kubelet requires a deeper look into how the latter uses libcontainer. Alternatively, we can fix it right away, wait for kubelet to break when they switch to runc 1.2.x, and figure out how to fix it then.

Issue A (leaving a failed unit after the test). This can be solved in two ways
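To make the Issue B proposal above concrete, here is a minimal Go sketch of the conditional ignore, assuming go-systemd's dbus bindings; the function shape and the hasPid parameter are illustrative, not runc's actual code:

```go
package main

import (
	"context"
	"errors"
	"fmt"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

// startUnit sketches the proposed behavior: tolerate the
// org.freedesktop.systemd1.UnitExists error only when no PID is being
// placed into the unit (the "kubelet creates a slice" case from #1124).
func startUnit(cm *systemdDbus.Conn, unitName string, props []systemdDbus.Property, hasPid bool) error {
	statusChan := make(chan string, 1)
	_, err := cm.StartTransientUnitContext(context.Background(), unitName, "replace", props, statusChan)
	if err != nil {
		var dbusErr godbus.Error
		if errors.As(err, &dbusErr) && dbusErr.Name == "org.freedesktop.systemd1.UnitExists" && !hasPid {
			// No PID to place: keep the old, backward-compatible behavior.
			return nil
		}
		return fmt.Errorf("starting transient unit %s: %w", unitName, err)
	}
	// Wait for systemd to report the job result; anything but "done" is an error.
	if s := <-statusChan; s != "done" {
		return fmt.Errorf("unit %s: job result %q", unitName, s)
	}
	return nil
}
```

The key point is that a pre-existing unit is only acceptable when the call carries nothing to add to it.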
Interesting write-up! @AkihiroSuda @dmcgowan (without having looked closely); could any of this also apply to https://github.com/containerd/cgroups ?
Fixed in main branch; need to backport #3782 to release-1.1
It has the very same issue. Filed containerd/cgroups#279
According to the OCI runtime spec [1], the runtime's delete is supposed to remove all of the container's artefacts. In case the systemd cgroup driver is used and the systemd unit has failed (e.g. oom-killed), systemd won't remove the unit (that is, unless the "CollectMode: inactive-or-failed" property is set). Leaving a leftover failed unit is a violation of the runtime spec; in addition, a leftover unit results in the inability to start a container with the same systemd unit name (such an operation fails with a "unit already exists" error).

Call reset-failed from the systemd cgroup manager's destroy_cgroup call, so the failed unit will be removed (by systemd) after "crun delete". This change is similar to the one in runc (see [2]). A (slightly modified) test case from runc added by the above change was used to check that the bug is fixed. For the bigger picture, see [3] (issue A) and [4].

To test manually, systemd >= 244 is needed. Create a container config that runs "sleep 10" and has the following systemd annotations:

org.systemd.property.RuntimeMaxUSec: "uint64 2000000"
org.systemd.property.TimeoutStopUSec: "uint64 1000000"

Start a container using the --systemd-cgroup option. The container will be killed by systemd in 2 seconds, so its systemd unit status will be "failed". Once it has failed, "systemctl status $UNIT_NAME" should exit with code 3 (meaning "unit is not active"). Now, run "crun delete $CTID" and repeat "systemctl status $UNIT_NAME". It should exit with code 4 (meaning "no such unit").

[1] https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#delete
[2] opencontainers/runc#3888
[3] opencontainers/runc#3780
[4] cri-o/cri-o#7035

Signed-off-by: Kir Kolyshkin <[email protected]>
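For illustration, a rough Go equivalent of the reset-failed-on-destroy idea described above, using go-systemd's dbus bindings (a sketch of the approach, not crun's actual C code):

```go
package main

import (
	"context"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

// destroyUnit illustrates the fix: stop the unit, then call ResetFailedUnit
// so that a unit stuck in the "failed" state is collected by systemd and
// "delete" leaves no artefacts behind.
func destroyUnit(cm *systemdDbus.Conn, unitName string) error {
	ctx := context.Background()
	statusChan := make(chan string, 1)
	if _, err := cm.StopUnitContext(ctx, unitName, "replace", statusChan); err == nil {
		<-statusChan // wait for the stop job to finish
	}
	// Clear a possible "failed" state; real code would tolerate
	// "unit not loaded" style errors here.
	return cm.ResetFailedUnitContext(ctx, unitName)
}
```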
Description
While investigating issue #3760, I found the following.
Issue A: leftover failed scope unit.
The above is the first issue: the container has ended, but its scope stays. There is a repro for this (it works on CentOS Stream 9 only).

Update: this repro needs RHEL/CentOS 9 systemd version < 252.14, i.e. before they added redhat-plumbers/systemd-rhel9#149.
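Not the original repro, but for illustration, a Go sketch of checking for the leftover failed scope over D-Bus (using go-systemd; the unit name runc-test_busybox.scope matches the default naming used by the tests, see Issue C below):

```go
package main

import (
	"context"
	"fmt"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

func main() {
	ctx := context.Background()
	conn, err := systemdDbus.NewSystemConnectionContext(ctx)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// After "runc delete", the unit should be gone; on affected systems
	// it is still around with ActiveState "failed".
	props, err := conn.GetUnitPropertiesContext(ctx, "runc-test_busybox.scope")
	if err != nil {
		fmt.Println("unit is gone:", err) // the expected, fixed behavior
		return
	}
	fmt.Println("LoadState:", props["LoadState"], "ActiveState:", props["ActiveState"])
}
```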
Issue B: Ignoring "org.freedesktop.systemd1.UnitExists" from Apply / startUnit
The startUnit function calls org.freedesktop.systemd1.Manager.StartTransientUnit over dbus and waits for it to complete and return the correct result ("done"). Except, if the call fails with the "org.freedesktop.systemd1.UnitExists" error, that error is ignored and startUnit returns no error. The failure can happen because of Issue A above. In such a case, runc creates a container that is not placed into a proper systemd scope (and is not placed into a proper cgroup either).
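Roughly, the check that swallows this error looks like the following (a sketch assuming go-systemd/godbus; runc's actual helper may differ in detail):

```go
package main

import (
	"errors"

	godbus "github.com/godbus/dbus/v5"
)

// isUnitExists reports whether err is systemd's "unit already exists"
// D-Bus error: the error that startUnit currently ignores unconditionally.
func isUnitExists(err error) bool {
	var dbusErr godbus.Error
	return errors.As(err, &dbusErr) && dbusErr.Name == "org.freedesktop.systemd1.UnitExists"
}
```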
Issue C: failed integration tests on CentOS Stream 9

1. Most integration tests do not use set_cgroups_path, meaning that the cgroupsPath field in the spec is not set. This results in the default cgroup path which, when the systemd cgroup manager is used, is :runc:<container-name>.
2. Most integration tests use the same container name, test_busybox. Together with item 1 above, this means the tests repeatedly create a systemd scope named runc-test_busybox.scope.

This is why the centos-stream-9 CI is failing for the main branch (#3760). A simplified sketch of the default unit-name derivation follows.
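As an illustration of why the names collide, a simplified Go sketch of how the default path maps to a scope name (the real systemd driver also validates the path and handles .slice names; this shows only the happy path):

```go
package main

import (
	"fmt"
	"strings"
)

// unitName derives the systemd scope name from a "slice:prefix:name"
// cgroupsPath, defaulting an empty path to ":runc:<container-name>".
func unitName(cgroupsPath, containerName string) string {
	if cgroupsPath == "" {
		cgroupsPath = ":runc:" + containerName
	}
	parts := strings.SplitN(cgroupsPath, ":", 3)
	// parts[0] is the parent slice (system.slice when empty).
	return parts[1] + "-" + parts[2] + ".scope"
}

func main() {
	fmt.Println(unitName("", "test_busybox")) // runc-test_busybox.scope
}
```

With every test using the same container name, each run asks systemd for the same runc-test_busybox.scope, so a leftover failed unit from a previous test (Issue A) triggers the "unit already exists" path (Issue B).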
Steps to reproduce the issue
See above.
Describe the results you received and expected
What version of runc are you using?
[kir@kir-rhat runc]$ ./runc --version
runc version 1.1.0+dev
commit: v1.1.0-443-g7e630356
spec: 1.1.0-rc.1
go: go1.20.2
libseccomp: 2.5.3
Host OS information
No response
Host kernel information
No response