runc systemd cgroup driver logic is wrong #3780

Closed
kolyshkin opened this issue Mar 23, 2023 · 5 comments · Fixed by #3782


kolyshkin commented Mar 23, 2023

Description

While investigating issue #3760 I found the following.

Issue A: leftover failed scope unit.

  1. By default, systemd removes transient units (e.g. scopes) once they have ended (are no longer active), except those that have failed. Failed units are not removed; they stay around.
  2. RHEL 9's systemd, for some reason (I guess a RHEL 9-specific patch I can't find yet), works differently than it does in older CentOS or in Fedora 37 (or earlier Fedora versions, I guess). The difference is that if the container is OOM-killed (which happens in the "events oom" test), its scope state becomes "failed".

The above is the first issue: the container has ended, but its scope stays. Here is a repro (works on CentOS Stream 9 only):

Update: this repro needs RHEL/CentOS 9 systemd version < 252.14, i.e. before they've added redhat-plumbers/systemd-rhel9#149.

[root@localhost runc]# systemctl status runc-test_busybox.scope
Unit runc-test_busybox.scope could not be found.
[root@localhost runc]# bats tests/integration/events.bats 
1..5
ok 1 events --stats
ok 2 events --interval default
ok 3 events --interval 1s
ok 4 events --interval 100ms
ok 5 events oom
[root@localhost runc]# systemctl status runc-test_busybox.scope
Warning: The unit file, source configuration file or drop-ins of runc-test_busybox.scope changed on disk. Run 'systemctl daemon->
× runc-test_busybox.scope - libcontainer container test_busybox
     Loaded: loaded (/run/systemd/transient/runc-test_busybox.scope; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/runc-test_busybox.scope.d
             └─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-MemoryMax.conf
     Active: failed (Result: oom-kill) since Thu 2023-03-23 01:58:22 UTC; 2s ago
   Duration: 5.729s
        CPU: 710ms

Mar 23 01:58:16 localhost systemd[1]: Started libcontainer container test_busybox.
Mar 23 01:58:22 localhost systemd[1]: runc-test_busybox.scope: A process of this unit has been killed by the OOM killer.
Mar 23 01:58:22 localhost systemd[1]: runc-test_busybox.scope: Killing process 248957 (sh) with signal SIGKILL.
Mar 23 01:58:22 localhost systemd[1]: runc-test_busybox.scope: Killing process 249001 (sh) with signal SIGKILL.
Mar 23 01:58:22 localhost systemd[1]: runc-test_busybox.scope: Killing process 249007 (dd) with signal SIGKILL.
Mar 23 01:58:22 localhost systemd[1]: runc-test_busybox.scope: Failed with result 'oom-kill'.
[root@localhost runc]# ./runc --version
runc version 1.1.0+dev
commit: efad7a3b
spec: 1.1.0-rc.1
go: go1.19
libseccomp: 2.5.2

Issue B: Ignoring "org.freedesktop.systemd1.UnitExists" from Apply / startUnit

  1. When the runc systemd cgroup manager is used, it calls org.freedesktop.systemd1.Manager.StartTransientUnit over dbus and waits for it to complete and return the expected result ("done"). However, if the call fails with an "org.freedesktop.systemd1.UnitExists" error, that error is ignored and startUnit returns no error. Such a failure can happen because of Issue A above. In that case, runc creates a container that is not placed into the proper systemd scope (and not into the proper cgroup either). See the sketch below.
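
For illustration, here is a minimal sketch (not runc's actual code) of that flow, using the github.com/coreos/go-systemd/v22/dbus and github.com/godbus/dbus/v5 packages; the unit name, PID, and properties are made up:

```go
package main

import (
	"context"
	"fmt"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

// startUnit asks systemd to create a transient unit and waits for the job
// to report "done". Note how a "UnitExists" failure is silently swallowed;
// this is the behaviour described in Issue B.
func startUnit(conn *systemdDbus.Conn, unitName string, props []systemdDbus.Property) error {
	statusCh := make(chan string, 1)
	if _, err := conn.StartTransientUnitContext(context.TODO(), unitName, "replace", props, statusCh); err != nil {
		if dbusErr, ok := err.(godbus.Error); ok && dbusErr.Name == "org.freedesktop.systemd1.UnitExists" {
			// Ignored: the caller never learns that the process was NOT
			// placed into the (already existing, possibly failed) unit.
			return nil
		}
		return err
	}
	if s := <-statusCh; s != "done" {
		return fmt.Errorf("error creating systemd unit %s: got %q", unitName, s)
	}
	return nil
}

func main() {
	conn, err := systemdDbus.NewWithContext(context.TODO())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	props := []systemdDbus.Property{
		systemdDbus.PropDescription("libcontainer container test_busybox"),
		systemdDbus.PropPids(1234), // PID of the container's init (made up here)
	}
	if err := startUnit(conn, "runc-test_busybox.scope", props); err != nil {
		fmt.Println("startUnit failed:", err)
	}
}
```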

Issue C: failed integration tests on CentOS Stream 9

  1. Many runc integration tests do not use set_cgroups_path, meaning that the cgroupsPath field in the spec is not set. This results in the default cgroup path, which, when the systemd cgroup manager is used, is :runc:<container-name> (see the sketch after this list).
  2. Many runc integration tests use the same container name, test_busybox. Together with item 1 above, this means that the tests repeatedly create a systemd scope named runc-test_busybox.scope.
  3. All of the above, plus Issue A and Issue B, results in some integration tests failing, as reported in "centos-stream-9 CI is failing for the main branch" #3760.
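
To show how the "slice:prefix:name" form of cgroupsPath turns into a scope name, here is a simplified, hypothetical helper (runc's real code handles more cases, e.g. a name that already ends in ".slice" or ".scope"):

```go
package main

import (
	"fmt"
	"strings"
)

// unitNameFromCgroupsPath parses a "slice:prefix:name" cgroupsPath and
// returns the parent slice and the derived scope unit name.
func unitNameFromCgroupsPath(cgroupsPath string) (slice, unit string) {
	parts := strings.SplitN(cgroupsPath, ":", 3)
	slice, prefix, name := parts[0], parts[1], parts[2]
	if slice == "" {
		slice = "system.slice" // systemd's default parent for the scope
	}
	unit = prefix + "-" + name + ".scope"
	return slice, unit
}

func main() {
	// The default used by integration tests that skip set_cgroups_path:
	slice, unit := unitNameFromCgroupsPath(":runc:test_busybox")
	fmt.Println(slice, unit) // system.slice runc-test_busybox.scope
}
```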

Steps to reproduce the issue

See above.

Describe the results you received and expected

What version of runc are you using?

[kir@kir-rhat runc]$ ./runc --version
runc version 1.1.0+dev
commit: v1.1.0-443-g7e630356
spec: 1.1.0-rc.1
go: go1.20.2
libseccomp: 2.5.3

Host OS information

No response

Host kernel information

No response


kolyshkin commented Mar 23, 2023

Thoughts on fixing this mess.

Issue C (int test failure on CentOS 9)

Now, Issue C can be mitigated by using random container names, or by setting random cgroup paths. But this would just make the CI green without fixing any real bugs, so we need to fix Issues A and B instead.

Issue B (ignoring "unit exists" error from startUnit).

I am not quite sure how to address Issue B, as I don't understand why "unit exists" is ignored. The code to do that was added in #1124, which says it is done for slices (I guess this is for when kubelet uses libcontainer to create slices). Side note: the logic introduced there was not working and was fixed a few times later; see e.g. #1683, #1754, #1772.

I think we should NOT ignore the "unit exists" error when there is a PID to be added, since in that case the PID ends up not being placed into the proper unit and cgroup. This would also retain backward compatibility and keep the use case from #1124 working.

Solving this without breaking kubelet requires a deeper look into how the latter uses libcontainer. Alternatively, we can fix it right away, wait for kubelet to break when it switches to runc 1.2.x, and figure out how to fix that then. A rough sketch of the proposed check is below.
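
The sketch below (hypothetical; the hasPid flag and function name are mine, not runc's) shows where the check would go, using the same packages as the sketch under Issue B:

```go
package sketch

import (
	"context"
	"fmt"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

// startUnit tolerates "unit exists" only when there is no PID to place into
// the unit (the "pre-create a slice" use case from #1124); otherwise the
// error is propagated, because the process would end up outside the intended
// unit and cgroup.
func startUnit(conn *systemdDbus.Conn, unitName string, props []systemdDbus.Property, hasPid bool) error {
	statusCh := make(chan string, 1)
	if _, err := conn.StartTransientUnitContext(context.TODO(), unitName, "replace", props, statusCh); err != nil {
		if dbusErr, ok := err.(godbus.Error); ok &&
			dbusErr.Name == "org.freedesktop.systemd1.UnitExists" && !hasPid {
			return nil // harmless for slices that were created ahead of time
		}
		return fmt.Errorf("starting unit %q: %w", unitName, err)
	}
	if s := <-statusCh; s != "done" {
		return fmt.Errorf("error creating systemd unit %s: got %q", unitName, s)
	}
	return nil
}
```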

Issue A (leaving a failed unit after the test)

This can be solved in two ways (both options are sketched after this list):

  1. By setting the "CollectMode: inactive-or-failed" property, either by default (i.e. in the runc code) or for those containers that are expected to fail (i.e. in the tests). Note that using this property requires systemd >= 236.

  2. By using the "reset-failed" operation either before "startUnit", or after it fails with "unit exists". In the latter case, we need to retry.
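
A hedged sketch of both options with the same go-systemd/godbus packages as above (the function names are illustrative, not runc's code):

```go
package sketch

import (
	"context"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

// Option 1: pass CollectMode=inactive-or-failed among the transient unit's
// properties, so systemd garbage-collects the unit even when it ends up in
// the "failed" state (requires systemd >= 236).
func collectModeProperty() systemdDbus.Property {
	return systemdDbus.Property{
		Name:  "CollectMode",
		Value: godbus.MakeVariant("inactive-or-failed"),
	}
}

// Option 2: explicitly reset a failed unit, either before StartTransientUnit
// or after it fails with "unit exists" (in which case the start has to be
// retried).
func resetFailed(conn *systemdDbus.Conn, unitName string) error {
	return conn.ResetFailedUnitContext(context.TODO(), unitName)
}
```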

@thaJeztah

Interesting write-up!

@AkihiroSuda @dmcgowan (without having looked closely); could any of this also apply to https://github.com/containerd/cgroups ?

kolyshkin added this to the 1.1.6 milestone Apr 3, 2023
@kolyshkin

Fixed in main branch; need to backport #3782 to release-1.1

@kolyshkin

> @AkihiroSuda @dmcgowan (without having looked closely); could any of this also apply to https://github.com/containerd/cgroups ?

It has the very same issue. Filed containerd/cgroups#279

@kolyshkin

Fixed in main by #3782 and in release-1.1 by #3806.

kolyshkin added a commit to kolyshkin/crun that referenced this issue Sep 6, 2023
According to the OCI runtime spec [1], the runtime's delete operation is
supposed to remove all of the container's artefacts.

In case the systemd cgroup driver is used, and the systemd unit has
failed (e.g. oom-killed), systemd won't remove the unit (that is,
unless the "CollectMode: inactive-or-failed" property is set).

Leaving a leftover failed unit is a violation of the runtime spec; in
addition, a leftover unit results in the inability to start a container
with the same systemd unit name (such an operation will fail with a
"unit already exists" error).

Call reset-failed from the systemd cgroup manager's destroy_cgroup, so
the failed unit will be removed (by systemd) after "crun delete".

This change is similar to the one in runc (see [2]). A (slightly
modified) test case from runc added by the above change was used to
check that the bug is fixed.

For the bigger picture, see [3] (Issue A) and [4].

To test manually, systemd >= 244 is needed. Create a container config
that runs "sleep 10" and has the following systemd annotations:

	org.systemd.property.RuntimeMaxUSec: "uint64 2000000"
	org.systemd.property.TimeoutStopUSec: "uint64 1000000"

Start a container using --systemd-cgroup option.

The container will be killed by systemd in 2 seconds, so its systemd
unit status will be "failed". Once it has failed, "systemctl status
$UNIT_NAME" should have an exit code of 3 (meaning "unit is not active").

Now, run "crun delete $CTID" and repeat "systemctl status $UNIT_NAME".
It should result in an exit code of 4 (meaning "no such unit").

[1] https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#delete
[2] opencontainers/runc#3888
[3] opencontainers/runc#3780
[4] cri-o/cri-o#7035

Signed-off-by: Kir Kolyshkin <[email protected]>