Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New tests for user namespaces and groups issue #2483

Closed

Conversation

alban
Copy link
Contributor

@alban alban commented Jun 22, 2020

This test illustrate an issue when trying to use runc with user namespaces in Kubernetes.

$ make integration TESTPATH="/userns.bats" 
1..3
ok 1 userns without mount
ok 2 userns with simple mount
not ok 3 userns with difficult mount
# (in test file tests/integration/userns.bats, line 47)
#   `[ "$status" -eq 0 ]' failed
# runc list (status=0):
# ID          PID         STATUS      BUNDLE      CREATED     OWNER
# runc spec (status=0):
# 
# runc run test_userns_with_difficult_mount (status=1):
# time="2020-06-22T13:48:26Z" level=warning msg="exit status 1"
# time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:58: mounting \"/tmp/busyboxtest/source-inaccessible/dir\" to rootfs at \"/tmp/inaccessible\" caused: stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"
# runc list (status=0):
# ID          PID         STATUS      BUNDLE      CREATED     OWNER
make: *** [Makefile:86: localintegration] Error 1
make: *** [Makefile:79: integration] Error 2

runc needs to bind mount files from /var/lib/kubelet/pods/... (such as etc-hosts) into the container. When using user namespaces, the bind mount didn't work anymore when runc is started from a systemd unit.

The workaround is to start the systemd unit with SupplementaryGroups=0.

runc needs to have permission on the directory to stat() the source of the bind mount. Without user namespaces, this is not a problem since runc is running as root, so it has 'rwx' permissions over the directory:

drwxr-x---. 8 root   root   4096 May 28 18:05 /var/lib/kubelet

Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount phase happens before giving up the additional permissions.

However, when using user namespaces, the runc process is belonging to a different user than root (depending on the mapping). /var/lib/kubelet is seen as belonging to the special unmapped user (65534, nobody). runc does not have 'rwx' permissions anymore but the empty '---' permission
for 'other'.

CAP_DAC_OVERRIDE is also no effective because the kernel performs the capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE). This checks that the owner of the /var/lib/kubelet is mapped in the current user namespace, which is not the case.

Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts was working when runc was started manually with 'sudo' but not working when started from a systemd unit. The difference is how supplementary groups are handled between sudo and systemd units: systemd does not set supplementary groups by default.

$ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:	0

$ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Running as unit: run-u296886.service
Press ^] three times within 1s to disconnect TTY.
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:

When runc has the supplementary group 0 configured, it is retained during the bind-mount phase, even though it is an unmapped group (runc temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc effectively has the 'r-x' permissions over /var/lib/kubelet. This makes
the bind mount of etc-hosts work.

After the mount phase, runc will set the credential correctly (following OCI's config.json specification), so the container will not retain this unmapped supplementary group.

It is difficult to set up supplementary groups from Golang code automatically with syscall.Setgroups() because "at the kernel level, user IDs and group IDs are a per-thread attribute" (man setgroups) and
the way Golang uses threads make it difficult to predict which thread is going to be used to execute runc. glibc's setgroup() is a wrapper that changes the credentials for all threads but Golang does not use the glibc implementation.

@mauriciovasquezbernal @rata

This test illustrate an issue when trying to use runc with user
namespaces in Kubernetes.

runc needs to bind mount files from /var/lib/kubelet/pods/... (such as
etc-hosts) into the container. When using user namespaces, the bind
mount didn't work anymore when runc is started from a systemd unit.

The workaround is to start the systemd unit with SupplementaryGroups=0.

runc needs to have permission on the directory to stat() the source of
the bind mount. Without user namespaces, this is not a problem since
runc is running as root, so it has 'rwx' permissions over the directory:

drwxr-x---. 8 root   root   4096 May 28 18:05 /var/lib/kubelet

Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount
phase happens before giving up the additional permissions.

However, when using user namespaces, the runc process is belonging to a
different user than root (depending on the mapping). /var/lib/kubelet is
seen as belonging to the special unmapped user (65534, nobody). runc
does not have 'rwx' permissions anymore but the empty '---' permission
for 'other'.

CAP_DAC_OVERRIDE is also no effective because the kernel performs the
capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE).
This checks that the owner of the /var/lib/kubelet is mapped in the
current user namespace, which is not the case.

Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts was
working when runc was started manually with 'sudo' but not working
when started from a systemd unit. The difference is how supplementary
groups are handled between sudo and systemd units: systemd does not set
supplementary groups by default.

$ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:	0

$ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Running as unit: run-u296886.service
Press ^] three times within 1s to disconnect TTY.
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:

When runc has the supplementary group 0 configured, it is retained
during the bind-mount phase, even though it is an unmapped group (runc
temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc
effectively has the 'r-x' permissions over /var/lib/kubelet. This makes
the bind mount of etc-hosts work.

After the mount phase, runc will set the credential correctly (following
OCI's config.json specification), so the container will not retain this
unmapped supplementary group.

It is difficult to set up supplementary groups from Golang code
automatically with syscall.Setgroups() because "at the kernel level,
user IDs and group IDs are a per-thread attribute" (man setgroups) and
the way Golang uses threads make it difficult to predict which thread is
going to be used to execute runc. glibc's setgroup() is a wrapper that
changes the credentials for all threads but Golang does not use the
glibc implementation.

Signed-off-by: Alban Crequy <[email protected]>
@AkihiroSuda
Copy link
Member

CI failing

@rata
Copy link
Member

rata commented Jul 8, 2020

@AkihiroSuda Hi! I'm working with @alban and @mauriciovasquezbernal.

Yes, sorry for the confusion. This PR is not intended to be merged as is, but as a reproducer of an issue in an easy way to test in runc. This is the associated issue: #2484.

It will be nice to merge this only after the issue is fixed (i.e. the tests added here pass). Let us know how it is best to proceed, or feel free to close the PR (it is stll linked from the issue and code available, so enough as a reproducer) or what you think is best.

Thanks,
Rodrigo

@rata
Copy link
Member

rata commented Aug 12, 2021

@alban tests in this PR have been carried over to #2576 that fixes the issue and adds the tests afterwards. Can you please close this PR?

@alban alban closed this Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants