New tests for user namespaces and groups issue #2483
Closed
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This test illustrate an issue when trying to use runc with user namespaces in Kubernetes.
runc needs to bind mount files from /var/lib/kubelet/pods/... (such as etc-hosts) into the container. When using user namespaces, the bind mount didn't work anymore when runc is started from a systemd unit.
The workaround is to start the systemd unit with SupplementaryGroups=0.
runc needs to have permission on the directory to stat() the source of the bind mount. Without user namespaces, this is not a problem since runc is running as root, so it has 'rwx' permissions over the directory:
Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount phase happens before giving up the additional permissions.
However, when using user namespaces, the runc process is belonging to a different user than root (depending on the mapping). /var/lib/kubelet is seen as belonging to the special unmapped user (65534, nobody). runc does not have 'rwx' permissions anymore but the empty '---' permission
for 'other'.
CAP_DAC_OVERRIDE is also no effective because the kernel performs the capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE). This checks that the owner of the /var/lib/kubelet is mapped in the current user namespace, which is not the case.
Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts was working when runc was started manually with 'sudo' but not working when started from a systemd unit. The difference is how supplementary groups are handled between sudo and systemd units: systemd does not set supplementary groups by default.
When runc has the supplementary group 0 configured, it is retained during the bind-mount phase, even though it is an unmapped group (runc temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc effectively has the 'r-x' permissions over /var/lib/kubelet. This makes
the bind mount of etc-hosts work.
After the mount phase, runc will set the credential correctly (following OCI's config.json specification), so the container will not retain this unmapped supplementary group.
It is difficult to set up supplementary groups from Golang code automatically with syscall.Setgroups() because "at the kernel level, user IDs and group IDs are a per-thread attribute" (man setgroups) and
the way Golang uses threads make it difficult to predict which thread is going to be used to execute runc. glibc's setgroup() is a wrapper that changes the credentials for all threads but Golang does not use the glibc implementation.
@mauriciovasquezbernal @rata