Weird issue with User Namespace and mounting volumes. #1229

Open
rhatdan opened this issue Dec 15, 2016 · 14 comments

Comments

@rhatdan
Contributor

rhatdan commented Dec 15, 2016

docker run --rm -ti -v /sys/fs/cgroup:/sys/fs/cgroup:ro alpine sh
/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:359: container init caused \\\"rootfs_linux.go:54: mounting \\\\\\\"/sys/fs/cgroup\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/100000.100000/overlay/f550ec60ca139a8a332e2d7d61d44c8a95c55c066a4414d8ad0d5b2780fa79d4/merged\\\\\\\" at \\\\\\\"/var/lib/docker/100000.100000/overlay/f550ec60ca139a8a332e2d7d61d44c8a95c55c066a4414d8ad0d5b2780fa79d4/merged/sys/fs/cgroup\\\\\\\" caused \\\\\\\"operation not permitted\\\\\\\"\\\"\"\n".
@rhatdan
Contributor Author

rhatdan commented Dec 15, 2016

If you mount the volume without the :ro, it works.

docker run --rm -ti -v /sys/fs/cgroup:/sys/fs/cgroup alpine sh
/ # 

@rhatdan
Contributor Author

rhatdan commented Dec 15, 2016

I have not isolated this to runc, but I am pretty sure the problem is there.

Also, cgroupfs is the only filesystem we have found this issue with: mounting /sys/fs/selinux:ro works, and turning off user namespaces works.

It only fails for cgroupfs with user namespaces.

@rhatdan
Contributor Author

rhatdan commented Dec 15, 2016

I have gotten this to happen within runc.
cgroup.zip

This example attempts to mount /sys/fs/cgroup on /mnt readonly using runc.

It fails with user namespaces. I believe this is a kernel issue. We fixed a similar issue
with SELinux. The user namespace developers have gone into the kernel and are blocking some
mount options when they are requested from a user namespace, to prevent exploits. The
SELinux patch was blocking context="LABEL" mounts. Since setting a mount point as "ro"
changes the security attributes of the mount, I am betting the kernel is blocking it.

We are now looking into the kernel.

@rhatdan
Contributor Author

rhatdan commented Dec 15, 2016

@rhvgoyal PTAL

@cyphar
Member

cyphar commented Dec 15, 2016

@rhatdan This sounds like an issue in rootfs_linux.go (if it is an issue in runC that is). Because of the restrictions of mounting cgroupfs inside user namespaces (unless you have CAP_SYS_ADMIN in the pinned userns of your current cgroupns) we have to do a bunch of tricks to make /sys/fs/cgroup mounting work (basically we substitute it for a bindmount).

And when you set ro on a cgroupfs mount, IIRC we don't just set the bindmount to that, we actually remount it later -- maybe that's where the label issue is coming in?

@rhatdan
Contributor Author

rhatdan commented Dec 16, 2016

I don't believe it is runc. I have it failing if I do

-v /sys/fs/cgroup:/mnt:ro

From runc's point of view this is just a bind mount, but the kernel treats it as changing the mount attributes of a cgroupfs mount.

30118 mount("/sys/fs/cgroup/systemd", "/mnt/rootfs/mnt", 0xc4200cfa40, MS_RDONLY|MS_REMOUNT|MS_BIND|MS_REC, NULL) = -1 EPERM (Operation not permitted)
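To make the flag word in the strace line easier to read, here is a minimal sketch that decodes mount(2) flags and models a read-only bind mount as two steps (bind, then remount read-only). The flag values come from `<linux/mount.h>`; the helper names `decode_mount_flags` and `readonly_bind_mount_steps` are hypothetical, and the two-step model is a simplification, not runc's actual code.

```python
# Mount flag values as defined in <linux/mount.h>.
MS_RDONLY = 1
MS_NOSUID = 2
MS_NODEV = 4
MS_NOEXEC = 8
MS_REMOUNT = 32
MS_BIND = 4096
MS_REC = 16384

# Ordered by bit value, which is also the order strace prints them in.
_FLAG_NAMES = [
    (MS_RDONLY, "MS_RDONLY"),
    (MS_NOSUID, "MS_NOSUID"),
    (MS_NODEV, "MS_NODEV"),
    (MS_NOEXEC, "MS_NOEXEC"),
    (MS_REMOUNT, "MS_REMOUNT"),
    (MS_BIND, "MS_BIND"),
    (MS_REC, "MS_REC"),
]


def decode_mount_flags(flags):
    """Render a mount(2) flag word as a '|'-joined list of names."""
    names = [name for bit, name in _FLAG_NAMES if flags & bit]
    known = 0
    for bit, _ in _FLAG_NAMES:
        known |= bit
    leftover = flags & ~known
    if leftover:
        # Bits this sketch does not know about are shown in hex.
        names.append(hex(leftover))
    return "|".join(names) if names else "0"


def readonly_bind_mount_steps(extra_flags=0):
    """Return (bind_flags, remount_flags) for a read-only recursive bind mount.

    Hypothetical helper: step 1 creates the bind mount, step 2 remounts it
    read-only. extra_flags lets the caller carry flags such as MS_NOSUID
    into the remount.
    """
    bind_flags = MS_BIND | MS_REC
    remount_flags = bind_flags | MS_REMOUNT | MS_RDONLY | extra_flags
    return bind_flags, remount_flags
```

Calling `decode_mount_flags(readonly_bind_mount_steps()[1])` gives the same flag string as the failing call above. It is the second step, the remount, that returns EPERM.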

@cyphar
Member

cyphar commented Dec 16, 2016

@rhatdan Can you try with #1222 applied? @justincormack has proposed changing the code that handles our read-only remounting.

@rhatdan
Contributor Author

rhatdan commented Dec 16, 2016

Still fails with the patch.

strace -f -o /tmp/out ./runc run -b /mnt foobar | grep mount
container_linux.go:247: starting container process caused "process_linux.go:387: container init caused \"rootfs_linux.go:57: mounting \\\"/sys/fs/cgroup/systemd\\\" to rootfs \\\"/mnt/rootfs\\\" at \\\"/mnt/rootfs/mnt\\\" caused \\\"operation not permitted\\\"\""
06 <... select resumed> )            = 0 (Timeout)
18106 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=20} <unfinished ...>
18105 <... mount resumed> )             = 0
18106 <... select resumed> )            = 0 (Timeout)
18105 mount("", "/mnt/rootfs/mnt", 0xc4200c72ac, MS_REC|MS_PRIVATE, NULL <unfinished ...>
18106 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=20} <unfinished ...>
18105 <... mount resumed> )             = 0
18105 mount("/sys/fs/cgroup/systemd", "/mnt/rootfs/mnt", 0xc4200c7300, MS_RDONLY|MS_REMOUNT|MS_BIND|MS_REC, NULL <unfinished ...>
18106 <... select resumed> )            = 0 (Timeout)
18105 <... mount resumed> )             = -1 EPERM (Operation not permitted)

@rhatdan
Contributor Author

rhatdan commented Dec 16, 2016

uname -r
4.10.0-0.rc0.git2.1.1.secnext.fc26.x86_64

@rhatdan
Contributor Author

rhatdan commented Dec 16, 2016

OK, @rhvgoyal reproduced this with user namespaces and mounting cgroups outside of runc, so it is definitely a kernel issue.

@cyphar
Member

cyphar commented Dec 16, 2016

Alright, keep us up to date on what happens. 😸

@rhatdan
Contributor Author

rhatdan commented Mar 12, 2017

@rhvgoyal Isn't this fixed in the kernel now?

@adelton

adelton commented May 10, 2018

I still see things failing with kernel-4.16.6-202.fc27.x86_64 and kernel-4.16.7-300.fc28.x86_64, noted in https://bugzilla.redhat.com/show_bug.cgi?id=1401944#c13.

@alban
Contributor

alban commented Sep 8, 2020

I can reproduce this behaviour on Linux 5.6.11-200.fc31.x86_64 with runc-master (09ddc63) when the bind-mount source is under /tmp, but not when it is under /var.

Error message:

ERRO[0000] container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:60: mounting "/tmp/volume1/dir" to rootfs at "/home/alban/go/src/github.com/opencontainers/runc/container-userns-bind-mount-ro/rootfs/mnt/ro" caused: operation not permitted 

Strace shows a bind mount followed by a remount:

[pid 3881982] mount("/tmp/volume1/dir", "/home/alban/go/src/github.com/opencontainers/runc/container-userns-bind-mount-ro/rootfs/mnt/ro", 0xc0000a7999, MS_RDONLY|MS_BIND|MS_REC, NULL) = 0
[pid 3881982] mount("/tmp/volume1/dir", "/home/alban/go/src/github.com/opencontainers/runc/container-userns-bind-mount-ro/rootfs/mnt/ro", 0xc0000a79b0, MS_RDONLY|MS_REMOUNT|MS_BIND|MS_REC, NULL) = -1 EPERM (Operation not permitted)

The difference between /tmp and /var on Fedora is the mount flags: /tmp is mounted with nosuid,nodev. If I add those flags to the mount options in config.json, then /tmp works too. Note that /sys/fs/cgroup is mounted with nosuid,nodev too, which explains the originally reported behaviour.
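One way to check which of these flags apply to a bind-mount source is to read /proc/self/mountinfo and look at the options of the mount that contains the path. The sketch below is a rough illustration: the function names and the `SAMPLE` text are hypothetical, it only looks at nosuid, nodev and noexec, and it ignores the octal escapes mountinfo uses for special characters in paths.

```python
RESTRICTIVE = ("nosuid", "nodev", "noexec")


def parse_mountinfo(text):
    """Return a list of (mount_point, options) tuples from mountinfo text.

    In /proc/<pid>/mountinfo the fifth field is the mount point and the
    sixth field is the comma-separated per-mount option list.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        entries.append((fields[4], fields[5].split(",")))
    return entries


def options_for_path(path, entries):
    """Return the options of the deepest mount whose mount point contains path."""
    best = None
    for mount_point, options in entries:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, options)
    return best[1] if best else []


def flags_to_preserve(path, entries):
    """Return the restrictive options set on the mount that contains path."""
    return [opt for opt in options_for_path(path, entries) if opt in RESTRICTIVE]


# Hypothetical mountinfo excerpt: /tmp is a separate tmpfs with nosuid,nodev,
# while /var lives on the root filesystem.
SAMPLE = """\
22 1 253:0 / / rw,relatime shared:1 - ext4 /dev/mapper/root rw
40 22 0:35 / /tmp rw,nosuid,nodev shared:18 - tmpfs tmpfs rw
"""
```

On a real system, `parse_mountinfo(open("/proc/self/mountinfo").read())` would replace the sample text. With the sample, `/tmp/volume1/dir` reports nosuid and nodev while `/var/mytmp/dir` reports nothing, which matches the difference described above.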

See checks in the kernel https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2482-L2488

	if ((fl & MNT_LOCK_NODEV) &&
	    !(mnt_flags & MNT_NODEV))
		return false;

	if ((fl & MNT_LOCK_NOSUID) &&
	    !(mnt_flags & MNT_NOSUID))
		return false;

The /tmp bind mount to the container is marked as MNT_LOCK_NODEV and MNT_LOCK_NOSUID because of user namespaces. See https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2158-L2160

		/* Notice when we are propagating across user namespaces */
		if (child->mnt_parent->mnt_ns->user_ns != user_ns)
			lock_mnt_tree(child);
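The interaction of the two kernel excerpts can be modelled in a few lines. This is a loose sketch, not the kernel's logic: flags are plain strings instead of bit masks, only the two flags quoted above are modelled, and `lock_flags` is a hypothetical stand-in for what `lock_mnt_tree()` does to a mount that crosses into another user namespace.

```python
# Only the two flags shown in the excerpts are modelled here.
LOCKABLE = {"nodev", "nosuid"}


def lock_flags(mount_flags):
    """Model of the locking step: flags already set on the mount become
    locked once the mount is propagated across user namespaces."""
    return set(mount_flags) & LOCKABLE


def can_change_locked_flags(locked, requested):
    """Model of the remount check: every locked flag must still be present
    in the flags requested by the remount, otherwise it is refused."""
    return all(flag in requested for flag in locked)


# /tmp is mounted nosuid,nodev, so both flags end up locked.
locked = lock_flags({"nosuid", "nodev"})

# A remount that asks only for "ro" would clear them: refused (EPERM).
print(can_change_locked_flags(locked, {"bind", "ro"}))

# A remount that repeats the locked flags is allowed.
print(can_change_locked_flags(locked, {"bind", "ro", "nosuid", "nodev"}))
```

The first call models the failing remount in the strace output; the second models the same remount with nosuid,nodev added to the mount options.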

So I think this is not a bug and users should add nosuid,nodev in config.json when needed.
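As an illustration of that workaround, the sketch below adds nosuid,nodev to read-only bind mounts in a parsed config.json. The function name is hypothetical, and it blindly patches every read-only bind mount rather than checking whether the source actually needs it, so treat it as a starting point only.

```python
import copy


def add_options_to_ro_binds(config, extra=("nosuid", "nodev")):
    """Return a copy of an OCI config dict in which every read-only bind
    mount also carries the options in extra. The input is not modified."""
    patched = copy.deepcopy(config)
    for mount in patched.get("mounts", []):
        options = mount.get("options", [])
        is_bind = "bind" in options or "rbind" in options
        if is_bind and "ro" in options:
            # Append only the options that are not already present.
            mount["options"] = options + [o for o in extra if o not in options]
    return patched


# The mount entry from the reproduction patch, plus one unrelated mount.
config = {
    "mounts": [
        {
            "destination": "/mnt/ro",
            "type": "none",
            "source": "/tmp/volume1/dir",
            "options": ["rbind", "ro"],
        },
        {"destination": "/proc", "type": "proc", "source": "proc"},
    ]
}
patched = add_options_to_ro_binds(config)
print(patched["mounts"][0]["options"])
```

To apply it to a real bundle, load config.json with `json.load`, pass the result through the function, and write it back with `json.dump`.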

Steps to reproduce:

git checkout master
make
mkdir container-userns-bind-mount-ro
cd container-userns-bind-mount-ro
mkdir rootfs
docker export $(docker create busybox) | tar -C rootfs -xvf -
../runc spec
cat config.json | jq . > config.json.vanilla
cp config.json.vanilla config.json
vim config.json
diff -u config.json.vanilla config.json > config.json.patch
(
cat <<EOF
--- config.json.vanilla	2020-09-08 11:29:22.144161978 +0200
+++ config.json	2020-09-08 11:40:18.120572861 +0200
@@ -7,7 +7,7 @@
       "gid": 0
     },
     "args": [
-      "sh"
+      "/bin/sh", "-c", "ls -la /mnt/*"
     ],
     "env": [
       "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
@@ -129,6 +129,15 @@
         "relatime",
         "ro"
       ]
+    },
+    {
+      "destination": "/mnt/ro",
+      "type": "none",
+      "source": "/tmp/volume1/dir",
+      "options": [
+        "rbind",
+        "ro"
+      ]
     }
   ],
   "linux": {
@@ -155,6 +164,23 @@
       },
       {
         "type": "mount"
+      },
+      {
+        "type": "user"
+      }
+    ],
+    "uidMappings": [
+      {
+        "containerID": 0,
+        "hostID": 1000,
+        "size": 32000
+      }
+    ],
+    "gidMappings": [
+      {
+        "containerID": 0,
+        "hostID": 1000,
+        "size": 32000
       }
     ],
     "maskedPaths": [
EOF
) > config.json.patch
cat config.json.patch | patch config.json
sudo mkdir -p /var/mytmp/dir
sudo mkdir -p /tmp/volume1/dir
sudo ../runc run mycontainer1
