Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly setuid/setgid after entering userns #606

Merged
merged 1 commit into from
Mar 4, 2016

Conversation

estesp
Copy link
Contributor

@estesp estesp commented Mar 1, 2016

The re-work of namespace entering lost the setuid/setgid that was part
of the Go-routine based process exec in the prior code. A side issue was
found with setting oom_score_adj before execve() in a userns that is
also solved here.

Docker-DCO-1.1-Signed-off-by: Phil Estes [email protected] (github: estesp)

@estesp
Copy link
Contributor Author

estesp commented Mar 1, 2016

strace details on the "0555" perms:

stat("/home/estesp/runc-test/rootfs.uid2000/sys",  <unfinished ...>
<... stat resumed> {st_dev=makedev(252, 1), st_ino=11013277, st_mode=S_IFDIR|0775, 
                   st_nlink=2, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=4096,
                   st_atime=2016/03/01-12:57:26, st_mtime=2016/03/01-12:57:24, st_ctime=2016/03/01-12:57:24}) = 0
mount("sysfs", "/home/estesp/runc-test/rootfs.uid2000/sys", "sysfs", 
                   MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = 0
stat("/home/estesp/runc-test/rootfs.uid2000/sys/fs/cgroup", {st_dev=makedev(0, 74), st_ino=3, 
                    st_mode=S_IFDIR|0555, st_nlink=2, st_uid=65534, st_gid=65534, st_blksize=4096, 
                    st_blocks=0, st_size=0, st_atime=2016/03/01-13:14:58, st_mtime=2016/03/01-13:14:58, st_ctime=2016/03/01-13:14:58}) = 0
mount("tmpfs", "/home/estesp/runc-test/rootfs.uid2000/sys/fs/cgroup", "tmpfs", 
                       MS_NOSUID|MS_NODEV|MS_NOEXEC, "mode=755") = 0
fchmodat(AT_FDCWD, "/home/estesp/runc-test/rootfs.uid2000/sys/fs/cgroup", 0555)    = 0

The stat, mount, fchmodat relate to the code highlighted: https://github.com/opencontainers/runc/blob/master/libcontainer/rootfs_linux.go#L146-L160

@estesp
Copy link
Contributor Author

estesp commented Mar 1, 2016

Also, for further detail, this is git bisect to the first commit which exhibits mount/mkdir failure:

42d5d0480107a83300a59683a73b11d17538f3dc is the first bad commit
commit 42d5d0480107a83300a59683a73b11d17538f3dc
Author: Daniel, Dao Quang Minh <[email protected]>
Date:   Mon Sep 14 00:40:43 2015 +0000

    Sets custom namespaces for init processes

@mrunalp
Copy link
Contributor

mrunalp commented Mar 1, 2016

Interesting. I wonder if it is related to the umask which is somehow different now.. (just a theory).

@estesp
Copy link
Contributor Author

estesp commented Mar 1, 2016

@mrunalp: @crosbymichael thought something similar (and so did I initially) but I can't find anything different about umask before/after this commit.

What does seem different in a full strace comparison is the way Go process startup used to handle setuid/setgid after setting up the mappings. In fact, in testing after a local vendoring-in to Docker, I now have to explicitly set up the user string to "0:0" to not end up defaulted as nobody (65534:65534) with no perms. Using runc I don't have this problem as 0:0 already happens in the spec -> libcontainer.Process setup, but it does seem to reveal something different is happening with uid/gid given the modified flow of clone/child/parent actions happening in C in the new code?

UPDATE: However.. as a counterpoint, sprinking setuid()/setgid() into the C code where I think it might need to be changes nothing about the weird 0555 stat response.

@codido
Copy link
Contributor

codido commented Mar 2, 2016

@estesp Are you sure the mode is the root cause here? Isn't 0555 actually valid?

$ unshare -rfmn
$ mount -t sysfs none /sys
$ ls -ld /sys/fs/cgroup
dr-xr-xr-x. 2 nfsnobody nfsnobody 0 Mar  1 20:40 /sys/fs/cgroup

However, I did notice that prior to the recent changes, setuid(0) was called before setting up rootfs, and this is no longer the case. As a quick hack, moving setgid/setuid before setupRootfs() seems to fix the issue.

@estesp
Copy link
Contributor Author

estesp commented Mar 2, 2016

@codido you are correct that 0555 is definitely valid, the only problem is that it prevents sub-mounts from being bound underneath that path--and my wild guess is that it has to do with not being perceived as being "really root" like it was prior to the changes.

Where did you put the setuid/setgid to resolve the issue? I tried several things yesterday without any luck.

@codido
Copy link
Contributor

codido commented Mar 2, 2016

@estesp added that just before calling setupRootfs(). It's not a valid patch, just a hack to verify that this is indeed the culprit.

@estesp
Copy link
Contributor Author

estesp commented Mar 2, 2016

@codido understand, but was curious because I tried the same but have no change in behavior. Just tried again to confirm--still no change. :(

@codido
Copy link
Contributor

codido commented Mar 2, 2016

@estesp Here's the hack I used to test this:

diff --git a/libcontainer/standard_init_linux.go b/libcontainer/standard_init_linux.go
index 935b2ee..fbfb886 100644
--- a/libcontainer/standard_init_linux.go
+++ b/libcontainer/standard_init_linux.go
@@ -82,6 +82,12 @@ func (l *linuxStandardInit) Init() error {
        label.Init()
        // InitializeMountNamespace() can be executed only for a new mount namespace
        if l.config.Config.Namespaces.Contains(configs.NEWNS) {
+               if err := system.Setgid(0); err != nil {
+                       return err
+               }
+               if err := system.Setuid(0); err != nil {
+                       return err
+               }
                if err := setupRootfs(l.config.Config, console, l.pipe); err != nil {
                        return err
                }

Before applying it, and using a more or less default spec with userns, I got this:

Timestamp: 2016-03-02 10:05:05.730733987 -0500 EST
Code: System error

Message: mkdir /data/fs/matrix/fs/sys/fs/cgroup/systemd: permission denied

Frames:
---
0: setupRootfs
Package: github.com/opencontainers/runc/libcontainer
File: rootfs_linux.go@43
---
1: Init
Package: github.com/opencontainers/runc/libcontainer.(*linuxStandardInit)
File: standard_init_linux.go@85
---
2: StartInitialization
Package: github.com/opencontainers/runc/libcontainer.(*LinuxFactory)
File: factory_linux.go@239
---
3: 1
Package: main.init
File: start.go@84
---
4: init
Package: main
File: utils.go@350
---
5: main
Package: runtime
File: proc.go@100
---
6: goexit
Package: runtime
File: asm_amd64.s@1721
WARN[0000] exit status 1                                
FATA[0000] Container start failed: [10] System error: could not synchronise with container process 

With this hack applied the container started successfully.

@estesp
Copy link
Contributor Author

estesp commented Mar 2, 2016

Ok, that is quite interesting that using Golang pkg/syscall versus libcontainer's system package has different behavior. The system package is calling RawSyscall under the covers; will have to see what Golang pkg/syscall does in their Set{u,g}id implementation that's different.

UPDATE: Lesson learned on not checking errors even with a quick hack; from pkg/syscall/syscall_linux.go:

// On linux Setuid and Setgid only affects the current thread, not the process.
// This does not match what most callers expect so we must return an error
// here rather than letting the caller think that the call succeeded.

@estesp
Copy link
Contributor Author

estesp commented Mar 2, 2016

I guess in the end, root cause is this fact:

cur uid: 65534, gid: 65534

We are not root in the user-namespaced process by the time we are running the Init() back in the Go standard container initialization code. This is not the case before #454 and seems to be corrected later in setupUser() code which actually uses the config.User value to lookup (if necessary) the uid/gid pair, and then use system.Set{u,g}id to set the current thread to that pair before the execve of the container command.

@mrunalp
Copy link
Contributor

mrunalp commented Mar 2, 2016

@estesp Previously the setuid/setgid was done after mappings were written in go standard library. We could make the calls in the C code after setsid in nsexec.c

@mrunalp
Copy link
Contributor

mrunalp commented Mar 2, 2016

@estesp Also, they are safe to make as we always require a mapping for 0.

@estesp
Copy link
Contributor Author

estesp commented Mar 2, 2016

@mrunalp I think I'm only slightly losing my mind :)

  • If I put setuid/setgid in nsexec.c all the rest of the code flow works except writing to /proc/self/oom_score_adj. If I skip this and echo {somevalue} > /proc/self/oom_score_adj after starting a shell, it works fine!
  • If I put system.Setuid/system.Setgid in libcontainer/standard_init_linux.go everything works.

I have no idea why.

@mrunalp
Copy link
Contributor

mrunalp commented Mar 2, 2016

@estesp I noticed that as well but haven't root caused it :)

@codido
Copy link
Contributor

codido commented Mar 2, 2016

@estesp @mrunalp Perhaps that's because at this point in time /proc is still the original host one?

@mrunalp
Copy link
Contributor

mrunalp commented Mar 3, 2016

@codido Even if we move setuid/setgid to beginning to Init (before the oomscoreadj) we see the same issue. What is a bit puzzling here is that previously also we were root at this point of time in the code.

@mrunalp
Copy link
Contributor

mrunalp commented Mar 3, 2016

@codido Also, the new proc mount and pivot root only happen in setupRootfs so that shouldn't be an issue as it was the same earlier, too.

@mrunalp
Copy link
Contributor

mrunalp commented Mar 3, 2016

I tried moving setOomScoreAdj after setupRootfs, but still see the same error.

@codido
Copy link
Contributor

codido commented Mar 3, 2016

@mrunalp Right. It appears that a process that calls setuid() cannot access it's own oom_score_adj until it calls re-executes. Not sure if that's intended though:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static void open_oom_score_adj(void)
{
    int fd;

    fd = open("/proc/self/oom_score_adj", O_WRONLY);
    if (fd < 0)
        perror("Open failed");
    else
        printf("Open returned %d\n", fd);

    close(fd);
}

int main(int argc, char *argv[])
{
    int fd;

    if (getuid() == 1000) {
        printf("After execve\n");
        open_oom_score_adj();
        return 0;
    }

    if (setuid(1000) < 0) {
        perror("setuid");
        return 1;
    }

    printf("Before execve\n");
    open_oom_score_adj();

    execv(argv[0], argv);

    return 0;
}

@mrunalp
Copy link
Contributor

mrunalp commented Mar 3, 2016

Hmm, we could try adjusting it from the parent on procReady.

Sent from my iPhone

On Mar 2, 2016, at 4:49 PM, Ido Yariv [email protected] wrote:

@mrunalp Right. It appears that a process that calls setuid() cannot access it's own oom_score_adj until it calls re-executes. Not sure if that's intended though:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static void open_oom_score_adj(void)
{
int fd;

fd = open("/proc/self/oom_score_adj", O_WRONLY);
if (fd < 0)
    perror("Open failed");
else
    printf("Open returned %d\n", fd);

close(fd);

}

int main(int argc, char *argv[])
{
int fd;

if (getuid() == 1000) {
    printf("After execve\n");
    open_oom_score_adj();
    return 0;
}

if (setuid(1000) < 0) {
    perror("setuid");
    return 1;
}

printf("Before execve\n");
open_oom_score_adj();

execv(argv[0], argv);

return 0;

}

Reply to this email directly or view it on GitHub.

@estesp estesp force-pushed the mount-perm-userns branch from f36b31b to ce5be72 Compare March 3, 2016 14:37
@estesp estesp changed the title Retain owner execute perm on cgroup mount Properly setuid/setgid after entering userns Mar 3, 2016
@estesp estesp force-pushed the mount-perm-userns branch from ce5be72 to 820abf1 Compare March 3, 2016 14:39
@estesp
Copy link
Contributor Author

estesp commented Mar 3, 2016

@mrunalp this has been reworked after your idea about setting oom_score_adj in the parent. PTAL

@crosbymichael hopefully this is the more appropriate fix so that I can vendor libcontainer into Docker and get shared namespaces.

@mrunalp
Copy link
Contributor

mrunalp commented Mar 3, 2016

@estesp Nice! I will test it out once before acking.

@mlaventure
Copy link
Contributor

In that case it may work by moving setting the oom score in start_child within the nsenter package before writing the sync byte, wdyt?

@mrunalp
Copy link
Contributor

mrunalp commented Mar 3, 2016

Yes another option is to do it in C code but will have to pass additional config that we can avoid by setting it from parent.

Sent from my iPhone

On Mar 3, 2016, at 8:12 AM, Kenfe-Mickaël Laventure [email protected] wrote:

In that case it may work by moving setting the oom score in start_child within the nsenter package before writing the sync byte, wdyt?


Reply to this email directly or view it on GitHub.

@estesp estesp force-pushed the mount-perm-userns branch from 820abf1 to 50a4667 Compare March 3, 2016 17:00
@estesp
Copy link
Contributor Author

estesp commented Mar 3, 2016

@mrunalp @mlaventure fixed properly now. PTAL. I even tested :)

@mrunalp
Copy link
Contributor

mrunalp commented Mar 3, 2016

@estesp Thanks! LGTM.

@mlaventure
Copy link
Contributor

nice, looks good!

The re-work of namespace entering lost the setuid/setgid that was part
of the Go-routine based process exec in the prior code. A side issue was
found with setting oom_score_adj before execve() in a userns that is
also solved here.

Docker-DCO-1.1-Signed-off-by: Phil Estes <[email protected]> (github: estesp)
@estesp estesp force-pushed the mount-perm-userns branch from 50a4667 to 178bad5 Compare March 4, 2016 16:12
@estesp
Copy link
Contributor Author

estesp commented Mar 4, 2016

rebased. @crosbymichael: I'd like to vendor this into Docker as well as another commit from earlier soon; PTAL when you have a chance

@LK4D4
Copy link
Contributor

LK4D4 commented Mar 4, 2016

btw according to http://www.openwall.com/lists/kernel-hardening/2016/01/20/14 you need to set uid and gid BEFORE you're joining namespace :)
But this LGTM

LK4D4 added a commit that referenced this pull request Mar 4, 2016
Properly setuid/setgid after entering userns
@LK4D4 LK4D4 merged commit ad2c520 into opencontainers:master Mar 4, 2016
@estesp
Copy link
Contributor Author

estesp commented Mar 4, 2016

@LK4D4 it's on my todo list :)

@estesp estesp deleted the mount-perm-userns branch March 4, 2016 16:40
@mrunalp
Copy link
Contributor

mrunalp commented Mar 4, 2016

Yeah we could do that now we are not bound to the go standard library :)

Sent from my iPhone

On Mar 4, 2016, at 8:38 AM, Alexander Morozov [email protected] wrote:

btw according to http://www.openwall.com/lists/kernel-hardening/2016/01/20/14 you need to set uid and gid BEFORE you're joining namespace :)
But this LGTM


Reply to this email directly or view it on GitHub.

@mrunalp
Copy link
Contributor

mrunalp commented Mar 4, 2016

Also if we make runc truly unprivileged for userns then we don't need to even that. I have a branch with a poc.

Sent from my iPhone

On Mar 4, 2016, at 8:44 AM, Mrunal Patel [email protected] wrote:

Yeah we could do that now we are not bound to the go standard library :)

Sent from my iPhone

On Mar 4, 2016, at 8:38 AM, Alexander Morozov [email protected] wrote:

btw according to http://www.openwall.com/lists/kernel-hardening/2016/01/20/14 you need to set uid and gid BEFORE you're joining namespace :)
But this LGTM


Reply to this email directly or view it on GitHub.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants