Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A container can join namespaces of another container #105

Closed
wants to merge 8 commits into from

Conversation

dqminh
Copy link
Contributor

@dqminh dqminh commented Jul 9, 2015

Original PR and discussion: docker-archive/libcontainer#609

Enable moby/moby#13453 and moby/moby#10163

This supports setting existing namespaces for a container's init process. It leverages the same C code infrastructure for execin to clone a new init process with the wanted namespaces and cloneflags. Now all namespaces related operation is performed in nsenter C layer, with data sent from Go layer using a simple binary format.

@dqminh
Copy link
Contributor Author

dqminh commented Jul 9, 2015

@vishvananda This is related to #101
IIRC joining another user namespace still doesn't work yet for init process. I didn't have time to look into the errors further. There are also some discussion related to that in docker-archive/libcontainer#609

namespaces[i]);
fds[i] = open(ns, O_RDONLY);
if (fds[i] == -1) {
savedErr = errno;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can call pr_perror and then close descritores. In this case you will not need to save errno in a temporary variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, i think we also discussed before about this but i forgot to move the patch over. Since we exit anyway, it's ok to not having to close the fds IIRC, right ?

@mrunalp
Copy link
Contributor

mrunalp commented Jul 9, 2015

Tests fixed now. They were failing because of #109

@@ -779,3 +817,40 @@ func (c *linuxContainer) currentState() (*State, error) {
}
return state, nil
}

// orderNamespacePaths sorts that namespace paths into a list of paths that we
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/that/the

@vishvananda
Copy link

@LK4D4 @mrunalp Updated code to remove cloneflags and refactor process: dqminh#1

@dqminh dqminh force-pushed the libcontainer-pidns branch 2 times, most recently from 09636e3 to 5359301 Compare July 12, 2015 22:31
@dqminh dqminh force-pushed the libcontainer-pidns branch from 667fb58 to 652a368 Compare July 13, 2015 05:27
@dqminh
Copy link
Contributor Author

dqminh commented Jul 13, 2015

@avagin added your suggestion on setns all custom namespaces first, then clone new namespaces after . This is currently done only when we have custom user namespace for now ( we probably can also refactor further to unify both custom and default i.e., also write uid/gid mapping in C instead of Go ). Can you take a look if the current approach looks right ?

@vishvananda can you help testing this patch with your custom network namespace ? It's different from your approach but i think it's simpler.

// otherwise do it inside nsexec by passing the clone flags because we dont
// have to perform any additional setup when start a new process.
if c.config.Namespaces.PathOf(configs.NEWUSER) == "" {
cmd.SysProcAttr.Cloneflags = cloneFlags
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need to create a new user namespace and attach to a precreated pid namespace, you need to enter into the pid namespace and only then create a user namespace, so the user namespace should be created from nsexec.c, should not it?

@vishvananda
Copy link

It is confusing to be using cloneflags in the go process for new userns which puts it at the beginning of the list, but then setns into it at the end of the list if you specify a path. I suggest pulling in my patches that explicitly set the uid_map and gid_map in our code instead of relying on the go exec to do it. Alternatively, we could always fork and set the uid_map and gid_map in the c code, using a pipe to sync between the parent and child as is suggested here: https://lwn.net/Articles/532593/

@dqminh
Copy link
Contributor Author

dqminh commented Aug 24, 2015

Finally having internet at home, so i can have some cycle to work on this and other issues again 🐳

Given that we want to create a new process with clone flags in C always (meant that we cant depend on uid/gid_map file creation in Go, i think it's much more convenient to pass all the data we need into C layer, do work there and pass back to Go when we are finally done. I'm thinking of:

  • passing a json data structure from Go to C, which includes all values we currently specified as environment variables, and more data such as the content of uid_map/gid_map file, list of namespace paths to be joined, setgroups etc. Just enough for the C layer to join namespaces, then create new process with correct clone flags, then write uid_map/gid_map/setgroups if necessary. Then it returns back to Go layer. This allows us more flexibility and not having to pass control signal back and forth many times.
  • Linking to libyajl to process JSON. if you have a better alternative, please let me know

Thoughts @mrunalp @avagin @vishvananda ?

@avagin
Copy link
Contributor

avagin commented Aug 25, 2015

I am not sure that we need json here, maybe a handmade binary format would be enough. I think it will be obvious when you show code.

@dqminh
Copy link
Contributor Author

dqminh commented Aug 25, 2015

@avagin hmm yes a binary format also works. JSON is just an implementation detail since it's easy to generate from Go, and easy enough to parse from C with a library. Not requiring external dependency is a big win though.

@mrunalp
Copy link
Contributor

mrunalp commented Aug 25, 2015

+1 Yeah, it will be good if we can avoid a dependency.

estesp added a commit to estesp/docker that referenced this pull request Aug 26, 2015
For now a hack for the 1.6.1-userns branch (modifying vendored
code) to enable user namespace join on exec.  This will be
more thorougly corrected (for other use cases) in a PR for
opencontainers/runc#105.

Docker-DCO-1.1-Signed-off-by: Phil Estes <[email protected]>
@dqminh dqminh force-pushed the libcontainer-pidns branch from 652a368 to 2142fbc Compare September 1, 2015 00:36
@dqminh
Copy link
Contributor Author

dqminh commented Sep 14, 2015

This depends on #43 now for new netlink library.

@avagin updated with changes to use a netlink message to send data from Go to C layer now, also includes some fixes from last round of review. PTAL.

This adds `configs.IsNamespaceSupported(nsType)` to check if the host supports
a namespace type.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
This adds orderNamespacePaths to get correct order of namespaces for the
bootstrap program to join.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.

This moves all setns and cloneflags related code to nsenter layer, which mean
that we dont use Go os/exec to create process with cloneflags and set
uid/gid_map or setgroups anymore. The necessary data is passed from Go to C
using a netlink binary-encoding format.

With this change, setns and init processes are almost the same, which brings
some opportunity for refactoring.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
we dont have a need for this method, since all namespaces are joined/created in
nsenter

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
setsid is called in nsenter, so we dont have to call it a second time.

Signed-off-by: Daniel, Dao Quang Minh <[email protected]>
@dqminh
Copy link
Contributor Author

dqminh commented Sep 14, 2015

rebased against master.

@crosbymichael
Copy link
Member

@dqminh are you still interested in this PR?

Right now it is kinda hard to review. @LK4D4 and I were look at this today and maybe we can get this merged quicker if we split up the PR.

Maybe start by a PR adding netlink for communication instead of json over pipe.
Then do one with the refactoring needed, etc...

What do you think?

@dqminh
Copy link
Contributor Author

dqminh commented Oct 14, 2015

@crosbymichael @LK4D4 yes, i would still like to finish it. I think i can spend some more time on this in the next few days. I think the behavior should be more or less finalized pending the netlink changes. I need to take a look at the netlink structure again

Not sure about replacing json over pipe with netlink. We are still going to do the json over pipe to send config over. Netlink messages is just to send the necessary data structure to C-land so it can setup the namespace and clone properly.

Maybe it's possible to split the netlink part into a separate PR so we can replace environment variable usage first, and then change the way we setup namespaces later. I will see how much cleaner this is.

@LK4D4
Copy link
Contributor

LK4D4 commented Oct 14, 2015

@dqminh but json config will be passed further to Golang, right?
Replacing env vars with netlink first sounds good to me. Maybe makes sense to start split C code to different files too.

@dqminh
Copy link
Contributor Author

dqminh commented Oct 14, 2015

@LK4D4 yes, the same pipe is used to pass netlink data to C-land first, and then later on json data to Go-land.

@rootfs
Copy link

rootfs commented Dec 2, 2015

+1
what's the status/plan for this?

@dqminh
Copy link
Contributor Author

dqminh commented Dec 3, 2015

@rootfs we are trying to get the main refactoring reviewed at #340 then this change will be much simpler here.

@ianlewis
Copy link

ianlewis commented Feb 2, 2016

It would be awesome to get PID namespace sharing into Docker sometime soon. It looks like #340 was merged. Anything else holding this up?

@estesp
Copy link
Contributor

estesp commented Mar 1, 2016

I think this can be closed now that #454 is merged?

@crosbymichael
Copy link
Member

Yes, thanks

@ashahab-altiscale
Copy link
Contributor

Have you tested it for joining userns of another container?
On Feb 29, 2016 5:13 PM, "Phil Estes" [email protected] wrote:

I think this can be closed now that #454
#454 is merged?


Reply to this email directly or view it on GitHub
#105 (comment).

@estesp
Copy link
Contributor

estesp commented Mar 1, 2016

@ashahab-altiscale yes, it works correctly. Working on the changes/PR to update Docker's vendor-in to do more testing, but changes to systemd/cgroup mounts on the libcontainer side are breaking userns at the moment..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.