Fix clone(2) with double fork #217

Merged
merged 5 commits into from
Aug 24, 2021
38 changes: 23 additions & 15 deletions docs/doc-draft.md
@@ -23,24 +23,26 @@ This is diagram as given in #14, which is not actually how this works, but helpf
sequenceDiagram
participant U as User
participant D as Docker
participant YP as Youki(Parent Process)
participant YI as Youki(Init Process)
participant Y_Main as Youki(Main Process)
participant Y_Intermediate as Youki(Intermediate Process)
participant Y_Init as Youki(Init Process)


U ->> D : $ docker run --rm -it --runtime youki $image
D ->> YP : youki create $container_id
YP ->> YI : clone(2) to create new process and new namespaces
YI ->> YP : set user id mapping if entering into usernamespaces
YI ->> YI : configure resource limits, mount the devices, and etc.
YI -->> YP : ready message (Unix domain socket)
YP ->> : set cgroup configuration for YI
YP ->> D : exit $code
D ->> YP : $ youki start $container_id
YP -->> YI : start message (Unix domain socket)
YI ->> YI : run the commands in dockerfile
D ->> Y_Main : youki create $container_id
Y_Main ->> Y_Intermediate : fork(2) to create the intermediate process, which enters the user and pid namespaces
Y_Intermediate ->> Y_Main : request user id mapping if entering a new user namespace
Y_Intermediate ->> Y_Init: fork(2) to create the container init process
Y_Init ->> Y_Init : configure resource limits, mount the devices, enter the rest of the namespaces, etc.
Y_Init ->> Y_Intermediate : ready message (Unix domain socket)
Y_Intermediate ->> Y_Main : ready message (Unix domain socket)
Y_Main ->> Y_Main: set cgroup configuration for Y_Init
Y_Main ->> D : exit $code
D ->> Y_Main : $ youki start $container_id
Y_Main -->> Y_Init : start message through notify listener (Unix domain socket)
Y_Init ->> Y_Init : run the commands in dockerfile, using `execv`
D ->> D : monitor pid written in pid file
D ->> U : exit $code

```
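
The message flow in the new diagram can be modelled without any real processes. The following is a minimal sketch that uses threads and `std::sync::mpsc` channels in place of fork(2) and Unix domain sockets. The `Message` enum, the `simulate_create` function, and the placeholder pid `42` are all hypothetical and exist only for illustration; they are not part of youki.

```rust
use std::sync::mpsc;
use std::thread;

/// Messages exchanged in this model. The variant names are illustrative only.
#[derive(Debug, PartialEq)]
enum Message {
    MappingRequest,
    MappingWritten,
    InitReady(u32), // carries a placeholder init pid
}

/// Runs the create handshake and returns the messages the "main" side
/// received, in order. Threads stand in for the forked processes.
fn simulate_create(rootless: bool) -> Vec<Message> {
    // Two uni-directional channels, mirroring child_to_parent / parent_to_child.
    let (child_to_parent, main_rx) = mpsc::channel::<Message>();
    let (parent_to_child, intermediate_rx) = mpsc::channel::<Message>();

    let intermediate = thread::spawn(move || {
        if rootless {
            // Ask main to write the uid/gid mappings, then block until done.
            child_to_parent.send(Message::MappingRequest).unwrap();
            let reply = intermediate_rx.recv().unwrap();
            assert_eq!(reply, Message::MappingWritten);
        }

        // Stand-in for the second fork: the init "process".
        let (init_tx, init_rx) = mpsc::channel::<Message>();
        let init = thread::spawn(move || {
            init_tx.send(Message::InitReady(42)).unwrap();
        });

        // Forward init's ready message up to main.
        let ready = init_rx.recv().unwrap();
        child_to_parent.send(ready).unwrap();
        init.join().unwrap();
    });

    let mut seen = Vec::new();
    if rootless {
        seen.push(main_rx.recv().unwrap());
        parent_to_child.send(Message::MappingWritten).unwrap();
    }
    seen.push(main_rx.recv().unwrap());
    intermediate.join().unwrap();
    seen
}
```

Calling `simulate_create(true)` exercises the rootless path, where the mapping request arrives before the ready message; `simulate_create(false)` skips the mapping step entirely.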

---
@@ -59,7 +61,7 @@ One thing to note is that in the end, container is just another process in Linux

When given the create command, Youki loads the specification, configuration, sockets, etc., uses the clone syscall to create the container process (init process), and applies the limits, namespaces, etc. to the cloned container process. The container process waits on a Unix domain socket before it execs into the command/program.

The main youki process will set up pipes used to communicate and synchronize with the init process. The init process will notify the youki process that it is ready and start to wait on a Unix domain socket. The youki process will then write the container state and exit.
The main youki process will set up pipes used to communicate and synchronize with the intermediate and init processes. The init process will notify the intermediate process, which in turn notifies the main youki process, that it is ready, and then start to wait on a Unix domain socket. The youki process will then write the container state and exit.

- [mio Token definition](https://docs.rs/mio/0.7.11/mio/struct.Token.html)
- [oom-score-adj](https://dev.to/rrampage/surviving-the-linux-oom-killer-2ki9)
@@ -69,9 +71,15 @@ The main youki process will set up pipes used to communicate and synchronize with

### Process

This handles creation of the container process. The main youki process creates the container process (init process) using clone syscall. The main youki process will set up pipes used as message passing and synchronization mechanism with the init process. Youki uses clone instead of fork to create the container process. Using clone, Youki can directly pass the namespace creation flag to the syscall. Otherwise, if using fork, Youki would need to fork two processes, the first to enter into usernamespace, and a second time to enter into pid namespace correctly.
This handles creation of the container process. The main youki process creates the intermediate process and the intermediate process creates the container process (init process). The hierarchy is: `main youki process -> intermediate process -> init process`

The main youki process will set up pipes used as a message passing and synchronization mechanism with the init process. Youki needs to fork two processes instead of one because of nuances in the user and pid namespaces. In a rootless container, we need to enter the user namespace first, since creating any of the other namespaces requires CAP_SYS_ADMIN. When calling unshare or setns for a pid namespace, only the children of the calling process enter the new pid namespace; the caller itself stays in its original one. As a result, we must first fork a process to enter the user namespace, call unshare or setns for the pid namespace, and then fork again so that the child lands in the correct pid namespace.

Note: clone(2) lets us enter new user and pid namespaces while creating only one process. However, clone(2) can only create a new pid namespace; it cannot enter an existing one. Therefore, to enter an existing pid namespace, we need to fork twice. Currently, there is no way around this limitation.

- [fork(2) man page](https://man7.org/linux/man-pages/man2/fork.2.html)
- [clone(2) man page](https://man7.org/linux/man-pages/man2/clone.2.html)
- [pid namespace man page](https://man7.org/linux/man-pages/man7/pid_namespaces.7.html)
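
To make the split between the two forks concrete, here is a minimal sketch of how a requested set of namespaces could be divided between the intermediate process (user and pid, with user first) and the init process (everything else). `NamespaceKind`, `Plan`, and `plan_namespaces` are hypothetical names chosen for this illustration and do not appear in youki; the sketch only models the ordering described above and makes no syscalls.

```rust
/// Illustrative stand-in for the namespace types listed in the OCI spec.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NamespaceKind {
    User,
    Pid,
    Uts,
    Cgroup,
    Ipc,
    Network,
    Mount,
}

/// Which namespaces each process would enter in this model.
#[derive(Debug, PartialEq, Eq)]
struct Plan {
    intermediate: Vec<NamespaceKind>,
    init: Vec<NamespaceKind>,
}

fn plan_namespaces(requested: &[NamespaceKind]) -> Plan {
    let mut plan = Plan {
        intermediate: Vec::new(),
        init: Vec::new(),
    };

    // The user namespace comes first so that the remaining namespaces can be
    // created without host privileges; pid follows so that the next fork
    // lands inside it.
    for first in [NamespaceKind::User, NamespaceKind::Pid] {
        if requested.contains(&first) {
            plan.intermediate.push(first);
        }
    }

    // Everything else is left for the init process, in the order requested.
    for ns in requested {
        if !matches!(*ns, NamespaceKind::User | NamespaceKind::Pid) {
            plan.init.push(*ns);
        }
    }

    plan
}
```

For a request of mount, pid, user, and network, this model assigns user and pid (in that order) to the intermediate process and mount and network to the init process.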

### Container

110 changes: 76 additions & 34 deletions src/container/builder_impl.rs
@@ -1,19 +1,17 @@
use anyhow::{Context, Result};
use nix::sched::CloneFlags;

use cgroups;

use oci_spec::Spec;
use std::{fs, os::unix::prelude::RawFd, path::PathBuf};

use crate::{
hooks,
namespaces::Namespaces,
process::{child, fork, init, parent},
process::{channel, fork, init},
rootless::Rootless,
syscall::linux::LinuxSyscall,
utils,
};
use anyhow::{Context, Result};
use cgroups;
use nix::unistd::Pid;
use oci_spec::Spec;
use std::path::Path;
use std::process::Command;
use std::{fs, os::unix::prelude::RawFd, path::PathBuf};

use super::{Container, ContainerStatus};

@@ -59,16 +57,16 @@ impl<'a> ContainerBuilderImpl<'a> {
let cgroups_path = utils::get_cgroup_path(&linux.cgroups_path, &self.container_id);
let cmanager = cgroups::common::create_cgroup_manager(&cgroups_path, self.use_systemd)?;

// create the parent and child process structure so the parent and child process can sync with each other
let (mut parent, parent_channel) = parent::ParentProcess::new(&self.rootless)?;
let child = child::ChildProcess::new(parent_channel)?;

if self.init {
if let Some(hooks) = self.spec.hooks.as_ref() {
hooks::run_hooks(hooks.create_runtime.as_ref(), self.container.as_ref())?
}
}

// We use a set of channels to communicate between parent and child process. Each channel is uni-directional.
let parent_to_child = &mut channel::Channel::new()?;
let child_to_parent = &mut channel::Channel::new()?;

// This init_args will be passed to the container init process,
// therefore we will have to move all the variable by value. Since self
// is a shared reference, we have to clone these variables here.
@@ -82,30 +80,25 @@ impl<'a> ContainerBuilderImpl<'a> {
notify_path: self.notify_path.clone(),
preserve_fds: self.preserve_fds,
container: self.container.clone(),
child,
};
let intermediate_pid = fork::container_fork(|| {
init::container_intermidiate(init_args, parent_to_child, child_to_parent)
})?;
// If creating a rootless container, the intermediate process will ask
// the main process to set up uid and gid mapping, once the intermediate
// process enters into a new user namespace.
if self.rootless.is_some() {
child_to_parent.wait_for_mapping_request()?;
log::debug!("write mapping for pid {:?}", intermediate_pid);
utils::write_file(format!("/proc/{}/setgroups", intermediate_pid), "deny")?;
write_uid_mapping(intermediate_pid, self.rootless.as_ref())?;
write_gid_mapping(intermediate_pid, self.rootless.as_ref())?;
parent_to_child.send_mapping_written()?;
}

// We have to box up this closure to correctly pass to the init function
// of the new process.
let cb = Box::new(move || {
if let Err(error) = init::container_init(init_args) {
log::debug!("failed to run container_init: {:?}", error);
return -1;
}

0
});

let clone_flags = linux
.namespaces
.as_ref()
.map(|ns| Namespaces::from(ns).clone_flags)
.unwrap_or_else(CloneFlags::empty);
let init_pid = fork::clone(cb, clone_flags)?;
let init_pid = child_to_parent.wait_for_child_ready()?;
log::debug!("init pid is {:?}", init_pid);

parent.wait_for_child_ready(init_pid)?;

cmanager.add_task(init_pid)?;
if self.rootless.is_none() && linux.resources.is_some() && self.init {
cmanager.apply(linux.resources.as_ref().unwrap())?;
@@ -128,3 +121,52 @@ impl<'a> ContainerBuilderImpl<'a> {
Ok(())
}
}

fn write_uid_mapping(target_pid: Pid, rootless: Option<&Rootless>) -> Result<()> {
if let Some(rootless) = rootless {
if let Some(uid_mappings) = rootless.uid_mappings {
return write_id_mapping(
&format!("/proc/{}/uid_map", target_pid),
uid_mappings,
rootless.newuidmap.as_deref(),
);
}
}

Ok(())
}

fn write_gid_mapping(target_pid: Pid, rootless: Option<&Rootless>) -> Result<()> {
if let Some(rootless) = rootless {
if let Some(gid_mappings) = rootless.gid_mappings {
return write_id_mapping(
&format!("/proc/{}/gid_map", target_pid),
gid_mappings,
rootless.newgidmap.as_deref(),
);
}
}

Ok(())
}

fn write_id_mapping(
map_file: &str,
mappings: &[oci_spec::LinuxIdMapping],
map_binary: Option<&Path>,
) -> Result<()> {
let mappings: Vec<String> = mappings
.iter()
.map(|m| format!("{} {} {}", m.container_id, m.host_id, m.size))
.collect();
if mappings.len() == 1 {
utils::write_file(map_file, mappings.first().unwrap())?;
} else {
Command::new(map_binary.unwrap())
.args(mappings)
.output()
.with_context(|| format!("failed to execute {:?}", map_binary))?;
}

Ok(())
}
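
The mapping strings built in `write_id_mapping` follow the kernel's `uid_map`/`gid_map` line format: the ID inside the namespace, the ID outside it, and the length of the range. The following is a minimal sketch of just that formatting step. `IdMapping` is a hypothetical stand-in for `oci_spec::LinuxIdMapping` that mirrors the three fields used in the diff, and `format_id_map` is an illustrative helper that does not exist in youki.

```rust
/// Hypothetical stand-in for oci_spec::LinuxIdMapping, with the same
/// three fields the diff reads.
struct IdMapping {
    container_id: u32,
    host_id: u32,
    size: u32,
}

/// Renders mappings as "container_id host_id size", one per line.
fn format_id_map(mappings: &[IdMapping]) -> String {
    mappings
        .iter()
        .map(|m| format!("{} {} {}", m.container_id, m.host_id, m.size))
        .collect::<Vec<_>>()
        .join("\n")
}
```

A single mapping such as container ID 0 to host ID 1000 with size 1 renders as `0 1000 1`, which is the string the diff writes to the map file when there is exactly one mapping.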
125 changes: 59 additions & 66 deletions src/namespaces.rs
@@ -8,86 +8,85 @@
//! Cgroup (Resource limits, execution priority etc.)

use crate::syscall::{syscall::create_syscall, Syscall};
use anyhow::Result;
use nix::{
fcntl,
sched::{self, CloneFlags},
sys::stat,
unistd::{self, Gid, Uid},
};
use oci_spec::LinuxNamespace;
use anyhow::{Context, Result};
use nix::{fcntl, sched::CloneFlags, sys::stat, unistd};
use oci_spec::{LinuxNamespace, LinuxNamespaceType};
use std::collections;

/// Holds information about namespaces
pub struct Namespaces<'a> {
spaces: &'a Vec<LinuxNamespace>,
pub struct Namespaces {
command: Box<dyn Syscall>,
pub clone_flags: CloneFlags,
namespace_map: collections::HashMap<CloneFlags, LinuxNamespace>,
}

impl<'a> From<&'a Vec<LinuxNamespace>> for Namespaces<'a> {
fn from(namespaces: &'a Vec<LinuxNamespace>) -> Self {
let clone_flags = namespaces.iter().filter(|ns| ns.path.is_none()).fold(
CloneFlags::empty(),
|mut cf, ns| {
cf |= CloneFlags::from_bits_truncate(ns.typ as i32);
cf
},
);
fn get_clone_flag(namespace_type: LinuxNamespaceType) -> CloneFlags {
match namespace_type {
LinuxNamespaceType::Pid => CloneFlags::CLONE_NEWPID,
LinuxNamespaceType::User => CloneFlags::CLONE_NEWUSER,
LinuxNamespaceType::Uts => CloneFlags::CLONE_NEWUTS,
LinuxNamespaceType::Cgroup => CloneFlags::CLONE_NEWCGROUP,
LinuxNamespaceType::Ipc => CloneFlags::CLONE_NEWIPC,
LinuxNamespaceType::Network => CloneFlags::CLONE_NEWNET,
LinuxNamespaceType::Mount => CloneFlags::CLONE_NEWNS,
}
}

impl From<Option<&Vec<LinuxNamespace>>> for Namespaces {
fn from(namespaces: Option<&Vec<LinuxNamespace>>) -> Self {
let command: Box<dyn Syscall> = create_syscall();
let namespace_map: collections::HashMap<CloneFlags, LinuxNamespace> = namespaces
.unwrap_or(&vec![])
.iter()
.map(|ns| (get_clone_flag(ns.typ), ns.clone()))
.collect();

Namespaces {
spaces: namespaces,
command,
clone_flags,
namespace_map,
}
}
}

impl<'a> Namespaces<'a> {
pub fn apply_setns(&self) -> Result<()> {
let to_enter: Vec<(CloneFlags, i32)> = self
.spaces
impl Namespaces {
pub fn apply_namespaces<F: Fn(CloneFlags) -> bool>(&self, filter: F) -> Result<()> {
let to_enter: collections::HashMap<&CloneFlags, &LinuxNamespace> = self
.namespace_map
.iter()
.filter(|ns| ns.path.is_some()) // filter those which are actually present on the system
.map(|ns| {
let space = CloneFlags::from_bits_truncate(ns.typ as i32);
let fd = fcntl::open(
&*ns.path.as_ref().unwrap(),
fcntl::OFlag::empty(),
stat::Mode::empty(),
)
.unwrap();
(space, fd)
})
.filter(|(k, _)| filter(**k))
.collect();

for &(space, fd) in &to_enter {
// set the namespace
self.command.set_ns(fd, space)?;
unistd::close(fd)?;
// if namespace is cloned with newuser flag, then it creates a new user namespace,
// and we need to set the user and group id to 0
// see https://man7.org/linux/man-pages/man2/clone.2.html for more info
if space == sched::CloneFlags::CLONE_NEWUSER {
self.command.set_id(Uid::from_raw(0), Gid::from_raw(0))?;
}
for (ns_type, ns) in to_enter {
self.unshare_or_setns(ns)
.with_context(|| format!("Failed to enter {:?} namespace: {:?}", ns_type, ns))?;
}
Ok(())
}

/// disassociate given parts context of calling process from other process
// see https://man7.org/linux/man-pages/man2/unshare.2.html for more info
pub fn apply_unshare(&self, without: CloneFlags) -> Result<()> {
self.command.unshare(self.clone_flags & !without)?;
pub fn unshare_or_setns(&self, namespace: &LinuxNamespace) -> Result<()> {
if namespace.path.is_none() {
self.command.unshare(get_clone_flag(namespace.typ))?;
} else {
let ns_path = namespace.path.as_ref().unwrap();
let fd = fcntl::open(ns_path, fcntl::OFlag::empty(), stat::Mode::empty())
.with_context(|| format!("Failed to open namespace fd: {:?}", ns_path))?;
self.command
.set_ns(fd, get_clone_flag(namespace.typ))
.with_context(|| "Failed to set namespace")?;
unistd::close(fd).with_context(|| "Failed to close namespace fd")?;
}

Ok(())
}

pub fn get(&self, k: LinuxNamespaceType) -> Option<&LinuxNamespace> {
self.namespace_map.get(&get_clone_flag(k))
}
}

#[cfg(test)]
mod tests {
use oci_spec::LinuxNamespaceType;

use super::*;
use crate::syscall::test::TestHelperSyscall;
use oci_spec::LinuxNamespaceType;

fn gen_sample_linux_namespaces() -> Vec<LinuxNamespace> {
vec![
@@ -115,11 +114,13 @@ mod tests {
}

#[test]
fn test_namespaces_set_ns() {
fn test_apply_namespaces() {
let sample_linux_namespaces = gen_sample_linux_namespaces();
let namespaces = Namespaces::from(&sample_linux_namespaces);
let namespaces = Namespaces::from(Some(&sample_linux_namespaces));
let test_command: &TestHelperSyscall = namespaces.command.as_any().downcast_ref().unwrap();
assert!(namespaces.apply_setns().is_ok());
assert!(namespaces
.apply_namespaces(|ns_type| { ns_type != CloneFlags::CLONE_NEWIPC })
.is_ok());

let mut setns_args: Vec<_> = test_command
.get_setns_args()
@@ -130,18 +131,10 @@ mod tests {
let mut expect = vec![CloneFlags::CLONE_NEWNS, CloneFlags::CLONE_NEWNET];
expect.sort();
assert_eq!(setns_args, expect);
}

#[test]
fn test_namespaces_unshare() {
let sample_linux_namespaces = gen_sample_linux_namespaces();
let namespaces = Namespaces::from(&sample_linux_namespaces);
assert!(namespaces.apply_unshare(CloneFlags::CLONE_NEWIPC).is_ok());

let test_command: &TestHelperSyscall = namespaces.command.as_any().downcast_ref().unwrap();
let mut unshare_args = test_command.get_unshare_args();
unshare_args.sort();
let mut expect = vec![CloneFlags::CLONE_NEWUSER | CloneFlags::CLONE_NEWPID];
let mut expect = vec![CloneFlags::CLONE_NEWUSER, CloneFlags::CLONE_NEWPID];
expect.sort();
assert_eq!(unshare_args, expect)
}