runsc inside of default docker seccomp policy #4371

prattmic · 2020-09-28T18:45:28Z

@scanlime on Twitter is trying to run runsc inside a Docker container with the standard seccomp policy enabled. This is similar to rootless mode (#311), but a little bit more strict.

The immediate issue is that we exec into empty namespaces, which the profile does not allow. It is not clear if there would be more issues if that were resolved, though I didn't see any glaring issues comparing our seccomp filters to Docker's.
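One way to see this restriction from inside a container is to probe whether the current process may create new namespaces at all. The following is a minimal sketch, not what runsc itself does: it assumes Linux and a libc that exposes unshare(2), and the helper name can_unshare is hypothetical.

```python
import ctypes
import os

# Namespace flags from <linux/sched.h>.
CLONE_NEWNS = 0x00020000    # mount namespace
CLONE_NEWUSER = 0x10000000  # user namespace
CLONE_NEWPID = 0x20000000   # PID namespace
CLONE_NEWNET = 0x40000000   # network namespace


def can_unshare(flags: int) -> bool:
    """Return True if a forked child can unshare(2) the given namespaces.

    The call runs in a child process so the caller's own namespaces
    never change.
    """
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(flags) == 0:
                code = 0
        finally:
            # Always leave the child here, even if libc lookup failed.
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    for name, flag in [("mount", CLONE_NEWNS), ("user", CLONE_NEWUSER),
                       ("pid", CLONE_NEWPID), ("net", CLONE_NEWNET)]:
        print(f"{name}: {'allowed' if can_unshare(flag) else 'blocked'}")
```

In a container running under Docker's default seccomp profile and without CAP_SYS_ADMIN, each entry would be expected to report blocked; the exact result depends on the profile and kernel in use.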

It's also not clear if the defense-in-depth features we'd have to disable to make this work would make it a bad idea. But in general, it is very reasonable to want to run a sandbox as a subprocess in an existing container.

cc @fvoznika @nlacasse

ghost · 2020-09-28T19:04:43Z

Hi, prattmic invited me here to explain a bit further. Thanks for opening the issue. It's still in the experimental stages, but I'm working on a media server that will be doing transcodes in a locked-down environment, and I've been investigating which approaches I can adopt without adding a portability burden. For wide use I would really want the resulting project to run in a Docker container, and adding extra privileges to that container for the promise of better sandboxing within it is a non-starter. So that has me looking at either forking gVisor and collapsing it to the bare minimum, or writing a tiny seccomp and LD_PRELOAD shim just for this purpose.

fvoznika · 2020-09-29T18:42:58Z

That's an interesting use case. gVisor sets up many layers of defense around the sandbox to reduce access to the host in case the sandbox is compromised. You can find more details in the Containing a Real Vulnerability blog post. Configuring some of these layers requires CAP_SYS_ADMIN, which is not available to Docker containers unless the docker run --privileged flag is used. This is unintuitive, but in the same way that Docker needs to run as root to create containers, gVisor needs root to protect the sandbox.

So you can remove some of the layers that require high privilege (e.g. namespaces, pivot_root), similar to the way --rootless does. You just need to be aware that in the end the sandbox will not be as protected. You need to decide whether this protection is good enough for your use case. On the positive side, the most important security layer is seccomp, which you can still use.

Another consideration is that the gofer requires these capabilities to function correctly. Otherwise, some operations will not be allowed, like creating files with different owners.
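To check which of these capabilities a container actually has, the effective capability mask can be read from /proc/self/status. This is a minimal sketch assuming Linux; the helper names are hypothetical, and the bit numbers (CAP_CHOWN = 0, CAP_SYS_ADMIN = 21) come from the Linux capability definitions.

```python
# Capability bit numbers from <linux/capability.h>.
CAP_CHOWN = 0
CAP_SYS_ADMIN = 21


def parse_cap_eff(status_text: str) -> int:
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap: int) -> bool:
    """Return True if capability bit `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        mask = parse_cap_eff(f.read())
    print("CAP_SYS_ADMIN:", has_cap(mask, CAP_SYS_ADMIN))
    print("CAP_CHOWN:", has_cap(mask, CAP_CHOWN))
```

As an illustration, a mask of 00000000a80425fb, which is commonly seen for root in a default Docker container (treat this value as an assumption and check your own), has CAP_CHOWN set but not CAP_SYS_ADMIN.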

A few other options to consider are:

Run everything inside gVisor: this works if you don't need to isolate the media server from the code that would be running inside gVisor.
Use different containers: run the media server in one container and the untrusted workload in runsc. You can connect them using a container network. This would require giving the media server access to the Docker unix socket so that it can run containers.

avagin · 2020-09-29T18:59:38Z

avagin@1924e9e

This is the POC patch which adds the --unprivileged flag. With this flag, gVisor can be executed in a default Docker container, with the limitations that Fabricio described in the previous comment.

$ docker run -it --rm -v /tmp/runsc:/mnt alpine /bin/sh
/ # 
/ # /mnt/runsc --rootless --unprivileged --network none do echo 'Hello World!'
Hello World!

If the --unprivileged flag is specified, the Sentry and Gofer processes run in the current set of namespaces. In other words, we remove one level of isolation. But in your case the Docker container provides this extra level, so I think your use case can still be valid.
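Whether two processes really share a set of namespaces can be checked by comparing their /proc/<pid>/ns links. This is a minimal sketch assuming Linux and permission to read the target's /proc entry; the helper names are hypothetical and the example compares the current process with its parent.

```python
import os

NS_NAMES = ("mnt", "pid", "net", "ipc", "uts", "user")


def read_namespaces(pid):
    """Map namespace name to its identifier, e.g. 'mnt:[4026531840]'."""
    return {n: os.readlink(f"/proc/{pid}/ns/{n}") for n in NS_NAMES}


def shared_namespaces(a, b):
    """Return the sorted names of namespaces with identical identifiers."""
    return sorted(n for n in a if n in b and a[n] == b[n])


if __name__ == "__main__":
    me = read_namespaces(os.getpid())
    parent = read_namespaces(os.getppid())
    print("shared with parent:", shared_namespaces(me, parent))
```

A process started without any new namespaces would be expected to list every name as shared with the shell that launched it.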

ghost · 2020-10-04T00:47:19Z

> This is the POC patch which adds the --unprivileged flag. With this flag, gVisor can be executed in a default Docker container, with the limitations that Fabricio described in the previous comment.

This is interesting! How do filesystem and PID isolation work in this case? It looks from my untrained reading that this would allow access to send signals to other processes owned by the same user, or open files accessible to the current user. Does the emulated kernel provide that level of isolation?

prattmic · 2020-10-04T05:00:39Z

Yes, the userspace kernel still provides isolation, because it emulates the OS based on the provided configuration. The namespace features provide defense-in-depth in the event that the userspace kernel is compromised.

In the case of signals, all thread IDs within the sandbox are entirely internal to the userspace kernel, with no relation to the host. Signal syscalls sent by the sandboxed application can only target other threads in the sandbox (implementing a signal syscall in the userspace kernel may not even send a host signal at all). In fact, using a PID namespace is really a third layer of defense, since the userspace kernel can't send signals to other processes anyways.
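The idea can be illustrated with a toy model, which is not gVisor's implementation: thread IDs exist only in a table owned by the emulated kernel, so a signal call can only ever resolve to an entry in that table. The class and method names are hypothetical.

```python
import errno


class ToyKernel:
    """Toy model: thread IDs live only in this table, not on the host."""

    def __init__(self):
        self._next_tid = 1
        self._pending = {}  # internal tid -> list of pending signal numbers

    def spawn(self) -> int:
        """Create a sandboxed thread and return its internal tid."""
        tid = self._next_tid
        self._next_tid += 1
        self._pending[tid] = []
        return tid

    def kill(self, tid: int, sig: int) -> int:
        """Emulate kill(2): return 0 or a negative errno.

        No host syscall is made; the signal is only queued internally.
        """
        if tid not in self._pending:
            return -errno.ESRCH
        self._pending[tid].append(sig)
        return 0

    def pending(self, tid: int):
        """Return a copy of the signals queued for an internal tid."""
        return list(self._pending[tid])
```

In this model, a tid that happens to match a host PID is simply an unknown key, so the call fails with ESRCH instead of reaching any host process.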

Similarly, the userspace kernel can't directly open host files. That is mediated by another process called the gofer. The gofer won't grant access to files not allowed by the configuration, but were it to be compromised, the mount namespace containing only the configured files provides an additional layer of protection.
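The mediation step can be sketched as a path check against the configured roots. This is a toy illustration under stated assumptions, not the gofer's actual logic: the function name is hypothetical, paths are compared lexically, and symlinks are not resolved, which is one reason an outer layer such as a mount namespace is valuable.

```python
import posixpath


def is_allowed(path: str, roots) -> bool:
    """Return True if the normalized absolute path lies under an allowed root."""
    if not path.startswith("/"):
        return False
    norm = posixpath.normpath(path)  # collapses "..", "." and repeated "/"
    for root in roots:
        root = posixpath.normpath(root)
        if posixpath.commonpath([norm, root]) == root:
            return True
    return False


if __name__ == "__main__":
    roots = ["/data/media"]  # hypothetical configured mount
    for p in ("/data/media/a.mkv", "/data/media/../secret", "/etc/passwd"):
        print(p, "->", "allow" if is_allowed(p, roots) else "deny")
```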

teng1 · 2020-11-01T22:23:43Z

This is interesting. I have a slightly different use case: I was also trying to use runsc inside a container, in an attempt to get a dockerised (proprietary) application to work on a hardened K8s platform. The application requires some privileged capabilities to work, but the container platform drops all privileged capabilities for tenants for security reasons.

The idea was to use gvisor in a pod to (in a crude sense) pick these capabilities back up as a compatibility layer for the app. Would such a thing be feasible?

fvoznika · 2020-11-03T19:15:46Z

It's not a generic solution that will work with all containers, but it may work in your case depending on what the container does at runtime. In gVisor, all file system operations are handled by an external file proxy, called the Gofer, that is isolated from the sandbox for security purposes. The gofer requires capabilities to function correctly. For example, when the container creates a file while running as a user that exists inside the sandbox, the gofer requires CAP_CHOWN to change the file owner after creation.
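The create-then-chown pattern can be tried directly to see how it behaves with and without CAP_CHOWN. This is a minimal sketch assuming a POSIX system; the helper name try_chown and the target uid/gid 12345 are hypothetical.

```python
import errno
import os
import tempfile


def try_chown(path: str, uid: int, gid: int) -> str:
    """Attempt chown(2); return "ok" or the errno name the kernel reported."""
    try:
        os.chown(path, uid, gid)
        return "ok"
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        # Keeping the current owner needs no special capability.
        print("chown to self:", try_chown(f.name, os.getuid(), os.getgid()))
        # Changing to another owner normally requires CAP_CHOWN.
        print("chown to 12345:", try_chown(f.name, 12345, 12345))
```

Without CAP_CHOWN, the second call would be expected to report EPERM, which is the failure a gofer running with dropped capabilities would run into.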

paulfitz · 2022-03-07T23:11:27Z

Is there any possibility the --unprivileged flag from @avagin's POC could be added to mainstream gVisor? It comes in handy from time to time, for the kind of use case outlined at the start of the thread. I could imagine a concern that people would turn it on without understanding the trade-offs.

LarsSven · 2023-08-30T14:02:59Z

For my project I must also run gVisor inside a Docker container for integration-testing purposes. It is not possible to do this outside of a Docker container, because the environment has other requirements, such as specific file system mounts and Linux-specific tools, and the tests must be triggered locally from systems that don't have gVisor installed, don't have the file systems mounted, or aren't even running Linux.

Is there any way the unprivileged flag could somehow be merged or updated?

prattmic added the type: enhancement, area: container runtime, and area: usability labels Sep 28, 2020

negz mentioned this issue Mar 30, 2022

Prototype rootless container based composition functions crossplane/crossplane#3001

Closed

avagin self-assigned this Sep 7, 2023

paulfitz mentioned this issue Nov 19, 2023

Enable gvisor by default; exit docker if any server fails; log more visibly. gristlabs/grist-omnibus#11

Merged
