runsc inside of default docker seccomp policy #4371

Open
prattmic opened this issue Sep 28, 2020 · 9 comments
Labels: area: container runtime, area: usability, type: enhancement

Comments

@prattmic
Member

@scanlime on Twitter is trying to run runsc inside a Docker container with the standard seccomp policy enabled. This is similar to rootless mode (#311), but a little bit more strict.

The immediate issue is that we exec into empty namespaces, which the profile does not allow. It is not clear if there would be more issues if that were resolved, though I didn't see any glaring issues comparing our seccomp filters to Docker's.
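As a rough illustration of why the namespace setup is refused, here is a minimal sketch of how a Docker-style seccomp profile can gate syscalls on capabilities. The profile below is a hand-written sample in the style of Docker's JSON format, not the real default profile, and `syscall_allowed` is a hypothetical helper that only models the `includes.caps` condition (real profiles also filter on arguments and architectures).

```python
import json

# Hand-written sample in the style of a Docker seccomp profile.
# Assumption: this is NOT the actual default profile, only an illustration.
SAMPLE_PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "execve"], "action": "SCMP_ACT_ALLOW"},
    {"names": ["unshare", "setns", "mount"], "action": "SCMP_ACT_ALLOW",
     "includes": {"caps": ["CAP_SYS_ADMIN"]}}
  ]
}
""")


def syscall_allowed(profile, name, caps=frozenset()):
    """Return True if `name` is allowed for a container holding `caps`.

    Simplified model: a rule applies only when the container holds every
    capability listed under includes.caps; otherwise the rule is skipped.
    """
    for rule in profile.get("syscalls", []):
        if name not in rule.get("names", []):
            continue
        required = set(rule.get("includes", {}).get("caps", []))
        if required <= set(caps):
            return rule["action"] == "SCMP_ACT_ALLOW"
    # No matching rule: fall back to the profile's default action.
    return profile.get("defaultAction") == "SCMP_ACT_ALLOW"
```

In this model `unshare` falls through to the default action (an error) unless the container holds `CAP_SYS_ADMIN`, which is the kind of refusal described above.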

It's also not clear if the defense-in-depth features we'd have to disable to make this work would make it a bad idea. But in general, it is very reasonable to want to run a sandbox as a subprocess in an existing container.

cc @fvoznika @nlacasse

@prattmic added the type: enhancement, area: container runtime, and area: usability labels on Sep 28, 2020
@ghost

ghost commented Sep 28, 2020

Hi, prattmic invited me here to explain a bit further. Thanks for opening the issue. It's still in the experimental stages, but I'm working on a media server that will be doing transcodes in a locked-down environment, and I've been investigating which approaches I could adopt without adding a portability burden. For wide use I would really want the resulting project to run in a Docker container, and adding extra privileges to that container for the promise of better sandboxing within the container is really a non-starter. So that has me looking at either forking gVisor and collapsing it to the bare minimum, or writing a tiny seccomp and LD_PRELOAD thing just for this purpose.

@fvoznika
Member

That's an interesting use case. gVisor sets up many layers of defense around the sandbox to reduce access to the host in case the sandbox is compromised. You can find more details in the "Containing a Real Vulnerability" blog post. Configuring some of these layers requires CAP_SYS_ADMIN, which is not available to Docker containers unless the docker run --privileged flag is used. This is unintuitive, but in the same way that Docker needs to run as root to create containers, gVisor needs root to protect the sandbox as well.

So you can remove some of the layers that require high privilege (e.g. namespaces, pivot_root), similar to the way --rootless does. You just need to be aware that in the end the sandbox will not be as protected. You need to decide whether this protection is good enough for your use case. On the positive side, the most important security layer is seccomp, which you can still use.

Another consideration is that the gofer requires these capabilities to function correctly. Otherwise, some operations will not be allowed, like creating files with different owners.
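To see which of these capabilities a container actually holds, one option is to decode the `CapEff` bitmask from `/proc/self/status`. The sketch below is only an illustration: `effective_caps` and `has_cap` are hypothetical helper names, and the bit numbers are the Linux values for `CAP_CHOWN` (0) and `CAP_SYS_ADMIN` (21).

```python
# Linux capability bit numbers (from linux/capability.h).
CAP_CHOWN = 0
CAP_SYS_ADMIN = 21


def effective_caps(status_text):
    """Extract the hex CapEff mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


def has_cap(cap_eff_hex, cap_bit):
    """Return True if bit `cap_bit` is set in the hex capability mask."""
    return bool(int(cap_eff_hex, 16) & (1 << cap_bit))


# Assumption: a mask of the kind typically seen in a container started
# without --privileged; treat the exact value as illustrative.
EXAMPLE_STATUS = "Name:\tsh\nCapEff:\t00000000a80425fb\n"
```

On a real system the input would come from `open("/proc/self/status").read()`. With the example mask, `CAP_CHOWN` is present but `CAP_SYS_ADMIN` is not, which matches the situation described above: the gofer could still change file owners, but the namespace setup would be refused.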

A few other options to consider are:

  1. Run everything inside gVisor: this works if you don't need to isolate the media server from the code that would be running inside gVisor.
  2. Use different containers: run the media server in one container and the untrusted workload in runsc. You can connect them using the container network. This would require giving the media server access to the Docker unix socket to run containers.

@avagin
Collaborator

avagin commented Sep 29, 2020

avagin@1924e9e

This is the POC patch which adds the --unprivileged flag. With this flag, gVisor can be executed in a default Docker container with the limitations that Fabricio described in the previous comment.

$ docker run -it --rm -v /tmp/runsc:/mnt alpine /bin/sh
/ # 
/ # /mnt/runsc --rootless --unprivileged --network none do echo 'Hello World!'
Hello World!

If the unprivileged flag is specified, the Sentry and Gofer processes run in the current set of namespaces. In other words, we remove one level of isolation. But in your case, a Docker container provides this extra level, so I think your use case can still be valid.

@ghost

ghost commented Oct 4, 2020

This is the POC patch which adds the --unprivileged flag. With this flag, gVisor can be executed in a default Docker container with the limitations that Fabricio described in the previous comment.

This is interesting! How do filesystem and PID isolation work in this case? From my untrained reading, it looks like this would allow sending signals to other processes owned by the same user, or opening files accessible to the current user. Does the emulated kernel provide that level of isolation?

@prattmic
Member Author

prattmic commented Oct 4, 2020

Yes, the userspace kernel still provides isolation, because it is emulating the OS based on the provided configuration. The namespaces and capabilities discussed above provide defense-in-depth in the event that the userspace kernel is compromised.

In the case of signals, all thread IDs within the sandbox are entirely internal to the userspace kernel, with no relation to the host. Signal syscalls sent by the sandboxed application can only target other threads in the sandbox (implementing a signal syscall in the userspace kernel may not even send a host signal at all). In fact, using a PID namespace is really a third layer of defense, since the userspace kernel can't send signals to other processes anyways.
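A toy model may make the signal case concrete. This is not gVisor's implementation; `SandboxKernel` is a hypothetical class that only shows the shape of the idea: signal targets are looked up in the sandbox's own task table, so a host PID is never a valid target.

```python
class SandboxKernel:
    """Toy userspace kernel: thread IDs exist only in this table."""

    def __init__(self):
        self._tasks = {}      # sandbox-internal TID -> task state
        self._next_tid = 1

    def spawn(self, name):
        """Create a task and return its sandbox-internal thread ID."""
        tid = self._next_tid
        self._next_tid += 1
        self._tasks[tid] = {"name": name, "pending": []}
        return tid

    def kill(self, tid, signo):
        """Queue a signal for a task inside the sandbox.

        The lookup never consults the host, so an ID that is not in the
        table fails here and no host signal is sent.
        """
        task = self._tasks.get(tid)
        if task is None:
            raise ProcessLookupError(f"no such task in sandbox: {tid}")
        task["pending"].append(signo)

    def pending(self, tid):
        """Return a copy of the signals queued for a task."""
        return list(self._tasks[tid]["pending"])
```

Delivering the signal is just bookkeeping inside the table, which is why no host-level signal needs to be involved at all in this model.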

Similarly, the userspace kernel can't directly open host files. That is mediated by another process called the gofer. The gofer won't grant access to files not allowed by the configuration, but were it to be compromised, the mount namespace containing only the configured files provides an additional layer of protection.
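The first of those two layers, a proxy that refuses paths outside its configuration, can be sketched as follows. This is a simplified illustration, not the gofer's actual code: `resolve_in_root` is a hypothetical helper, and a real file proxy also has to deal with symlink races and other details this ignores.

```python
import os


def resolve_in_root(root, requested):
    """Map a sandbox path onto the host, refusing anything outside `root`.

    Assumption: `root` stands for the single directory the configuration
    exposes to the sandbox.
    """
    root_real = os.path.realpath(root)
    # Treat the requested path as relative to the configured root.
    candidate = os.path.realpath(os.path.join(root_real, requested.lstrip("/")))
    # After resolving ".." and symlinks, the result must still be under root.
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise PermissionError(f"path escapes configured root: {requested}")
    return candidate
```

If a check like this were bypassed in a compromised proxy, a mount namespace that contains only the configured files would still leave nothing else to open, which is the additional layer mentioned above.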

@teng1

teng1 commented Nov 1, 2020

This is interesting. I have a slightly different use case: I was also trying to use runsc inside a container, in an attempt to get a dockerised (proprietary) application to work on a hardened K8S platform. The application requires some privileged capabilities to work, but the container platform drops all privileged capabilities for tenants for security reasons.

The idea was to use gVisor in a pod to (in a crude sense) pick these capabilities back up as a compatibility layer for the app. Would such a thing be feasible?

@fvoznika
Member

fvoznika commented Nov 3, 2020

It's not a generic solution that will work with all containers, but it may work in your case depending on what the container does at runtime. In gVisor, all file system operations are handled by an external file proxy, called the Gofer, that is isolated from the sandbox for security purposes. The gofer requires capabilities to function correctly. For example, when the container creates a file while running as a user that exists only inside the sandbox, the gofer requires CAP_CHOWN to change the file owner after creation.

@paulfitz

paulfitz commented Mar 7, 2022

Is there any possibility the --unprivileged flag from @avagin's POC could be added to mainstream gVisor? It comes in handy from time to time, for the kind of use case outlined at the start of the thread. I could imagine the concern being that people would turn it on without understanding the trade-offs.

@LarsSven

For my project I must also run gVisor inside a Docker container for integration-testing purposes. It is not possible to do this outside of a Docker container, as the environment has other requirements, like specific file system mounts and Linux-specific tools, while the test environment must be triggered locally from systems that don't have gVisor installed, don't have the file systems mounted, or aren't even running Linux.

Is there any way the unprivileged flag could somehow be merged or updated?
