1.2.2 will not work with SysBox (error mounting "proc" to rootfs at "/proc": mount src=proc, dst=/proc, dstFd=/proc/thread-self/fd/8, flags=0xe: no such file or directory) #4542

Closed
MrPeacockNLB opened this issue Dec 3, 2024 · 18 comments

Comments

@MrPeacockNLB

MrPeacockNLB commented Dec 3, 2024

Description

We are using Sysbox in Azure Kubernetes Service to run Docker inside a pod. The pod runs Manjaro Linux with the runtime class sysbox-runc. This worked fine until I updated the runc package. The last stable version in Manjaro Linux was 1.1.14, which works without any issue. Manjaro had a new release last weekend, which updated runc to 1.2.2.

After updating to version 1.2.2, docker run hello-world no longer works. It fails with:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "proc" to rootfs at "/proc": mount src=proc, dst=/proc, dstFd=/proc/thread-self/fd/8, flags=0xe: no such file or directory: unknown.

See: containerd/containerd#11083

Steps to reproduce the issue

  1. Update runc inside the container from 1.1.14 to 1.2.0 or later; docker run hello-world then fails.

Describe the results you received and expected

docker run hello-world should run

What version of runc are you using?

1.2.2

Host OS information

NAME="Manjaro Linux"
PRETTY_NAME="Manjaro Linux"
ID=manjaro
ID_LIKE=arch
BUILD_ID=rolling
VERSION_ID=rolling
ANSI_COLOR="32;1;24;144;200"
HOME_URL="https://manjaro.org/"
DOCUMENTATION_URL="https://wiki.manjaro.org/"
SUPPORT_URL="https://forum.manjaro.org/"
BUG_REPORT_URL="https://docs.manjaro.org/reporting-bugs/"
PRIVACY_POLICY_URL="https://manjaro.org/privacy-policy/"
LOGO=manjarolinux

Host kernel information

Azure AKS
K8S 1.29.9
Kernel 5.15.0-1071-azure

@MrPeacockNLB
Author

I used manjaro-downgrade runc to bisect the latest working version.

1.1.14 OK
1.1.15 OK <-- latest working version
1.2.0 NOK
1.2.1 NOK
1.2.2 NOK

@cyphar
Member

cyphar commented Dec 3, 2024

(I haven't yet reproduced this, just adding some information from the other bugs that wasn't mentioned in this report.)

This is related to nested containers, and you're getting this error when running Docker under sysbox (I guess sysbox-runc is being used to create the container that Docker is going to run in?). -ENOENT is a somewhat odd error to get here...

@MrPeacockNLB
Author

MrPeacockNLB commented Dec 3, 2024

yes, we are using runtime sysbox-runc as a runtime class in Kubernetes.

E.g.:

      }
      spec {
        # for docker usage this must be "sysbox-runc"
        runtime_class_name = "sysbox-runc"

@bitshop

bitshop commented Dec 4, 2024

As a workaround on Ubuntu, I was able to downgrade containerd to a working version:

apt install containerd.io=1.6.33-1

@pdziuba

pdziuba commented Dec 4, 2024

thanks for breaking my CI, I love doing overtime 😘

@pickles-bread-and-butter

pickles-bread-and-butter commented Dec 4, 2024

If you're on CI, consider changing your Buildx driver to plain docker instead of the containerized one, to avoid Docker-in-Docker:

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    driver: docker

@thaJeztah
Member

> thanks for breaking my CI, I love doing overtime 😘

Happy to hear it failed in CI before you rolled out updates to your production environment.

@pdziuba

pdziuba commented Dec 5, 2024

Sorry for sounding passive-aggressive in my previous comment. I really had a long day. Anyway, in my setup, pinning the version of buildkit helped:

    - name: Set up Docker Buildx
      uses: docker/[email protected]
      with:
        version: v0.17.1
        install: true
        driver-opts: image=moby/buildkit:v0.17.2

@thaJeztah
Member

No worries; we've all been there (pinning versions for situations where you don't want unexpected updates is still recommended though 😅). It's definitely not intentional to break existing setups, but 💩 sometimes happens, and sysbox/nestybox is a bit of a non-standard situation, which is not commonly tested against as part of upstream projects such as runc (or containerd). Perhaps a (scheduled) CI check in the nestybox/sysbox projects to test docker-in-docker with main / nightly builds of runc could be an enhancement to make in that project.

On that matter, I contacted a colleague who's involved in sysbox development on our (docker's) internal Slack to ask if he was aware of this, and/or had some pointers. He's currently occupied with some other work and we have some reduced staffing during the Holidays, so he may not have immediate time to look into this, but he did point to some of the code that would likely be related. Here's his reply:

> Yes, inside the container, Sysbox intercepts a few syscalls, including mount and umount. If a new version of containerd or runc running inside a Sysbox container is reporting errors while mounting things, it's likely some subtle bug inside the Sysbox mount interception code ... I'll try to repro later today
>
> In case you are curious, the code in Sysbox that processes the mount syscall interception is here: https://github.com/nestybox/sysbox-fs/blob/aeba775e52cc6385fa4807c594fc7ee164ad624c/seccomp/mount.go#L38
> Basically, sysbox sets up the kernel's seccomp-notify to trap mount and umount syscalls for processes running inside the container; the kernel traps those and calls back into a user-space function in Sysbox; Sysbox then processes the system call as needed (as if it was the kernel)
>
> The mechanism is used so that when a process inside the container (say runc) mounts procfs into a nested container, Sysbox ensures that the procfs that is mounted is one that is emulated by Sysbox itself, rather than the kernel's procfs, as the latter would essentially break isolation.
> Same for mounts of sysfs and a few other filesystems.
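
(Editorial sketch.) The interception flow described in the reply above can be modeled as a small decision function. This is only an illustration under stated assumptions: the seccomp-notify plumbing that delivers the trapped syscall arguments is left out, and every name here (`mountRequest`, `action`, `decide`) is hypothetical rather than taken from sysbox-fs. The only filesystem types it treats specially are the two named above, procfs and sysfs.

```go
package main

import "fmt"

// mountRequest holds the arguments of a trapped mount(2) call.
// Hypothetical type for illustration; not the sysbox-fs data structure.
type mountRequest struct {
	Source string
	Target string
	FSType string
}

// action is what the user-space handler decides to do with the call.
type action int

const (
	passThrough action = iota // let the kernel perform the mount unchanged
	emulate                   // the handler performs the mount with an emulated filesystem
)

// decide returns emulate for the filesystem types the reply says must not
// come straight from the kernel in a nested container (procfs and sysfs),
// and passThrough for everything else.
func decide(req mountRequest) action {
	switch req.FSType {
	case "proc", "sysfs":
		return emulate
	default:
		return passThrough
	}
}

func main() {
	reqs := []mountRequest{
		{Source: "proc", Target: "/proc", FSType: "proc"},
		{Source: "tmpfs", Target: "/dev", FSType: "tmpfs"},
	}
	for _, r := range reqs {
		fmt.Printf("%s on %s: emulate=%v\n", r.FSType, r.Target, decide(r) == emulate)
	}
}
```

A real handler would also have to read the path arguments from the calling process's memory and perform the mount on its behalf, which is where subtle bugs like the one in this issue can creep in.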

@MrPeacockNLB
Author

It seems there is a fix downstream on the way: nestybox/sysbox-fs#101

@cyphar
Member

cyphar commented Dec 6, 2024

@thaJeztah We should probably give them a heads up when we switch to using fsopen for mounting, as that's going to make their solution stop working as well (though of course it is possible to get it working with SECCOMP_IOCTL_NOTIF_ADDFD and enough trickery).

But yes, it seems (according to nestybox/sysbox-fs#101) that the actual issue is that they weren't emulating /proc/thread-self correctly with their fake procfs, and so any program that used /proc/thread-self would've run into issues as well AFAICS. (Note that if /proc/thread-self had been missing entirely, runc would've been able to gracefully fall back to using /proc/self.)
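
(Editorial sketch.) The fallback mentioned in the parenthetical can be illustrated roughly as follows. This is not runc's actual code, which is more involved; `procThreadSelf` and `onDisk` are hypothetical names, and the existence check is injected so the logic can be exercised without a real procfs.

```go
package main

import (
	"fmt"
	"os"
)

// procThreadSelf picks the procfs directory to use for per-thread lookups.
// The exists callback is injected so the logic can be tested in isolation.
func procThreadSelf(exists func(path string) bool) string {
	if exists("/proc/thread-self") {
		return "/proc/thread-self"
	}
	// Fallback for a procfs that has no thread-self entry at all.
	return "/proc/self"
}

// onDisk reports whether path exists, without following a final symlink.
func onDisk(path string) bool {
	_, err := os.Lstat(path)
	return err == nil
}

func main() {
	fmt.Println(procThreadSelf(onDisk))
}
```

The fallback only triggers when the entry is absent. An entry that exists but behaves incorrectly passes the check, so the failure surfaces later, at mount time, which is consistent with the error in this issue.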

@kolyshkin
Contributor

For reference, the corresponding runc change was added by 8e8b136, part of #3985.

@thaJeztah
Member

Good callout, yes, probably need to keep an eye on that thanks! (cc @ctalledo FYI)

@kolyshkin
Contributor

I guess we can close this one in favor of nestybox/sysbox#879.

@ctalledo
Contributor

ctalledo commented Dec 9, 2024

Thanks folks for the help.

@cyphar, regarding:

> We should probably give them a heads up when we switch to using fsopen for mounting, as that's going to make their solution stop working as well (though of course it is possible to get it working with SECCOMP_IOCTL_NOTIF_ADDFD and enough trickery).

Yes correct; any sense on when runc will switch to use fsopen, fsmount, etc., to mount filesystems instead of the older mount syscall?

Thanks!

@cyphar
Member

cyphar commented Dec 10, 2024

@ctalledo

I can't give you a definite answer, but now that I think about it again, we will have to keep support for pre-fsopen kernels for a long time, so you will be able to force runc to use the mount fallback by just returning ENOSYS from your seccomp filters.

This work is part of several other bits of related work I plan to work on next year (hopefully in time for runc 1.4):

  1. Porting runc to libpathrs. This includes:
    • Migrating all usage of filepath-securejoin to libpathrs.
    • Switching all procfs operations to use the safe procfs API from libpathrs.
    • Doing an audit for any other filesystem operations that should be protected by libpathrs.
  2. Reworking the mount logic in rootfs_linux.go so that it can use move_mount(2) or mount(2) depending on kernel support. When combined with the libpathrs migration, this will make all of our mounting logic fd-based and thus free from filesystem races entirely (after which I'll be able to breathe a sigh of relief at last).

There is a separate issue though. Both cyphar/filepath-securejoin and libpathrs do verification of the procfs instance they are using internally and once we move to libpathrs's safe procfs API, it may be difficult for sysbox to trick runc into using its alternative procfs (I'm not quite sure how you do it -- is it FUSE-based or is it entirely based on seccomp-notify?). (This move is necessary because we have historically had security issues related to this, and I still feel that our current protections are not sufficient.) One possible complication (if you are using seccomp-notify entirely) is that the procfs operations are all done relative to dirfds and (with the exception of openat2-enabled kernels) the lookup is done component-by-component so there would be significant book-keeping necessary to figure out what procfs file is being opened.
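
(Editorial sketch.) To make the book-keeping concern concrete: an interceptor that only sees single-component openat(dirfd, name) calls has to remember which path every directory fd refers to in order to reconstruct the file finally being opened. The sketch below models just that idea with simulated fds. All names (`fdTracker`, `newFDTracker`, `openat`) are hypothetical, and it ignores symlinks, fd duplication and closing, and resolves ".." purely lexically, all of which a real implementation would have to handle.

```go
package main

import (
	"fmt"
	"path"
)

// fdTracker records which path each directory fd refers to, so that a
// single-component openat(dirfd, name) can be mapped back to a full path.
type fdTracker struct {
	paths  map[int]string
	nextFD int
}

// newFDTracker starts tracking with one known fd, e.g. an fd for /proc.
func newFDTracker(rootFD int, rootPath string) *fdTracker {
	return &fdTracker{paths: map[int]string{rootFD: rootPath}, nextFD: rootFD + 1}
}

// openat models observing openat(dirfd, name) with one path component.
// It returns a simulated new fd and the full path that fd resolves to.
func (t *fdTracker) openat(dirfd int, name string) (int, string, error) {
	base, ok := t.paths[dirfd]
	if !ok {
		return -1, "", fmt.Errorf("unknown dirfd %d", dirfd)
	}
	full := path.Join(base, name)
	fd := t.nextFD
	t.nextFD++
	t.paths[fd] = full
	return fd, full, nil
}

func main() {
	t := newFDTracker(3, "/proc")
	fd := 3
	var full string
	// Walk /proc/sys/kernel/hostname one component at a time.
	for _, c := range []string{"sys", "kernel", "hostname"} {
		var err error
		fd, full, err = t.openat(fd, c)
		if err != nil {
			panic(err)
		}
	}
	fmt.Println(full)
}
```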

@ctalledo
Contributor

ctalledo commented Dec 10, 2024

Hi @cyphar,

Thanks for all the context, much appreciated.

> we will have to keep support for pre-fsopen kernels for a long time, so you will be able to force runc to use the mount fallback by just returning ENOSYS from your seccomp filters.

Got it, that's good to know. Nonetheless, Sysbox will still need to support fsopen and friends since apps other than runc may start using it to mount stuff, and Sysbox needs to intercept and vet those mounts (and sometimes perform them on behalf of the container process). But it will likely require more bookkeeping, as you hinted above.

> There is a separate issue though ...

Thanks for the heads-up, will need to think how to deal with that. It will certainly make it more challenging, but I am confident we can make it work still.

FYI, Sysbox emulates procfs inside the container using FUSE. But it doesn't emulate the entire procfs; rather it only emulates the portions that are not namespaced by the kernel (e.g., typically some stuff under /proc/sys). So stuff directly under /proc (e.g., /proc/<pid>, /proc/self, /proc/thread-self, etc.) is actually not emulated and follows its normal kernel processing (thankfully, as otherwise things would be slow). In addition, Sysbox traps the mount and umount system calls (using seccomp-notify) to ensure, among other things, that all new procfs mounts inside the container are also emulated by Sysbox (otherwise it breaks isolation).


8 participants