1.2.2 will not work with SysBox (error mounting "proc" to rootfs at "/proc": mount src=proc, dst=/proc, dstFd=/proc/thread-self/fd/8, flags=0xe: no such file or directory) #4542

MrPeacockNLB · 2024-12-03T11:51:48Z

Description

We are using SysBox in Azure Kubernetes to run Docker inside a pod. The pod runs Manjaro Linux with the runtime class sysbox-runc. This worked fine until I updated the runc package. The last stable version in Manjaro Linux was 1.1.14, which works without any issue. Manjaro had a new release last weekend, which updated runc to 1.2.2.

After updating to version 1.2.2 I could not run docker run hello-world. It fails with

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "proc" to rootfs at "/proc": mount src=proc, dst=/proc, dstFd=/proc/thread-self/fd/8, flags=0xe: no such file or directory: unknown.

See: containerd/containerd#11083

Steps to reproduce the issue

updating runc in Container from 1.1.14 up to 1.2.0 breaks docker run hello-world

Describe the results you received and expected

docker run hello-world should run

What version of runc are you using?

1.2.2

Host OS information

NAME="Manjaro Linux"
PRETTY_NAME="Manjaro Linux"
ID=manjaro
ID_LIKE=arch
BUILD_ID=rolling
VERSION_ID=rolling
ANSI_COLOR="32;1;24;144;200"
HOME_URL="https://manjaro.org/"
DOCUMENTATION_URL="https://wiki.manjaro.org/"
SUPPORT_URL="https://forum.manjaro.org/"
BUG_REPORT_URL="https://docs.manjaro.org/reporting-bugs/"
PRIVACY_POLICY_URL="https://manjaro.org/privacy-policy/"
LOGO=manjarolinux

Host kernel information

Azure AKS
K8S 1.29.9
Kernel 5.15.0-1071-azure

MrPeacockNLB · 2024-12-03T13:21:07Z

I used manjaro-downgrade runc to bisect the latest working version.

1.1.14 OK
1.1.15 OK <-- latest working version
1.2.0 NOK
1.2.1 NOK
1.2.2 NOK
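
The downgrade-and-retest loop above can also be written as a binary search over the ordered version list. This is only a sketch: `first_bad` and the `is_good` callback are hypothetical names, and it assumes that once a version fails, every later version fails too, which matches the results listed here. The reported results stand in for a real test run.

```python
def first_bad(versions, is_good):
    """Return the first version for which is_good() is False, or None.

    Assumes `versions` is sorted oldest to newest and that once a version
    fails, every later version fails too.
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid + 1  # the regression is after mid
        else:
            hi = mid      # mid is bad; the regression is at or before mid
    return versions[lo] if lo < len(versions) else None


# Results reported in this thread, used in place of actually running
# `docker run hello-world` against each runc version.
VERSIONS = ["1.1.14", "1.1.15", "1.2.0", "1.2.1", "1.2.2"]
KNOWN_GOOD = {"1.1.14", "1.1.15"}

print(first_bad(VERSIONS, lambda v: v in KNOWN_GOOD))
```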

cyphar · 2024-12-03T14:02:12Z

(I haven't yet reproduced this, just adding some information from the other bugs that wasn't mentioned in this report.)

This is related to nested containers, and you're getting this error when running Docker under sysbox (I guess sysbox-runc is being used to create the container that Docker is going to run in?). -ENOENT is a somewhat odd error to get here...

MrPeacockNLB · 2024-12-03T14:06:23Z

yes, we are using runtime sysbox-runc as a runtime class in Kubernetes.

E.g.:

      }
      spec {
        # for docker usage this must be "sysbox-runc"
        runtime_class_name = "sysbox-runc"

MrPeacockNLB · 2024-12-03T14:09:31Z

Running on SysBox: https://github.com/nestybox/sysbox/releases/tag/v0.6.4

bitshop · 2024-12-04T15:57:09Z

As a workaround, I was able to revert containerd to a working version on Ubuntu:

apt install containerd.io=1.6.33-1

pdziuba · 2024-12-04T18:37:16Z

thanks for breaking my CI, I love doing overtime 😘

pickles-bread-and-butter · 2024-12-04T19:35:58Z

If you're on CI consider changing your driver to just docker and not containerized to avoid the docker in docker

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
      driver: docker

thaJeztah · 2024-12-04T19:47:03Z

> thanks for breaking my CI, I love doing overtime 😘

Happy to hear it failed in CI before you rolled out updates to your production environment.

pdziuba · 2024-12-05T10:55:18Z

Sorry for sounding passive aggressive in previous comment. I really had a long day. Anyway, in my setup pinning version of buildkit helped:

    - name: Set up Docker Buildx
      uses: docker/[email protected]
      with:
        version: v0.17.1
        install: true
        driver-opts: image=moby/buildkit:v0.17.2

thaJeztah · 2024-12-05T12:40:08Z

No worries; we've all been there (pinning versions for situations where you don't want unexpected updates is still recommended though 😅). It's definitely not intentional to break existing setups, but 💩 sometimes happens, and sysbox/nestybox is a bit of a non-standard situation, which is not commonly tested against as part of upstream projects such as runc (or containerd). Perhaps a (scheduled) CI check in the nestybox/sysbox projects to test docker-in-docker with main / nightly builds of runc could be an enhancement to make in that project.

On that matter, I contacted a colleague who's involved in sysbox development on our (docker's) internal Slack to ask if he was aware of this, and/or had some pointers. He's currently occupied with some other work and we have some reduced staffing during the holidays, so he may not have immediate time to look into this, but he did point to some of the code that would likely be related. Here's his reply:

Yes inside the container, Sysbox intercepts a few syscalls, including mount and umount. If a new version of containerd or runc running inside a Sysbox container is reporting errors while mounting things, it's likely some subtle bug inside the Sysbox mount interception code ... I'll try to repro later today

In case you are curious, the code in Sysbox that processes the mount syscall interception is here: https://github.com/nestybox/sysbox-fs/blob/aeba775e52cc6385fa4807c594fc7ee164ad624c/seccomp/mount.go#L38
Basically, sysbox sets up the kernel's seccomp-notify to trap mount and umount syscalls for processes running inside the container; the kernel traps those and calls back into a user-space function in Sysbox; Sysbox then processes the system call as needed (as if it was the kernel)

The mechanism is used so that when a process inside the container (say runc) mounts procfs into a nested container, Sysbox ensures that the procfs that is mounted is one that is emulated by Sysbox itself, rather than the kernel's procfs, as the latter would essentially break isolation.
Same for mounts of sysfs and a few other filesystems.
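
To make the description above concrete, here is a rough sketch of the decision a mount interception handler has to make. It is not Sysbox's code (that is the Go code linked above); `handle_mount`, `EMULATED_FSTYPES`, and the two return values are hypothetical names chosen for illustration, and the pass-through branch is an assumption about what happens to mounts that need no emulation.

```python
# Filesystem types named in the reply above. The reply also mentions
# "a few other filesystems" without naming them, so none are listed here.
EMULATED_FSTYPES = {"proc", "sysfs"}


def handle_mount(fstype):
    """Decide what to do with a trapped mount(2) call (illustrative only).

    Returns "emulate" when the mount must be backed by the emulated
    filesystem, and "continue" when it can be handled as a normal mount
    (an assumption made for this sketch).
    """
    if fstype in EMULATED_FSTYPES:
        return "emulate"
    return "continue"


print(handle_mount("proc"))
print(handle_mount("tmpfs"))
```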

MrPeacockNLB · 2024-12-06T07:02:30Z

It seems there is a fix downstream on the way: nestybox/sysbox-fs#101

cyphar · 2024-12-06T07:20:43Z

@thaJeztah We should probably give them a heads up when we switch to using fsopen for mounting, as that's going to make their solution stop working as well (though of course it is possible to get it working with SECCOMP_IOCTL_NOTIF_ADDFD and enough trickery).

But yes, it seems (according to nestybox/sysbox-fs#101) that the actual issue is that they weren't emulating /proc/thread-self correctly with their fake procfs, and so any program that used /proc/thread-self would've run into issues as well AFAICS. (Note that if /proc/thread-self had been missing entirely, runc would've been able to gracefully fall back to using /proc/self.)
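
A minimal sketch of the fallback described in the note above, assuming the choice is made by checking whether `thread-self` exists. This is not runc's implementation; `procfs_self_base` and its `exists` parameter are hypothetical and only there to make the logic easy to exercise. It also shows why the fallback did not help here: if `thread-self` is present but incomplete, the check passes and the later access is what fails.

```python
import os
import posixpath


def procfs_self_base(proc_root="/proc", exists=os.path.lexists):
    """Pick the procfs directory to use for per-thread lookups.

    Prefers <proc_root>/thread-self and falls back to <proc_root>/self
    only when thread-self is missing entirely. A thread-self entry that
    exists but is not emulated correctly still gets selected.
    """
    thread_self = posixpath.join(proc_root, "thread-self")
    if exists(thread_self):
        return thread_self
    return posixpath.join(proc_root, "self")


# Simulate a procfs that has no thread-self entry at all.
print(procfs_self_base(exists=lambda path: False))
```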

kolyshkin · 2024-12-06T23:44:27Z

For reference, the corresponding runc change was added by 8e8b136, part of #3985.

thaJeztah · 2024-12-07T09:01:51Z

Good callout, yes, probably need to keep an eye on that thanks! (cc @ctalledo FYI)

kolyshkin · 2024-12-09T22:52:27Z

I guess we can close this one in favor of nestybox/sysbox#879.

ctalledo · 2024-12-09T23:09:26Z

Thanks folks for the help.

@cyphar, regarding:

> We should probably give them a heads up when we switch to using fsopen for mounting, as that's going to make their solution stop working as well (though of course it is possible to get it working with SECCOMP_IOCTL_NOTIF_ADDFD and enough trickery).

Yes correct; any sense on when runc will switch to use fsopen, fsmount, etc., to mount filesystems instead of the older mount syscall?

Thanks!

cyphar · 2024-12-10T05:04:56Z

@ctalledo

I can't give you a definite answer, but now that I think about it again, we will have to keep support for pre-fsopen kernels for a long time, so you will be able to force runc to use the mount fallback by just returning ENOSYS from your seccomp filters.
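
The ENOSYS fallback mentioned above follows a common pattern: try the newer interface, and drop to the older one only when the kernel (or a seccomp filter) reports that the syscall does not exist. The sketch below is not runc code; `mount_with_fallback`, `new_api_mount`, and `legacy_mount` are hypothetical names, and the two callables stand in for the real syscalls.

```python
import errno


def mount_with_fallback(new_api_mount, legacy_mount):
    """Try the fsopen-style path first, then fall back to mount(2).

    Only ENOSYS triggers the fallback; any other error is re-raised so
    that real failures are not hidden. Returns which path was used.
    """
    try:
        new_api_mount()
        return "fsopen"
    except OSError as exc:
        if exc.errno != errno.ENOSYS:
            raise
    legacy_mount()
    return "mount"


def blocked_by_filter():
    # Stand-in for a syscall that a seccomp filter answers with ENOSYS.
    raise OSError(errno.ENOSYS, "Function not implemented")


print(mount_with_fallback(blocked_by_filter, lambda: None))
```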

This work is part of several other bits of related work I plan to work on next year (hopefully in time for runc 1.4):

- Porting runc to libpathrs. This includes:
  - Migrating all usage of filepath-securejoin to libpathrs.
  - Switching all procfs operations to use the safe procfs API from libpathrs.
  - Doing an audit for any other filesystem operations that should be protected by libpathrs.
- Reworking the mount logic in rootfs_linux.go so that it can use move_mount(2) or mount(2) depending on kernel support. When combined with the libpathrs migration, this will make all of our mounting logic fd-based and thus free from filesystem races entirely (after which I'll be able to breathe a sigh of relief at last).

There is a separate issue though. Both cyphar/filepath-securejoin and libpathrs do verification of the procfs instance they are using internally and once we move to libpathrs's safe procfs API, it may be difficult for sysbox to trick runc into using its alternative procfs (I'm not quite sure how you do it -- is it FUSE-based or is it entirely based on seccomp-notify?). (This move is necessary because we have historically had security issues related to this, and I still feel that our current protections are not sufficient.) One possible complication (if you are using seccomp-notify entirely) is that the procfs operations are all done relative to dirfds and (with the exception of openat2-enabled kernels) the lookup is done component-by-component so there would be significant book-keeping necessary to figure out what procfs file is being opened.
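
To illustrate the book-keeping mentioned above: when lookups are done one component at a time relative to dirfds, an interceptor only ever sees a (dirfd, name) pair, so it has to remember which path each fd refers to. The class below is a hypothetical sketch, not code from runc, libpathrs, or Sysbox. It deliberately ignores symlinks, `..` components, and fds that are duplicated or passed between processes, which is where most of the real complexity would be.

```python
import posixpath


class DirfdTracker:
    """Remember the path behind each fd so openat() calls can be resolved."""

    def __init__(self):
        self._paths = {}

    def register(self, fd, path):
        # Record an fd whose path is known up front (e.g. the procfs root).
        self._paths[fd] = path

    def openat(self, dirfd, name, new_fd):
        # Resolve a single-component lookup relative to a tracked dirfd.
        path = posixpath.join(self._paths[dirfd], name)
        self._paths[new_fd] = path
        return path

    def close(self, fd):
        self._paths.pop(fd, None)


tracker = DirfdTracker()
tracker.register(3, "/proc")
tracker.openat(3, "thread-self", 4)
print(tracker.openat(4, "fd", 5))
```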

ctalledo · 2024-12-10T18:27:54Z

Hi @cyphar,

Thanks for all the context, much appreciated.

> we will have to keep support for pre-fsopen kernels for a long time, so you will be able to force runc to use the mount fallback by just returning ENOSYS from your seccomp filters.

Got it, that's good to know. Nonetheless Sysbox will still need to support fsopen and friends since apps other than runc may start using them to mount stuff, and Sysbox needs to intercept and vet those mounts (and sometimes perform them on behalf of the container process). But it will likely require more bookkeeping as you hinted above.

> There is a separate issue though ...

Thanks for the heads-up, will need to think how to deal with that. It will certainly make it more challenging, but I am confident we can make it work still.

FYI, Sysbox emulates procfs inside the container using FUSE. But it doesn't emulate the entire procfs; rather it only emulates the portions that are not namespaced by the kernel (e.g., typically some stuff under /proc/sys). So stuff directly under /proc (e.g., /proc/<pid>, /proc/self, /proc/thread-self, etc.) is actually not emulated and follows its normal kernel processing (thankfully, as otherwise things would be slow). In addition, Sysbox traps the mount and umount system calls (using seccomp-notify) to ensure, among other things, that all new procfs mounts inside the container are also emulated by Sysbox (otherwise it breaks isolation).

kolyshkin closed this as completed Dec 9, 2024
