Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support making /sys read-write #1474

Open
AkihiroSuda opened this issue Apr 14, 2020 · 18 comments
Open

Support making /sys read-write #1474

AkihiroSuda opened this issue Apr 14, 2020 · 18 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@AkihiroSuda
Copy link
Member

What would you like to be added:

echo 'INFO: remounting /sys read-only'
# systemd-in-a-container should have read only /sys
# https://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/
# however, we need other things from `docker run --privileged` ...
# and this flag also happens to make /sys rw, amongst other things
mount -o remount,ro /sys

kind makes /sys on the kind-control-plane intentionally read-only.
This should be optionally configurable to read-write.

The code comment cites https://www.freedesktop.org/wiki/Software/systemd/ContainerInterface/ as the reason for making /sys read-only, but it should not always apply to privileged containers.

Why is this needed:
Anbox requires writable /sys: aind-containers/aind#21

@AkihiroSuda AkihiroSuda added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 14, 2020
@BenTheElder
Copy link
Member

We had issues with systemd trying to do things it shouldn't when /sys was writeable.

specifically this part:

udev is not available in containers (and refuses to start), and hence device dependencies are unavailable. The udev unit files will check for /sys being read-only, as an indication whether device management can work. Hence make sure to mount /sys read-only in the container (see above).

This is desirable. Not having this caused issues with docker for mac / xhyve and I imagine could cause more issues. systemd/systemd#9740

Surely a priv container could remount it as rw later anyhow?

@AkihiroSuda
Copy link
Member Author

udev itself could be just disabled?

@BenTheElder
Copy link
Member

BenTheElder commented Apr 14, 2020

Sure, but systemd broadly relies on checking for ro /sys. As long as that expecation is true, I don't think this is a good idea.

Make sure to pre-mount /sys, and /proc, /sys/fs/selinux before invoking systemd, and mount /proc/sys and the entirety of /sys and /sys/fs/selinux read-only in order to avoid that the container can alter the host kernel's configuration settings. systemd and various tools (such as the selinux ones) have been modified to detect whether these file systems are read-only, and will behave accordingly.

@BenTheElder
Copy link
Member

Kubernetes will also try to configure conntrack if sys is writeable, not sure what else.

Maybe with aind a custom config could be used with extraMounts to pass through a rw /sys under some other path (with a minor tweak to the deployment of aind then).

I'm still a bit hesitant about how many footguns I expect this to open, but having the option to do it seems somewhat reasonable..

@threeseed
Copy link

Would definitely like the option for it to be read/write.

udev doesn't start properly unless sys is writable. And if udev doesn't start then OpenEBS is not usable on kind.

@BenTheElder
Copy link
Member

BenTheElder commented May 23, 2020 via email

@BenTheElder
Copy link
Member

Er and see above: per systemd udev should not run in containers

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2020
@BenTheElder BenTheElder removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2020
@kubernetes-sigs kubernetes-sigs deleted a comment from fejta-bot Sep 12, 2020
@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 11, 2020
@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 10, 2021
@BenTheElder BenTheElder removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 20, 2021
@kubernetes-sigs kubernetes-sigs deleted a comment from fejta-bot Jan 20, 2021
@kubernetes-sigs kubernetes-sigs deleted a comment from fejta-bot Jan 20, 2021
@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 20, 2021
@kubernetes-sigs kubernetes-sigs deleted a comment from fejta-bot May 18, 2021
@BenTheElder BenTheElder added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2021
@BenTheElder
Copy link
Member

We've shipped explicitly disabling udev which is not supported to be run inside containers (and really shouldn't be running trying to setup the host devices!), but we do not yet provide an option for making sys rw. Currently the available approach is to use a daemonset or exec to the docker nodes and make it rw yourself after startup.

@BenTheElder
Copy link
Member

I read through https://systemd.io/CONTAINER_INTERFACE/ again today for a different reason, it currently states:

  1. systemd-udevd is not available in containers (and refuses to start), and hence device dependencies are unavailable. The systemd-udevd unit files will check for /sys/ being read-only, as an indication whether device management can work. Therefore make sure to mount /sys/ read-only in the container (see above). Various clients of systemd-udevd also check the read-only state of /sys/, including PID 1 itself and systemd-networkd.

(emphasis mine)

It's left vague as to what these other clients do based on this state, but it seems clear that this expectation is not scoped only to udev

@BenTheElder
Copy link
Member

Further down it also states:

Do not make /sys/ writable in the container. If you do, systemd-udevd.service is started to manage your devices — inside the container, but that will cause conflicts and errors given that the Linux device model is not virtualized for containers on Linux and thus the containers and the host would try to manage the same devices, fighting for ownership. Multiple other subsystems of systemd similarly test for /sys/ being writable to decide whether to use systemd-udevd or assume that device management is properly available on the instance. Among them systemd-networkd and systemd-logind. The conditionalization on the read-only state of /sys/ enables a nice automatism: as soon as /sys/ and the Linux device model are changed to be virtualized properly the container payload can make use of that, simply by marking /sys/ writable. (Note that as special exception, the devices in /sys/class/net/ are virtualized already, if network namespacing is used. Thus it is OK to mount the relevant sub-directories of /sys/ writable, but make sure to leave the root of /sys/ read-only.)

Perhaps there are specific subdirectories desired to be writable for specific use-cases?

@BenTheElder BenTheElder added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jul 29, 2021
@tiagolobocastro
Copy link

Hi, I tried adding specific subdirectories but seems not work.

Here's what I did:

  extraMounts:
    - hostPath: /sys/class/nvme
      containerPath: /sys/class/nvme
    - hostPath: /sys/module/nvme_core
      containerPath: /sys/module/nvme_core

But on the csi driver I still get an error:

ERROR csi_node::node: Failed to unstage volume f77345f2-2a28-45d1-80c4-a3ec91b69c20: failed to detach device /dev/nvme1n1: File IO error: /sys/class/nvme/nvme1/delete_controller, Read-only file system (os error 30

And indeed if I try it on the kind-worker I get similar error:

❯ docker exec -it kind-worker /bin/sh                                                                                                                                                                    ─╯
# ls
bin  boot  dev	etc  home  kind  lib  lib32  lib64  libx32  media  mnt	opt  proc  root  run  sbin  srv  sys  tmp  usr	var
# mount | grep sys
sysfs on /sys/class/nvme type sysfs (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys/module/nvme_core type sysfs (rw,nosuid,nodev,noexec,relatime)
# ls /sys/class/nvme/nvme0
address  cntrltype	dctype		   dev	   fast_io_fail_tmo  hostid   hwmon0  model	 nvme0c0n1  queue_count      rescan_controller	serial	state	   subsystem  uevent
cntlid	 ctrl_loss_tmo	delete_controller  device  firmware_rev      hostnqn  kato    numa_node  power	    reconnect_delay  reset_controller	sqsize	subsysnqn  transport
# echo 1 > /sys/class/nvme/nvme0/delete_controller
/bin/sh: 4: cannot create /sys/class/nvme/nvme0/delete_controller: Read-only file system

Am I doing something wrong? Strange that mount if rw but still seems we can't write to it..

@BenTheElder
Copy link
Member

The above comment is asking if we have use cases for specific subdirectories, not suggesting that mounting them with extra mounts would work.

We set have to set /sys to ro during the entrypoint before we start systemd, we can't tell docker to mount it as ro for a privileged container.

@tiagolobocastro
Copy link

Ah sorry @BenTheElder I misunderstood..
If the root of /sys is to remain RO then I have a specific use case then:
Being able to test a csi-driver in kind, which requires write-access to /sys/class/nvme and /sys/module/nvme_core.

@BenTheElder
Copy link
Member

It should be preferable to test the driver against a fake e.g. using a loopback device because the test environment may not even have nvme anyhow?

We make zero guarantees about what the kind nodes contain except that they can run a given kubernetes version, we've switched distros and other details before, BUT, you could remount these paths rw yourself after startup e.g. with docker exec $node ...

@tiagolobocastro
Copy link

It should be preferable to test the driver against a fake e.g. using a loopback device because the test environment may not even have nvme anyhow?

In our specific case it's the csi-driver which is exposing an nvme-tcp device on the host where it is running.

We make zero guarantees about what the kind nodes contain except that they can run a given kubernetes version

We ensure our systems have the required kernel modules, etc possibly even running kind inside a thin headless vm, making it light and simple to run k8s-specific tests..

BUT, you could remount these paths rw yourself after startup e.g. with

Ah yes this might work, I'll try it tomorrow.

@BenTheElder
Copy link
Member

We ensure our systems have the required kernel modules, etc possibly even running kind inside a thin headless vm, making it light and simple to run k8s-specific tests..

kubeadm in the VM may be more appropriate for this use-case, FWIW.

@tiagolobocastro
Copy link

kubeadm in the VM may be more appropriate for this use-case, FWIW.

Yeah we have that, just looking to kind as a simpler lighter approach to have a multi-node cluster.

Changing /sys to rw by simply docker exec does the trick, thank you kindly :)
A more fine grained control would probably be "preferred" but for now this is good enough for me to carry on evaluating this alternative, thanks!

@BenTheElder
Copy link
Member

This is kind of a deep implementation detail and it's highly not recommended to be controlling the actual host from the kind node. Having /sys not be rw is a requirement from systemd to run in a container for this reason.

Glad the workaround solves your use case though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. priority/backlog Higher priority than priority/awaiting-more-evidence.
Projects
None yet
Development

No branches or pull requests

5 participants