Missing /sys/fs/cgroup/cpuacct,cpu #145

Closed
ashcrow opened this issue Jun 27, 2018 · 21 comments

@ashcrow
Member

ashcrow commented Jun 27, 2018

@crawford found in his testing that /sys/fs/cgroup/cpuacct,cpu is expected, but RHCOS provides /sys/fs/cgroup/cpu,cpuacct.

kubernetes/kubernetes#32728 (comment) describes a similar issue. The workaround is to set up a link from one to the other.
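
A minimal sketch of that link-based workaround, assuming it is applied on the RHCOS host before the kubelet starts; the read-only remount of the cgroup tmpfs is an assumption about the default mount options, not something confirmed here:

$ sudo mount -o remount,rw /sys/fs/cgroup
$ sudo ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu   # expose the reversed name the kubelet expects
$ sudo mount -o remount,ro /sys/fs/cgroup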

ashcrow added the enhancement, good first issue, and jira labels Jun 27, 2018
@crawford
Contributor

Note that this reversal only happens within docker. Outside of docker, I see /sys/fs/cgroup/cpuacct,cpu.

@ashcrow
Member Author

ashcrow commented Jun 28, 2018

@crawford interesting. Thanks for adding that.

@derekwaynecarr does this truly look like the same issue you had seen before?

@ashcrow
Member Author

ashcrow commented Jul 2, 2018

@crawford / @derekwaynecarr:

Do we have a good understanding of how hard this will be to fix? I know @derekwaynecarr noted he's looked at this before and thought it had been fixed already.

ashcrow removed the good first issue label Jul 2, 2018
@ashcrow
Member Author

ashcrow commented Jul 3, 2018

To notify those following this issue: work on identifying the root cause has started.

@Bubblemelon

Bubblemelon commented Jul 6, 2018

Here's what I found so far with Docker 1.13.1 on RHEL 7, to check whether the error persists there:

Setup

$ vagrant box add --name RHCOS rhcos-vagrant-libvirt.box 
$ mkdir rhcos && cd rhcos && vagrant init RHCOS && vagrant up
$ vagrant ssh

Link to Vagrant box binary: http://aos-ostree.rhev-ci-vms.eng.rdu2.redhat.com/rhcos/images/cloud/latest/

RPM Overlaying

$ sudo ostree admin unlock --hotfix
$ rpm -qa | grep docker 

Docker version 1.13.1-70 RHEL7

$ sudo rpm-ostree override replace *.rpm  
$ sudo rpm-ostree status -v
$ sudo reboot

Ran the following command to start the Kubelet:

Commands source

/usr/bin/docker \
    run \
      --rm \
      --net host \
      --pid host \
      --privileged \
      --volume /dev:/dev:rw \
      --volume /sys:/sys:ro \
      --volume /var/run:/var/run:rw \
      --volume /var/lib/cni/:/var/lib/cni:rw \
      --volume /var/lib/docker/:/var/lib/docker:rw \
      --volume /var/lib/kubelet/:/var/lib/kubelet:shared \
      --volume /var/log:/var/log:shared \
      --volume /etc/kubernetes:/etc/kubernetes:ro \
      --entrypoint /usr/bin/hyperkube \
    "openshift/origin-node" \
      kubelet \
        --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
        --kubeconfig=/var/lib/kubelet/kubeconfig \
        --rotate-certificates \
        --cni-conf-dir=/etc/kubernetes/cni/net.d \
        --cni-bin-dir=/var/lib/cni/bin \
        --network-plugin=cni \
        --lock-file=/var/run/lock/kubelet.lock \
        --exit-on-lock-contention \
        --pod-manifest-path=/etc/kubernetes/manifests \
        --allow-privileged \
        --node-labels=node-role.kubernetes.io/master \
        --minimum-container-ttl-duration=6m0s \
        --cluster-dns=10.3.0.10 \
        --cluster-domain=cluster.local \
        --client-ca-file=/etc/kubernetes/ca.crt \
        --anonymous-auth=false \
        --register-with-taints=node-role.kubernetes.io/master=:NoSchedule

Which gave me the following output:

Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --allow-privileged has been deprecated, will be removed in a future version
Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.                                                                   
Flag --cluster-dns has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cluster-domain has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/
for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/
for more information.
I0705 23:32:18.768616    2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:18.768902    2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:19.036018    2186 server.go:415] Version: v1.11.0+d4cacc0
I0705 23:32:19.036062    2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:19.036110    2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:19.036123    2186 server.go:493] acquiring file lock on "/var/run/lock/kubelet.lock"
I0705 23:32:19.036150    2186 server.go:498] watching for inotify events for: /var/run/lock/kubelet.lock
I0705 23:32:19.036262    2186 plugins.go:97] No cloud provider specified.
W0705 23:32:19.036290    2186 server.go:556] standalone mode, no API client
F0705 23:32:19.036300    2186 server.go:262] failed to run Kubelet: No authentication method configured

So it looks like the cgroup error isn't showing up with this docker version, unless my reproduction steps are incorrect.

Update: see the comments below; the tests described in this comment were insufficient to identify the problem.

@Bubblemelon

I've encountered an error when mounting NFS shared folders (i.e. at /vagrant), and running exportfs -a -v doesn't change anything. @cgwalters may have already fixed this error, as suggested by @peterbaouoft.

The full error log is in this gist.

@ashcrow
Member Author

ashcrow commented Jul 6, 2018

@Bubblemelon this makes me wonder if the fix was applied at build time via a patch. It may be worth using rpmdev-extract to take a look at the contents of the SRPM and see what patches (if any) are applied.
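
A minimal sketch of that check, assuming the matching SRPM has already been downloaded; the filename and extracted directory name below are illustrative:

$ rpmdev-extract docker-1.13.1-63.git94f4240.el7.src.rpm            # rpmdev-extract comes from rpmdevtools
$ grep -i '^Patch' docker-1.13.1-63.git94f4240.el7.src/docker.spec  # list the patches applied at build time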

@Bubblemelon

Using this Libvirt how-to guide to verify the assumptions in my comment above about docker's cgroup driver:

Master Node Info

RHCOS version: source

[core@coreos-220-master-0 ~]$ rpm-ostree status -v
State: idle; auto updates disabled
Deployments:
● ostree://rhcos:openshift/3.10/x86_64/os
                   Version: 3.10-7.5.235 (2018-07-06 22:41:39)
                    Commit: f51faab9a702e0d85905f3edc81641a63c9ec3c8acf0319e52d03de03de67e5f
                            └─ atomic-centos-continuous (2018-07-06 20:45:09)
                            └─ dustymabe-ignition (2018-07-03 00:29:34)
                            └─ rhcos-continuous (2018-07-06 19:25:38)
                            └─ rhel-7.5-server (2018-05-02 10:10:39)
                            └─ rhel-7.5-server-optional (2018-05-02 10:06:54)
                            └─ rhel-7.5-server-extras (2018-05-02 13:57:35)
                            └─ rhel-7.5-atomic (2017-07-11 17:45:34)
                            └─ openshift (2018-07-06 21:46:14)
                    Staged: no
                 StateRoot: rhcos

Docker Version: 2018-04-30 15:56:58

[core@coreos-220-master-0 ~]$ rpm -qa | grep docker
docker-client-1.13.1-63.git94f4240.el7.x86_64
docker-rhel-push-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-common-1.13.1-63.git94f4240.el7.x86_64
docker-1.13.1-63.git94f4240.el7.x86_64
docker-novolume-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-lvm-plugin-1.13.1-63.git94f4240.el7.x86_64

Output from $ journalctl -u docker:

Jul 09 17:51:20 coreos-220-master-0 dockerd-current[1145]: F0709 17:51:20.232817   25049 server.go:262] failed to run Kubelet: 
failed to create kubelet: misconfiguration: 
kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"                                                                                                                                                                        

Jul 09 17:51:20 coreos-220-master-0 dockerd-current[1145]: time="2018-07-09T17:51:20.267459141Z" level=error msg="containerd:
deleting container" error="exit status 1: \"container b85300b2eee4b379bec5753361f37e11bcb8cacdd7c4aa6c9179d62eb93ab001 does not exist\\none or more of the container deletions failed\\n\""

Jul 09 17:51:20 coreos-220-master-0 dockerd-current[1145]: time="2018-07-09T17:51:20.298990686Z" level=warning msg="b85300b2eee4b379bec5753361f37e11bcb8cacdd7c4aa6c9179d62eb93ab001 cleanup: failed to unmount secrets: invalid argument"

In trying to resolve the cgroupfs is different from docker cgroup driver: systemd error:

I found this openshift issue #18776, which suggests placing

ExecStart=/usr/bin/dockerd \
          --exec-opt native.cgroupdriver=systemd 

within docker.service. However, the /usr directory is read-only, and docker.service already contains the flag, as shown below (a note on adding daemon flags without touching /usr follows the unit file):

[core@coreos-220-master-0 system]$ cat docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target rhel-push-plugin.socket registries.service
Wants=docker-storage-setup.service
Requires=rhel-push-plugin.socket registries.service
Requires=docker-cleanup.timer

[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/run/containers/registries.conf
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
Environment=DOCKER_HTTP_HOST_COMPAT=1
Environment=PATH=/usr/libexec/docker:/usr/bin:/usr/sbin
ExecStart=/usr/bin/dockerd-current \
          --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current \
          --default-runtime=docker-runc \
          --authorization-plugin=rhel-push-plugin \
          --exec-opt native.cgroupdriver=systemd \         <--------------
          --userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
          --init-path=/usr/libexec/docker/docker-init-current \
          --seccomp-profile=/etc/docker/seccomp.json \
          $OPTIONS \
          $DOCKER_STORAGE_OPTIONS \
          $DOCKER_NETWORK_OPTIONS \
          $ADD_REGISTRY \
          $BLOCK_REGISTRY \
          $INSECURE_REGISTRY \
          $REGISTRIES
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TimeoutStartSec=0
Restart=on-abnormal
KillMode=process

[Install]
WantedBy=multi-user.target
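
For reference, a sketch of how extra daemon flags could be added without touching the read-only /usr (not needed here, since --exec-opt native.cgroupdriver=systemd is already present above): /etc/sysconfig/docker is referenced by the EnvironmentFile= lines and lives on the writable /etc.

$ sudo vi /etc/sysconfig/docker    # append the desired flag, e.g. --exec-opt native.cgroupdriver=systemd, to OPTIONS=
$ sudo systemctl restart docker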

@cgwalters
Member

Related: coreos/bugs#1435

@Bubblemelon

Bubblemelon commented Jul 9, 2018

The error above

failed to create kubelet: misconfiguration: 
kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd" 

can be resolved by adding --cgroup-driver=systemd \ to kubelet.service:

[Unit]
Description=Kubernetes Kubelet
...

[Service]
...

ExecStart=/usr/bin/docker \
  run \
    .
    .
  "openshift/origin-node:latest" \
    kubelet \
      .
      .
      . 
      --cgroup-driver=systemd \

After running sudo systemctl daemon-reload && sudo systemctl restart kubelet, both journalctl -u docker and journalctl -u kubelet show the same output:

kubelet.go:1769] skipping pod synchronization - [container runtime is down]                                                       
kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach                                     
kubelet.go:1312] Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory                 
kubelet.service: main process exited, code=exited, status=255/n/a                                                                                                 
Unit kubelet.service entered failed state.
kubelet.service failed.
kubelet_node_status.go:79] Attempting to register node coreos-220-master-0                                                       
kubelet.go:1312] Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory                
kubelet.service: main process exited, code=exited, status=255/n/a                                                                                                 
Unit kubelet.service entered failed state.
kubelet.service failed.
$ rpm -qa | grep docker
docker-client-1.13.1-63.git94f4240.el7.x86_64
docker-rhel-push-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-common-1.13.1-63.git94f4240.el7.x86_64
docker-1.13.1-63.git94f4240.el7.x86_64
docker-novolume-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-lvm-plugin-1.13.1-63.git94f4240.el7.x86_64

@ashcrow
Member Author

ashcrow commented Jul 9, 2018

Great work debugging @Bubblemelon!

@Bubblemelon

Bubblemelon commented Jul 9, 2018

Also thank you @crawford for helping me!

Just to clarify, something on the kubelet side is causing the Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory error.

I've also tried it out with this docker version: source - Sun, 08 Jul 2018 09:39:40 UT

docker-1.13.1-72.git6f36bd4.el7.x86_64
docker-rhel-push-plugin-1.13.1-72.git6f36bd4.el7.x86_64
docker-client-1.13.1-72.git6f36bd4.el7.x86_64
docker-lvm-plugin-1.13.1-72.git6f36bd4.el7.x86_64
docker-common-1.13.1-72.git6f36bd4.el7.x86_64
docker-novolume-plugin-1.13.1-72.git6f36bd4.el7.x86_64

Which gave the same error.

@Bubblemelon

Bubblemelon commented Jul 9, 2018

I'd like to note that openshift/origin-node:latest, i.e. openshift v3.11.0-alpha.0+90e2736-260, is running Kubernetes v1.11.0+d4cacc0.

That version of the kubelet should include this fix.

@Bubblemelon

@derekwaynecarr what are your thoughts on this?

@mrunalp
Member

mrunalp commented Jul 17, 2018

cAdvisor doesn't like /sys:/sys:ro. See google/cadvisor#1843

@Bubblemelon

Bubblemelon commented Jul 17, 2018

This same error,

kubelet.go:1312] Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory

still occurs when /sys is changed to read-write within the kubelet.service file:

.
.
ExecStart=/usr/bin/docker \
  run \
    .
    .
    --volume /sys:/sys:rw \
    .

Note that on RHCOS, the path is named /sys/fs/cgroup/cpu,cpuacct (not cpuacct,cpu).
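
A quick, illustrative check of which name the host actually exposes (run on the RHCOS host, outside the container):

$ ls -d /sys/fs/cgroup/cpu*   # lists the cpu-related cgroup hierarchies as mounted on the host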

If both of these were added under ExecStart=/usr/bin/docker \

    --volume /sys:/sys:rw \
    --volume=/sys/fs/cgroup/cpu,cpuacct:/sys/fs/cgroup/cpuacct,cpu:rw \

This error would occur:

kubelet.service holdoff time over, scheduling restart.
Starting Kubernetes Kubelet...
Started Kubernetes Kubelet.
container_linux.go:247: starting container process caused "process_linux.go:364: container init caused
\"rootfs_linux.go:54: mounting \\\"/sys/fs/cgroup/cpu,cpuacct\\\" to rootfs
\\\"/var/lib/docker/overlay2/8c95a16f4cad1f014091093c62248c6c0f27bcde879606cef6220f7db4521708/merged\\\" at
\\\"/var/lib/docker/overlay2/8c95a16f4cad1f014091093c62248c6c0f27bcde879606cef6220f7db4521708/merged/sys/fs/cgroup/cpuacct,cpu\\\"
caused \\\"no space left on device\\\"\""
 /usr/bin/docker-current: Error response from daemon: oci runtime error: Failed to remove paths: 
map[cpu:/sys/fs/cgroup/cpu,cpuacct/system.slice/docker-afc3a2d6c323ed28a6c7e6586239cb4db8b79b591513eb229ca6fa1eb0bead3b.scope 
cpuacct:/sys/fs/cgroup/cpu,cpuacct/system.slice/docker-afc3a2d6c323ed28a6c7e6586239cb4db8b79b591513eb229ca6fa1eb0bead3b.scope].

@ashcrow
Member Author

ashcrow commented Jul 23, 2018

@crawford do you mind stating what priority you think this should have, or whether the workaround in use should be applied in the RHCOS spins themselves? This would clarify whether @Bubblemelon and @mrunalp should keep digging into this specific issue.

@crawford
Contributor

This needs to be fixed in the Kubelet. If the OS team is going to tackle that, then I think this bug should stay. Otherwise, let's close this and let @derekwaynecarr and his team tackle the issue. Either way, this is a low priority. I have a workaround (it's ugly, but it works).

@ashcrow
Member Author

ashcrow commented Jul 23, 2018

Since this is kubelet-related, we should pass it over to @derekwaynecarr's team and link back to this issue so they don't have to redo all of the good debugging done so far.

@Bubblemelon

Bubblemelon commented Jul 23, 2018

Moved this issue over to openshift/origin

@ashcrow
Member Author

ashcrow commented Jul 23, 2018

Closing since the fix must be done in another codebase.

@ashcrow ashcrow closed this as completed Jul 23, 2018