Missing /sys/fs/cgroup/cpuacct,cpu #145

Closed
ashcrow opened this issue Jun 27, 2018 · 21 comments

@ashcrow
Member

ashcrow commented Jun 27, 2018

@crawford found in his testing that /sys/fs/cgroup/cpuacct,cpu is expected, but RHCOS provides /sys/fs/cgroup/cpu,cpuacct.

kubernetes/kubernetes#32728 (comment) describes a similar issue. The workaround is to set up a link from one to the other.
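
A minimal sketch of that link-based workaround, assuming it is applied on the RHCOS host before the kubelet starts; the read-only remount of the cgroup tmpfs is an assumption about the default mount options, not something confirmed here:

$ sudo mount -o remount,rw /sys/fs/cgroup
$ sudo ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu   # expose the reversed name the kubelet expects
$ sudo mount -o remount,ro /sys/fs/cgroup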

ashcrow added the enhancement, good first issue, and jira labels Jun 27, 2018
@crawford
Contributor

Note that this reversal only happens within docker. Outside of docker, I see /sys/fs/cgroup/cpuacct,cpu.

@ashcrow
Member Author

ashcrow commented Jun 28, 2018

@crawford interesting. Thanks for adding that.

@derekwaynecarr does this truly look like the same issue you had seen before?

@ashcrow
Member Author

ashcrow commented Jul 2, 2018

@crawford / @derekwaynecarr:

Do we have a good understanding of how hard this will be to fix? I know @derekwaynecarr noted he's looked at this before and thought it had been fixed already.

ashcrow removed the good first issue label Jul 2, 2018
@ashcrow
Member Author

ashcrow commented Jul 3, 2018

To notify those following this issue: work on identifying the root cause has started.

@Bubblemelon

Bubblemelon commented Jul 6, 2018

Here's what I found so far with Docker 1.13.1 on RHEL 7, to check whether the error persists there:

Setup

$ vagrant box add --name RHCOS rhcos-vagrant-libvirt.box 
$ mkdir rhcos && cd rhcos && vagrant init RHCOS && vagrant up
$ vagrant ssh

Link to Vagrant box binary: http://aos-ostree.rhev-ci-vms.eng.rdu2.redhat.com/rhcos/images/cloud/latest/

RPM Overlaying

$ sudo ostree admin unlock --hotfix
$ rpm -qa | grep docker 

Docker version 1.13.1-70 RHEL7

$ sudo rpm-ostree override replace *.rpm  
$ sudo rpm-ostree status -v
$ sudo reboot

Ran the following command to start the Kubelet:

Commands source

/usr/bin/docker \
    run \
      --rm \
      --net host \
      --pid host \
      --privileged \
      --volume /dev:/dev:rw \
      --volume /sys:/sys:ro \
      --volume /var/run:/var/run:rw \
      --volume /var/lib/cni/:/var/lib/cni:rw \
      --volume /var/lib/docker/:/var/lib/docker:rw \
      --volume /var/lib/kubelet/:/var/lib/kubelet:shared \
      --volume /var/log:/var/log:shared \
      --volume /etc/kubernetes:/etc/kubernetes:ro \
      --entrypoint /usr/bin/hyperkube \
    "openshift/origin-node" \
      kubelet \
        --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
        --kubeconfig=/var/lib/kubelet/kubeconfig \
        --rotate-certificates \
        --cni-conf-dir=/etc/kubernetes/cni/net.d \
        --cni-bin-dir=/var/lib/cni/bin \
        --network-plugin=cni \
        --lock-file=/var/run/lock/kubelet.lock \
        --exit-on-lock-contention \
        --pod-manifest-path=/etc/kubernetes/manifests \
        --allow-privileged \
        --node-labels=node-role.kubernetes.io/master \
        --minimum-container-ttl-duration=6m0s \
        --cluster-dns=10.3.0.10 \
        --cluster-domain=cluster.local \
        --client-ca-file=/etc/kubernetes/ca.crt \
        --anonymous-auth=false \
        --register-with-taints=node-role.kubernetes.io/master=:NoSchedule

Which gave me the following output:

Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --allow-privileged has been deprecated, will be removed in a future version
Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.                                                                   
Flag --cluster-dns has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --cluster-domain has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/
for more information.
Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/
for more information.
I0705 23:32:18.768616    2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:18.768902    2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:19.036018    2186 server.go:415] Version: v1.11.0+d4cacc0
I0705 23:32:19.036062    2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:19.036110    2186 feature_gate.go:230] feature gates: &{map[]}
I0705 23:32:19.036123    2186 server.go:493] acquiring file lock on "/var/run/lock/kubelet.lock"
I0705 23:32:19.036150    2186 server.go:498] watching for inotify events for: /var/run/lock/kubelet.lock
I0705 23:32:19.036262    2186 plugins.go:97] No cloud provider specified.
W0705 23:32:19.036290    2186 server.go:556] standalone mode, no API client
F0705 23:32:19.036300    2186 server.go:262] failed to run Kubelet: No authentication method configured

So it looks like the cgroup error isn't showing up with this docker version, unless my reproduction steps are incorrect.

Update: see the comments below; the tests described in this comment were insufficient to identify the problem.

@Bubblemelon

I've encountered an error when mounting NFS shared folders (i.e. at /vagrant), and running exportfs -a -v doesn't change anything. @cgwalters may have already fixed this error, as suggested by @peterbaouoft.

The full error log is in this gist.

@ashcrow
Member Author

ashcrow commented Jul 6, 2018

@Bubblemelon this makes me wonder if the fix was applied at build time via a patch. It may be worth using rpmdev-extract to take a look at the contents of the SRPM and see what patches (if any) are applied.
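
A minimal sketch of that check, assuming the matching SRPM has already been downloaded; the filename and extracted directory name below are illustrative:

$ rpmdev-extract docker-1.13.1-63.git94f4240.el7.src.rpm            # rpmdev-extract comes from rpmdevtools
$ grep -i '^Patch' docker-1.13.1-63.git94f4240.el7.src/docker.spec  # list the patches applied at build time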

@Bubblemelon

Using this Libvirt how-to guide to verify the assumptions in my comment above about docker's cgroup driver:

Master Node Info

RHCOS version: source

[core@coreos-220-master-0 ~]$ rpm-ostree status -v
State: idle; auto updates disabled
Deployments:
● ostree://rhcos:openshift/3.10/x86_64/os
                   Version: 3.10-7.5.235 (2018-07-06 22:41:39)
                    Commit: f51faab9a702e0d85905f3edc81641a63c9ec3c8acf0319e52d03de03de67e5f
                            └─ atomic-centos-continuous (2018-07-06 20:45:09)
                            └─ dustymabe-ignition (2018-07-03 00:29:34)
                            └─ rhcos-continuous (2018-07-06 19:25:38)
                            └─ rhel-7.5-server (2018-05-02 10:10:39)
                            └─ rhel-7.5-server-optional (2018-05-02 10:06:54)
                            └─ rhel-7.5-server-extras (2018-05-02 13:57:35)
                            └─ rhel-7.5-atomic (2017-07-11 17:45:34)
                            └─ openshift (2018-07-06 21:46:14)
                    Staged: no
                 StateRoot: rhcos

Docker Version: 2018-04-30 15:56:58

[core@coreos-220-master-0 ~]$ rpm -qa | grep docker
docker-client-1.13.1-63.git94f4240.el7.x86_64
docker-rhel-push-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-common-1.13.1-63.git94f4240.el7.x86_64
docker-1.13.1-63.git94f4240.el7.x86_64
docker-novolume-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-lvm-plugin-1.13.1-63.git94f4240.el7.x86_64

Output from $ journalctl -u docker:

Jul 09 17:51:20 coreos-220-master-0 dockerd-current[1145]: F0709 17:51:20.232817   25049 server.go:262] failed to run Kubelet: 
failed to create kubelet: misconfiguration: 
kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"                                                                                                                                                                        

Jul 09 17:51:20 coreos-220-master-0 dockerd-current[1145]: time="2018-07-09T17:51:20.267459141Z" level=error msg="containerd:
deleting container" error="exit status 1: \"container b85300b2eee4b379bec5753361f37e11bcb8cacdd7c4aa6c9179d62eb93ab001 does not exist\\none or more of the container deletions failed\\n\""

Jul 09 17:51:20 coreos-220-master-0 dockerd-current[1145]: time="2018-07-09T17:51:20.298990686Z" level=warning msg="b85300b2eee4b379bec5753361f37e11bcb8cacdd7c4aa6c9179d62eb93ab001 cleanup: failed to unmount secrets: invalid argument"

In trying to resolve the cgroupfs is different from docker cgroup driver: systemd error:

I found this openshift issue #18776, which suggests placing

ExecStart=/usr/bin/dockerd \
          --exec-opt native.cgroupdriver=systemd 

within docker.service. However, the /usr directory is read-only, and docker.service already contains the flag, as shown below (a note on adding daemon flags without touching /usr follows the unit file):

[core@coreos-220-master-0 system]$ cat docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target rhel-push-plugin.socket registries.service
Wants=docker-storage-setup.service
Requires=rhel-push-plugin.socket registries.service
Requires=docker-cleanup.timer

[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/run/containers/registries.conf
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
Environment=DOCKER_HTTP_HOST_COMPAT=1
Environment=PATH=/usr/libexec/docker:/usr/bin:/usr/sbin
ExecStart=/usr/bin/dockerd-current \
          --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current \
          --default-runtime=docker-runc \
          --authorization-plugin=rhel-push-plugin \
          --exec-opt native.cgroupdriver=systemd \         <--------------
          --userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
          --init-path=/usr/libexec/docker/docker-init-current \
          --seccomp-profile=/etc/docker/seccomp.json \
          $OPTIONS \
          $DOCKER_STORAGE_OPTIONS \
          $DOCKER_NETWORK_OPTIONS \
          $ADD_REGISTRY \
          $BLOCK_REGISTRY \
          $INSECURE_REGISTRY \
          $REGISTRIES
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TimeoutStartSec=0
Restart=on-abnormal
KillMode=process

[Install]
WantedBy=multi-user.target
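
For reference, a sketch of how extra daemon flags could be added without touching the read-only /usr (not needed here, since --exec-opt native.cgroupdriver=systemd is already present above): /etc/sysconfig/docker is referenced by the EnvironmentFile= lines and lives on the writable /etc.

$ sudo vi /etc/sysconfig/docker    # append the desired flag, e.g. --exec-opt native.cgroupdriver=systemd, to OPTIONS=
$ sudo systemctl restart docker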

@cgwalters
Member

Related: coreos/bugs#1435

@Bubblemelon

Bubblemelon commented Jul 9, 2018

The error above

failed to create kubelet: misconfiguration: 
kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd" 

can be resolved by adding --cgroup-driver=systemd \ to kubelet.service:

[Unit]
Description=Kubernetes Kubelet
...

[Service]
...

ExecStart=/usr/bin/docker \
  run \
    .
    .
  "openshift/origin-node:latest" \
    kubelet \
      .
      .
      . 
      --cgroup-driver=systemd \

After running sudo systemctl daemon-reload && sudo systemctl restart kubelet, both journalctl -u docker and journalctl -u kubelet show the same output:

kubelet.go:1769] skipping pod synchronization - [container runtime is down]                                                       
kubelet_node_status.go:269] Setting node annotation to enable volume controller attach/detach                                     
kubelet.go:1312] Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory                 
kubelet.service: main process exited, code=exited, status=255/n/a                                                                                                 
Unit kubelet.service entered failed state.
kubelet.service failed.
kubelet_node_status.go:79] Attempting to register node coreos-220-master-0                                                       
kubelet.go:1312] Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory                
kubelet.service: main process exited, code=exited, status=255/n/a                                                                                                 
Unit kubelet.service entered failed state.
kubelet.service failed.
$ rpm -qa | grep docker
docker-client-1.13.1-63.git94f4240.el7.x86_64
docker-rhel-push-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-common-1.13.1-63.git94f4240.el7.x86_64
docker-1.13.1-63.git94f4240.el7.x86_64
docker-novolume-plugin-1.13.1-63.git94f4240.el7.x86_64
docker-lvm-plugin-1.13.1-63.git94f4240.el7.x86_64

@ashcrow
Member Author

ashcrow commented Jul 9, 2018

Great work debugging @Bubblemelon!

@Bubblemelon

Bubblemelon commented Jul 9, 2018

Also thank you @crawford for helping me!

Just to clarify, something on the kubelet side is causing the Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory error.

I've also tried it out with this docker version: source - Sun, 08 Jul 2018 09:39:40 UT

docker-1.13.1-72.git6f36bd4.el7.x86_64
docker-rhel-push-plugin-1.13.1-72.git6f36bd4.el7.x86_64
docker-client-1.13.1-72.git6f36bd4.el7.x86_64
docker-lvm-plugin-1.13.1-72.git6f36bd4.el7.x86_64
docker-common-1.13.1-72.git6f36bd4.el7.x86_64
docker-novolume-plugin-1.13.1-72.git6f36bd4.el7.x86_64

Which gave the same error.

@Bubblemelon

Bubblemelon commented Jul 9, 2018

I'd like to note that openshift/origin-node:latest, i.e. openshift v3.11.0-alpha.0+90e2736-260, is running Kubernetes v1.11.0+d4cacc0.

That version of the kubelet should include this fix.

@Bubblemelon

@derekwaynecarr what are your thoughts on this?

@mrunalp
Member

mrunalp commented Jul 17, 2018

cAdvisor doesn't like /sys:/sys:ro. See google/cadvisor#1843

@Bubblemelon

Bubblemelon commented Jul 17, 2018

This same error,

kubelet.go:1312] Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory

still occurs when /sys is changed to read-write within the kubelet.service file:

.
.
ExecStart=/usr/bin/docker \
  run \
    .
    .
    --volume /sys:/sys:rw \
    .

Note that on RHCOS, the path is named /sys/fs/cgroup/cpu,cpuacct (not cpuacct,cpu).
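
A quick, illustrative check of which name the host actually exposes (run on the RHCOS host, outside the container):

$ ls -d /sys/fs/cgroup/cpu*   # lists the cpu-related cgroup hierarchies as mounted on the host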

If both of these were added under ExecStart=/usr/bin/docker \

    --volume /sys:/sys:rw \
    --volume=/sys/fs/cgroup/cpu,cpuacct:/sys/fs/cgroup/cpuacct,cpu:rw \

This error would occur:

kubelet.service holdoff time over, scheduling restart.
Starting Kubernetes Kubelet...
Started Kubernetes Kubelet.
container_linux.go:247: starting container process caused "process_linux.go:364: container init caused
\"rootfs_linux.go:54: mounting \\\"/sys/fs/cgroup/cpu,cpuacct\\\" to rootfs
\\\"/var/lib/docker/overlay2/8c95a16f4cad1f014091093c62248c6c0f27bcde879606cef6220f7db4521708/merged\\\" at
\\\"/var/lib/docker/overlay2/8c95a16f4cad1f014091093c62248c6c0f27bcde879606cef6220f7db4521708/merged/sys/fs/cgroup/cpuacct,cpu\\\"
caused \\\"no space left on device\\\"\""
 /usr/bin/docker-current: Error response from daemon: oci runtime error: Failed to remove paths: 
map[cpu:/sys/fs/cgroup/cpu,cpuacct/system.slice/docker-afc3a2d6c323ed28a6c7e6586239cb4db8b79b591513eb229ca6fa1eb0bead3b.scope 
cpuacct:/sys/fs/cgroup/cpu,cpuacct/system.slice/docker-afc3a2d6c323ed28a6c7e6586239cb4db8b79b591513eb229ca6fa1eb0bead3b.scope].

@ashcrow
Member Author

ashcrow commented Jul 23, 2018

@crawford do you mind stating what priority you think this should have, or whether the workaround in use should be applied in the RHCOS spins themselves? This would clarify whether @Bubblemelon and @mrunalp should keep digging into this specific issue.

@crawford
Contributor

This needs to be fixed in the Kubelet. If the OS team is going to tackle that, then I think this bug should stay. Otherwise, let's close this and let @derekwaynecarr and his team tackle the issue. Either way, this is a low priority. I have a workaround (it's ugly, but it works).

@ashcrow
Member Author

ashcrow commented Jul 23, 2018

Since this is kubelet-related, we should pass it over to @derekwaynecarr's team and link back to this issue so they don't have to redo all of the good debugging done so far.

@Bubblemelon

Bubblemelon commented Jul 23, 2018

Moved this issue over to openshift/origin

@ashcrow
Member Author

ashcrow commented Jul 23, 2018

Closing since the fix must be done in another codebase.

@ashcrow ashcrow closed this as completed Jul 23, 2018