This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

etcd degradation #795

Closed
igayoso opened this issue Jul 25, 2017 · 22 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@igayoso

igayoso commented Jul 25, 2017

Hi,

We're using kube-aws for a production Kubernetes cluster. After a few weeks working without problems, we saw the etcd instances in a degraded state. The instances had high load (between 5 and 8) with around 30% CPU and close to 3 GB of RAM usage. Two days after first seeing this "problem", etcd finally went down and we had to reboot the instances to get a fresh etcd. Later we upgraded to the latest etcd version, but after a few days we saw the degradation problem on these instances again.

We are using t2.small instances, with 3 etcd members in different availability zones. We will move to the m4 instance type because it looks like the burstable t instance type is not enough and we are spending all the CPU credits, but first I want to add some logs and screenshots of the problem. I am also not sure whether etcd itself is misbehaving and "eating" all the memory, and whether that is the reason for the load.
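For reference, one way to confirm whether the instances are actually running out of CPU credits is to pull the CPUCreditBalance metric from CloudWatch; a rough sketch (the instance ID and time range are placeholders):

# Placeholder instance ID and time range; a CPUCreditBalance that drops to
# zero right before the load spikes would confirm the burstable-credit theory.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Average \
  --period 300 \
  --start-time 2017-07-23T00:00:00Z \
  --end-time 2017-07-25T00:00:00Z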

Has anyone had similar problems? Let me know if you need more info, logs, whatever...

Screenshots:
selection_001
selection_002
selection_003
selection_004
selection_005

etcd container logs:
logs-etcd-0.txt
logs-etcd-1.txt
logs-etcd-2.txt

@danielfm
Contributor

I'm not sure t2.small instances are a good choice for your etcd cluster. Like you said, I'd go with at least a t2.medium, or an instance with more predictable baseline performance, such as m4.*. I'd also be careful to choose an appropriate configuration for the EBS volume, with a good balance of storage size and IOPS. Good disk IO performance is critical for etcd as well.
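To sanity-check the disk side, a fio run against the etcd data volume gives a rough idea of fdatasync latency, which is what etcd is most sensitive to (the directory below is just an example path, not necessarily where kube-aws mounts the etcd volume):

# Rough write+fdatasync latency check on the etcd data volume; the directory
# must exist on the volume being tested. etcd wants the 99th-percentile
# fdatasync latency to stay in the low single-digit milliseconds.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test \
    --size=22m --bs=2300 --name=etcd-disk-check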

Out of curiosity, how many nodes does this cluster have (controllers + workers)?

@redbaron
Contributor

Interesting that it is rkt run which takes most of the CPU. There is also etcdctl in another screenshot. @igayoso, are you using automatic etcd snapshotting by any chance?

@igayoso
Author

igayoso commented Jul 26, 2017

@danielfm sorry, I'm actually using t2.medium, not t2.small. We will eventually move to m4. I was looking at disk IOPS and it looks fine, no high IO on the disk. The cluster has 2 controllers (m3.large), 3 etcd members (t2.medium) and 3 workers (m3.large).

@redbaron yep, it is strange that rkt run is taking most of the CPU. We are using automatic etcd snapshotting; could that be the reason, or something related?
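For context, a v3 snapshot is basically a single etcdctl call along these lines (endpoints and cert paths are placeholders; this is the generic call, not necessarily what kube-aws's snapshot job runs), so the periodic cost should mostly be disk and network rather than sustained CPU:

# Placeholder endpoint and TLS paths; generic etcd v3 snapshot call.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/certs/etcd-ca.pem \
  --cert=/etc/ssl/certs/etcd-client.pem \
  --key=/etc/ssl/certs/etcd-client-key.pem \
  snapshot save /var/etcd/snapshot.db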

@redbaron
Contributor

Are you using 0.9.7? More specifically, this change (#705) should have replaced rkt run with docker run.

@igayoso
Author

igayoso commented Jul 26, 2017

We're using v0.9.7-rc.4 @redbaron

@redbaron
Contributor

Try 0.9.7; it could be that the rkt fix wasn't added until the final release.

@igayoso
Author

igayoso commented Jul 26, 2017

#705 was included in rc.4, but thanks for the help @redbaron. We have now moved to m4, and after a few hours it looks fine. I will keep you posted...

@redbaron
Contributor

So we need to eliminate rkt run from the other calls. Can you please check where they are coming from and update this ticket, or maybe open a PR with a change similar to #705 if you have time?

Moving to a bigger instance just buys you more time; CPU usage seems to grow steadily over time, so at some point you'll hit the same problem.
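Something like this should be enough to find the callers (the PID is a placeholder; take it from whatever shows up at the top of top/ps):

# Find unit files that still invoke rkt run.
grep -rl "rkt run" /etc/systemd/system /run/systemd/system 2>/dev/null

# For a rkt process that is currently busy, ask systemd which unit owns it
# (replace 12345 with the PID reported by top/ps).
systemctl status 12345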

@igayoso
Author

igayoso commented Jul 26, 2017

@redbaron sure, I will do that, no problem. About the CPU, you're right. Sorry for the quick upgrade, but it is a production service and we cannot take risks. I will keep monitoring and will update as soon as I have any news.

@igayoso
Author

igayoso commented Aug 7, 2017

Sorry for the delay. I was testing and checking the rkt logs but I can't see anything strange. We also moved to c4 instances and have had no more problems. Basically, with c4 instances we have no CPU credits and no CPU throttling, so that was the main reason for the degradation. I compared the same version in the same cluster between both instance types and saw no problems with rkt or anything similar. I think you can close the ticket if you agree. Thanks!!

@igayoso
Author

igayoso commented Aug 11, 2017

After a few days with the bigger instances we saw similar behavior: CPU was increasing like in the past, just more slowly, since we no longer have CPU limitations. Yesterday we upgraded the k8s cluster and moved to docker instead of rkt. We will see over the next days...

@redbaron
Contributor

@igayoso, how is it holding up?

@igayoso
Author

igayoso commented Aug 22, 2017

Hi,

I was checking the graphs a few minutes ago and we have similar issues. With the bigger instances we survive more days, but it looks like in about a month we will have to restart etcd again, so no improvement after the rkt change. Another detail about these instances is that we have huge traffic on this cluster, I mean a lot of requests from clients, but I'm not sure whether that could be the problem or not. Please let me know if you want more tests or information. Thanks!
image

@redbaron
Contributor

@igayoso, there is a PR to update etcd to 3.2.5: #845. If you apply similar changes you might find it behaving better (I recall somebody mentioning that 3.2.x had some CPU usage optimizations).

@redbaron
Contributor

While nodes are running, would you mind checking which process is the busiest?
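A plain ps capture is enough, for example:

# Top 10 processes by CPU and by resident memory, captured non-interactively
# so it can be run over SSH or dropped into a cron job.
ps -eo pid,pcpu,pmem,rss,args --sort=-pcpu | head -n 11
ps -eo pid,pcpu,pmem,rss,args --sort=-rss | head -n 11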

@danielfm
Contributor

danielfm commented Aug 31, 2017

I'm experiencing a strange memory usage pattern after upgrading etcd from v3.1.3 to v3.2.6:

image

We can see very clearly that memory consumption used to stay stable until the update, after which it went through the roof.
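Pulling the memory series straight from etcd's /metrics endpoint on both versions should make the comparison concrete; something like this (plain HTTP shown, a TLS-only listener would also need the client cert flags):

# Memory-related series exposed by etcd itself (process RSS, Go heap, goroutines).
curl -s http://127.0.0.1:2379/metrics | \
  grep -E '^(process_resident_memory_bytes|go_memstats_heap_inuse_bytes|go_goroutines)'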

@redbaron
Contributor

@danielfm sounds like something worth reporting to the etcd GitHub repo.

@flah00
Contributor

flah00 commented Sep 16, 2017

I feel like I've run into a similar issue. There was a discussion about etcdadm-save and perhaps etcdadm-check having something to do with the degradation. I decreased the save frequency, in userdata/cloud-config-etcd, from every minute to every five minutes. I saw a baseline decrease in CPU usage, but CPU and "outbound traffic" are still on an upward trend. I installed etcd 3.2.7 using kube-aws 0.9.8. My cluster.yaml enables both disaster recovery and snapshots.
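For anyone wanting to do the same, and assuming the save job is driven by a systemd timer, the frequency change boils down to something like this drop-in (the unit and file names here are assumptions; the real definition lives in cloud-config-etcd):

# Illustrative only: a drop-in at
# /etc/systemd/system/etcdadm-save.timer.d/10-interval.conf
# that changes the schedule from every minute to every five minutes.
[Timer]
OnCalendar=
OnCalendar=*:0/5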

The following snippet includes an extra unit, which runs on my etcd nodes. I added the unit after the CPU began mysteriously increasing, so it certainly isn't the cause of the mysterious load.

etcd:
  version: 3.2.7
  snapshot:
    automated: true
  disasterRecovery:
    automated: true
  count: 3
  subnets:
    - name: ManagedPrivateSubnet1d
  customFiles:
    - path: /etc/profile.d/aws.sh
      permissions: 0644
      owner: root:root
      content: |
        alias aws="docker run -v \$PWD:/work --workdir /work --env-file /etc/environment --rm quay.io/coreos/awscli aws"
    - path: /etc/sumologic-environment
      permissions: 0600
      owner: root:root
      # TODO MANUALLY MANAGED!
      content: |
        COLLECTOR_URL=...
        SOURCE_CATEGORY_PREFIX=zoo/
        LOG_FORMAT=text
        KUBERNETES_META=true
        FLUENTD_SOURCE=systemd
  customSystemdUnits:
    - name: fluentd-pos.service
      command: start
      runtime: true
      content: |
        [Unit]
        Description=Setup fluentd pos directory
        [Service]
        Type=oneshot
        RemainAfterExit=true
        ExecStart=/usr/bin/mkdir -p -m 1777 /var/run/fluentd-pos
    - name: sumologic.service
      command: start
      runtime: true
      content: |
        [Unit]
        Description=Send logs to central server
        AssertFileNotEmpty=/etc/sumologic-environment
        Wants=docker.service
        After=docker.service
        [Service]
        ExecStartPre=/bin/sh -c '(eval $(docker run --rm --env-file /etc/environment quay.io/coreos/awscli aws ecr get-login --region us-east-1))'
        ExecStart=/usr/bin/docker run -l sumo=true -m 256m --cpu-quota=25000 --env-file /etc/sumologic-environment -v /var/lib/docker -v /var/lib/rkt -v /var/log:/mnt/log 221645429527.dkr.ecr.us-east-1.amazonaws.com/fluentd-kubernetes-sumologic:v1.5
        ExecStopPost=/usr/bin/docker rm $(/usr/bin/docker ps -aq --filter label=sumo=true)

screen shot 2017-09-16 at 11 02 34 am

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 21, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
