This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

etcd degradation #795

Closed
igayoso opened this issue Jul 25, 2017 · 22 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@igayoso

igayoso commented Jul 25, 2017

Hi,

We're using kube-aws for a production Kubernetes cluster. After a few weeks working without problems, we saw the etcd instances in a degraded state. The instances had high load (between 5 and 8) with around 30% CPU and close to 3 GB of RAM usage. Two days after first seeing this "problem", etcd finally went down and we had to reboot the instances to get a fresh etcd. Later we upgraded to the latest etcd version, but after a few days we saw the degradation problem on these instances again.

We are using t2.small instances, with 3 etcd members in different availability zones. We will move to the m4 instance type because it looks like the burstable t instance type is not enough and we are spending all the CPU credits, but first I want to add some logs and screenshots of the problem. I am also not sure whether etcd itself is misbehaving and "eating" all the memory, and whether that is the reason for the load.
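For reference, one way to confirm whether the instances are actually running out of CPU credits is to pull the CPUCreditBalance metric from CloudWatch; a rough sketch (the instance ID and time range are placeholders):

# Placeholder instance ID and time range; a CPUCreditBalance that drops to
# zero right before the load spikes would confirm the burstable-credit theory.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Average \
  --period 300 \
  --start-time 2017-07-23T00:00:00Z \
  --end-time 2017-07-25T00:00:00Z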

Has anyone had similar problems? Let me know if you need more info, logs, whatever...

Screenshots:
selection_001
selection_002
selection_003
selection_004
selection_005

etcd container logs:
logs-etcd-0.txt
logs-etcd-1.txt
logs-etcd-2.txt

@danielfm
Contributor

I'm not sure t2.small instances are a good choice for your etcd cluster. Like you said, I'd go with at least a t2.medium, or an instance with more predictable baseline performance, such as m4.*. I'd also be careful to choose an appropriate configuration for the EBS volume, with a good balance of storage size and IOPS. Good disk IO performance is critical for etcd as well.
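To sanity-check the disk side, a fio run against the etcd data volume gives a rough idea of fdatasync latency, which is what etcd is most sensitive to (the directory below is just an example path, not necessarily where kube-aws mounts the etcd volume):

# Rough write+fdatasync latency check on the etcd data volume; the directory
# must exist on the volume being tested. etcd wants the 99th-percentile
# fdatasync latency to stay in the low single-digit milliseconds.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test \
    --size=22m --bs=2300 --name=etcd-disk-check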

Out of curiosity, how many nodes does this cluster have (controllers + workers)?

@redbaron
Contributor

Interesting that it is rkt run which takes most of the CPU. There is also etcdctl in another screenshot. @igayoso, are you using automatic etcd snapshotting by any chance?

@igayoso
Author

igayoso commented Jul 26, 2017

@danielfm sorry, I'm actually using t2.medium, not t2.small. We will eventually move to m4. I was looking at disk IOPS and it looks fine, no high IO on the disk. The cluster has 2 controllers (m3.large), 3 etcd members (t2.medium) and 3 workers (m3.large).

@redbaron yep, it is strange that rkt run is taking most of the CPU. We are using automatic etcd snapshotting; could that be the reason, or something related?
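For context, a v3 snapshot is basically a single etcdctl call along these lines (endpoints and cert paths are placeholders; this is the generic call, not necessarily what kube-aws's snapshot job runs), so the periodic cost should mostly be disk and network rather than sustained CPU:

# Placeholder endpoint and TLS paths; generic etcd v3 snapshot call.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/certs/etcd-ca.pem \
  --cert=/etc/ssl/certs/etcd-client.pem \
  --key=/etc/ssl/certs/etcd-client-key.pem \
  snapshot save /var/etcd/snapshot.db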

@redbaron
Contributor

Are you using 0.9.7? More specifically, this change (#705) should have replaced rkt run with docker run.

@igayoso
Author

igayoso commented Jul 26, 2017

We're using v0.9.7-rc.4 @redbaron

@redbaron
Contributor

Try 0.9.7; it could be that the rkt fix wasn't added until the final release.

@igayoso
Author

igayoso commented Jul 26, 2017

#705 was included in rc.4, but thanks for the help @redbaron. We have now moved to m4, and after a few hours it looks fine. I will keep you posted...

@redbaron
Contributor

So we need to eliminate rkt run from the other calls. Can you please check where they are coming from and update this ticket, or maybe open a PR with a change similar to #705 if you have time?

Moving to a bigger instance just buys you more time; CPU usage seems to grow steadily over time, so at some point you'll hit the same problem.
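Something like this should be enough to find the callers (the PID is a placeholder; take it from whatever shows up at the top of top/ps):

# Find unit files that still invoke rkt run.
grep -rl "rkt run" /etc/systemd/system /run/systemd/system 2>/dev/null

# For a rkt process that is currently busy, ask systemd which unit owns it
# (replace 12345 with the PID reported by top/ps).
systemctl status 12345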

@igayoso
Author

igayoso commented Jul 26, 2017

@redbaron sure, I will do that, no problem. About the CPU, you're right. Sorry for the quick upgrade, but it is a production service and we cannot take risks. I will keep monitoring and will update as soon as I have any news.

@igayoso
Author

igayoso commented Aug 7, 2017

Sorry for the delay. I was testing and checking the rkt logs but I can't see anything strange. We also moved to c4 instances and have had no more problems. Basically, with c4 instances we have no CPU credits and no CPU throttling, so that was the main reason for the degradation. I compared the same version in the same cluster between both instance types and saw no problems with rkt or anything similar. I think you can close the ticket if you agree. Thanks!!

@igayoso
Author

igayoso commented Aug 11, 2017

After a few days with the bigger instances we saw similar behavior: CPU was increasing like in the past, just more slowly, since we no longer have CPU limitations. Yesterday we upgraded the k8s cluster and moved to docker instead of rkt. We will see over the next days...

@redbaron
Contributor

@igayoso, how is it holding up?

@igayoso
Author

igayoso commented Aug 22, 2017

Hi,

I was checking the graphs a few minutes ago and we have similar issues. With the bigger instances we survive more days, but it looks like in about a month we will have to restart etcd again, so no improvement after the rkt change. Another detail about these instances is that we have huge traffic on this cluster, I mean a lot of requests from clients, but I'm not sure whether that could be the problem or not. Please let me know if you want more tests or information. Thanks!
image

@redbaron
Contributor

@igayoso, there is a PR to update etcd to 3.2.5: #845. If you apply similar changes you might find it behaving better (I recall somebody mentioning that 3.2.x had some CPU usage optimizations).

@redbaron
Contributor

While nodes are running, would you mind checking which process is the busiest?
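A plain ps capture is enough, for example:

# Top 10 processes by CPU and by resident memory, captured non-interactively
# so it can be run over SSH or dropped into a cron job.
ps -eo pid,pcpu,pmem,rss,args --sort=-pcpu | head -n 11
ps -eo pid,pcpu,pmem,rss,args --sort=-rss | head -n 11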

@danielfm
Contributor

danielfm commented Aug 31, 2017

I'm experiencing a strange memory usage pattern after upgrading etcd from v3.1.3 to v3.2.6:

image

We can see very clearly that memory consumption used to stay stable until the update, after which it went through the roof.
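Pulling the memory series straight from etcd's /metrics endpoint on both versions should make the comparison concrete; something like this (plain HTTP shown, a TLS-only listener would also need the client cert flags):

# Memory-related series exposed by etcd itself (process RSS, Go heap, goroutines).
curl -s http://127.0.0.1:2379/metrics | \
  grep -E '^(process_resident_memory_bytes|go_memstats_heap_inuse_bytes|go_goroutines)'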

@redbaron
Contributor

@danielfm sounds like something worth reporting to the etcd GitHub repo.

@flah00
Contributor

flah00 commented Sep 16, 2017

I feel like I've run into a similar issue. There was a discussion about etcdadm-save and perhaps etcdadm-check having something to do with the degradation. I decreased the save frequency, in userdata/cloud-config-etcd, from every minute to every five minutes. I saw a baseline decrease in CPU usage, but CPU and "outbound traffic" are still on an upward trend. I installed etcd 3.2.7 using kube-aws 0.9.8. My cluster.yaml enables both disaster recovery and snapshots.
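For anyone wanting to do the same, and assuming the save job is driven by a systemd timer, the frequency change boils down to something like this drop-in (the unit and file names here are assumptions; the real definition lives in cloud-config-etcd):

# Illustrative only: a drop-in at
# /etc/systemd/system/etcdadm-save.timer.d/10-interval.conf
# that changes the schedule from every minute to every five minutes.
[Timer]
OnCalendar=
OnCalendar=*:0/5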

The following snippet includes an extra unit, which runs on my etcd nodes. I added the unit after the CPU began mysteriously increasing, so it certainly isn't the cause of the mysterious load.

etcd:
  version: 3.2.7
  snapshot:
    automated: true
  disasterRecovery:
    automated: true
  count: 3
  subnets:
    - name: ManagedPrivateSubnet1d
  customFiles:
    - path: /etc/profile.d/aws.sh
      permissions: 0644
      owner: root:root
      content: |
        alias aws="docker run -v \$PWD:/work --workdir /work --env-file /etc/environment --rm quay.io/coreos/awscli aws"
    - path: /etc/sumologic-environment
      permissions: 0600
      owner: root:root
      # TODO MANUALLY MANAGED!
      content: |
        COLLECTOR_URL=...
        SOURCE_CATEGORY_PREFIX=zoo/
        LOG_FORMAT=text
        KUBERNETES_META=true
        FLUENTD_SOURCE=systemd
  customSystemdUnits:
    - name: fluentd-pos.service
      command: start
      runtime: true
      content: |
        [Unit]
        Description=Setup fluentd pos directory
        [Service]
        Type=oneshot
        RemainAfterExit=true
        ExecStart=/usr/bin/mkdir -p -m 1777 /var/run/fluentd-pos
    - name: sumologic.service
      command: start
      runtime: true
      content: |
        [Unit]
        Description=Send logs to central server
        AssertFileNotEmpty=/etc/sumologic-environment
        Wants=docker.service
        After=docker.service
        [Service]
        ExecStartPre=/bin/sh -c '(eval $(docker run --rm --env-file /etc/environment quay.io/coreos/awscli aws ecr get-login --region us-east-1))'
        ExecStart=/usr/bin/docker run -l sumo=true -m 256m --cpu-quota=25000 --env-file /etc/sumologic-environment -v /var/lib/docker -v /var/lib/rkt -v /var/log:/mnt/log 221645429527.dkr.ecr.us-east-1.amazonaws.com/fluentd-kubernetes-sumologic:v1.5
        ExecStopPost=/usr/bin/docker rm $(/usr/bin/docker ps -aq --filter label=sumo=true)

screen shot 2017-09-16 at 11 02 34 am

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 21, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
