etcd degradation #795
Comments
I'm not sure t2.small instances are a good choice for your etcd cluster. Like you said, I'd go with at least a t2.medium, or with an instance with more predictable baseline performance, such as m4.*. I'd also be careful to choose an appropriate configuration for the EBS volume, with a good balance of storage size and IOPS; good disk IO performance is critical for etcd as well. Out of curiosity, how many nodes (controllers+workers) does this cluster have?
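For example, a quick way to see whether disk latency is actually hurting etcd is to look at its own disk histograms. This is just a rough sketch; it assumes the metrics endpoint is plain HTTP on localhost:2379, so add the client cert flags if your cluster has TLS enabled:

  # Dump etcd's WAL fsync / backend commit latency histograms
  # (plain-HTTP metrics assumed; use --cacert/--cert/--key with TLS)
  curl -s http://localhost:2379/metrics \
    | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'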
interesting that it is
@danielfm sorry, but I'm using t2.medium instead of t2.small. Eventually we will move to m4. I was looking at disk IOPS and it looks fine, no high IO on the disk. The cluster has 2 controllers (m3.large), 3 etcd (t2.medium) and 3 workers (m3.large). @redbaron yep, it is strange.
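If the burst credits are the suspect, the CPUCreditBalance CloudWatch metric for each etcd instance is worth graphing over time; roughly something like this (a sketch: the instance ID is a placeholder and GNU date is assumed):

  # Remaining CPU credits over the last 6 hours for one etcd member
  # (i-0123456789abcdef0 is a placeholder; repeat per instance)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time "$(date -u -d '-6 hours' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 --statistics Average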
Are you using 0.9.7? More specifically, this change #705 should be replacing
We're using v0.9.7-rc.4 @redbaron
Try 0.9.7; it could be that the rkt fix wasn't added until the final release.
So we need to eliminate rkt run from the other calls. Can you please check where they are coming from and update this ticket, or maybe open a PR with a change similar to #705 if you have time? Moving to a bigger instance just buys you more time: CPU usage seems to grow steadily, so at some point you'll hit the same problem.
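For anyone trying to track this down on a node, something along these lines should show where the extra rkt invocations come from; a sketch, assuming the etcd members run CoreOS/rkt as provisioned by kube-aws:

  # Exited rkt pods piling up usually point at a unit or timer that
  # shells out to `rkt run` repeatedly
  rkt list --full

  # Which systemd units currently own rkt processes
  ps -eo pid,ppid,unit,args | grep '[r]kt'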
@redbaron sure, I will do that, no problem. About CPU you're right. Sorry for the fast upgrade, but it is a production service and we cannot take risks. I will continue monitoring and will post an update as soon as I have any news.
Sorry for the delay. I was testing and checking the rkt logs but I can't see anything strange. We also moved to C4 instances and have had no more problems. Basically, C4 instances don't have CPU credits or throttling, so the credit exhaustion was the main reason for the degradation. I checked the same version in the same cluster and compared both, and there were no problems with rkt or similar. I think you can close the ticket if you agree. Thanks!!
After a few days with the bigger instances we saw similar behaviour: CPU kept increasing like before, only over a longer period, because we no longer have CPU limitations. Yesterday we upgraded the k8s cluster and moved to docker instead of rkt. We will see over the next few days...
@igayoso, how is it holding up?
While the nodes are running, would you mind checking which process is the busiest?
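Something like this would do (cumulative CPU time helps spot short-lived but frequent processes):

  # Top CPU consumers, with cumulative CPU time per process
  ps -eo pid,pcpu,pmem,time,comm --sort=-pcpu | head -n 11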
@danielfm sounds like something worth reporting on the etcd GitHub repo.
I feel like I've run into a similar issue. There was a discussion about etcdadm-save, and perhaps etcdadm-check, having something to do with the degradation, so I decreased the save frequency (a quick way to check how often those units actually fire is sketched after the snippet below). The following snippet includes an extra unit which runs on my etcd nodes; I added that unit after CPU began mysteriously increasing, so it certainly isn't the cause of the mysterious load.

etcd:
  version: 3.2.7
  snapshot:
    automated: true
  disasterRecovery:
    automated: true
  count: 3
  subnets:
  - name: ManagedPrivateSubnet1d
  customFiles:
  - path: /etc/profile.d/aws.sh
    permissions: 0644
    owner: root:root
    content: |
      alias aws="docker run -v \$PWD:/work --workdir /work --env-file /etc/environment --rm quay.io/coreos/awscli aws"
  - path: /etc/sumologic-environment
    permissions: 0600
    owner: root:root
    # TODO MANUALLY MANAGED!
    content: |
      COLLECTOR_URL=...
      SOURCE_CATEGORY_PREFIX=zoo/
      LOG_FORMAT=text
      KUBERNETES_META=true
      FLUENTD_SOURCE=systemd
  customSystemdUnits:
  - name: fluentd-pos.service
    command: start
    runtime: true
    content: |
      [Unit]
      Description=Setup fluentd pos directory
      [Service]
      Type=oneshot
      RemainAfterExit=true
      ExecStart=/usr/bin/mkdir -p -m 1777 /var/run/fluentd-pos
  - name: sumologic.service
    command: start
    runtime: true
    content: |
      [Unit]
      Description=Send logs to central server
      AssertFileNotEmpty=/etc/sumologic-environment
      Wants=docker.service
      After=docker.service
      [Service]
      ExecStartPre=/bin/sh -c '(eval $(docker run --rm --env-file /etc/environment quay.io/coreos/awscli aws ecr get-login --region us-east-1))'
      ExecStart=/usr/bin/docker run -l sumo=true -m 256m --cpu-quota=25000 --env-file /etc/sumologic-environment -v /var/lib/docker -v /var/lib/rkt -v /var/log:/mnt/log 221645429527.dkr.ecr.us-east-1.amazonaws.com/fluentd-kubernetes-sumologic:v1.5
      ExecStopPost=/usr/bin/docker rm $(/usr/bin/docker ps -aq --filter label=sumo=true)
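To see how often the etcdadm-save / etcdadm-check units mentioned above actually fire, and how much CPU they have burned, something like this should work on a member node; the unit names are an assumption based on the comment above, so adjust the pattern if your kube-aws version names them differently:

  # List the etcdadm timers and their last/next trigger times
  # (unit names assumed; adjust the grep pattern as needed)
  systemctl list-timers --all | grep -i etcdadm

  # Cumulative CPU time per unit (needs CPUAccounting enabled to be meaningful)
  systemctl show -p CPUUsageNSec etcdadm-save.service etcdadm-check.service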
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi,
We're using kube-aws for a production Kubernetes cluster. After a few weeks working without problems we saw the etcd instances in a degraded state. The instances had high load (between 5 and 8) with around 30% CPU and nearly 3 GB RAM usage. Two days after seeing this "problem", etcd finally went down and we had to reboot the instances to get a fresh etcd. Later we upgraded to the latest etcd version, but after a few days we saw the degradation problem on these instances again.
We are using t2.small instances, with 3 etcd members in different zones. We will move to the m4 instance type because it looks like the burstable t instance type is not enough and we are spending all the CPU credits, but I want to add some logs and screenshots of the problem. Also, I am not sure whether etcd is misbehaving and "eating" all the memory, and whether that is the reason for the load.
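A rough way to check whether the etcd process itself is the one eating the memory (a sketch; it assumes the metrics endpoint is plain HTTP on localhost:2379, adjust for TLS):

  # Resident memory and CPU of the etcd process
  # (visible from the host even when it runs under rkt)
  ps -o pid,rss,pcpu,etime,args -C etcd

  # Memory as reported by etcd's own metrics
  curl -s http://localhost:2379/metrics \
    | grep -E 'process_resident_memory_bytes|go_memstats_heap_inuse_bytes'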
Is anyone else having similar problems? Let me know if you need more info, logs, whatever...
Screenshots:
etcd container logs:
logs-etcd-0.txt
logs-etcd-1.txt
logs-etcd-2.txt