etcd compaction stopped. #5936
Comments
So, I've hit this again, this time with a cluster less than 24 hours old. The fix from #4005 did not work this time; it turned out the DNS entries in Route 53 were pointing to the placeholder IP 203.0.113.123. I manually updated the 6 DNS entries in Route 53, and the etcd cluster (and kube cluster) resumed functioning.
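A quick way to spot this condition is to resolve each etcd DNS name and compare it against the placeholder address. The following is a minimal sketch, not part of kops: the function name `find_placeholder_records` and the hostnames in the usage example are hypothetical, and the resolver is injectable so the check can be tried without touching real DNS.

```python
import socket

# Placeholder address reported in the comment above.
PLACEHOLDER_IP = "203.0.113.123"


def find_placeholder_records(hostnames, resolve=socket.gethostbyname):
    """Return {hostname: ip} for names that look unhealthy.

    A name is reported if it still resolves to PLACEHOLDER_IP, or with the
    value None if it does not resolve at all.
    """
    suspect = {}
    for name in hostnames:
        try:
            ip = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            suspect[name] = None
            continue
        if ip == PLACEHOLDER_IP:
            suspect[name] = ip
    return suspect


# Hypothetical record names and addresses, for illustration only.
fake_dns = {
    "etcd-a.internal.my.example.com": "203.0.113.123",
    "etcd-b.internal.my.example.com": "10.0.1.12",
}
result = find_placeholder_records(fake_dns, resolve=fake_dns.__getitem__)
print(result)
```

Against a live cluster, the default resolver would be used and the list of names would come from whatever records exist for the cluster in Route 53.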
The issue with kube-apiserver restarting every minute was self-inflicted and not related to this issue.
and this certainly isn't helping:
$ sudo grep -v " got ping " /var/log/etcd.log | grep -v " AWS API Request:" | grep -v "updating hosts: map" | grep -v "we are not leader" | grep -v "starting controller iteration" | grep -v -- "peers.*peer.*etcd--2a" | grep -- "10-23 14:"
2018-10-23 14:03:46.757352 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 44.786295ms)
2018-10-23 14:03:46.757559 W | etcdserver: server is likely overloaded
2018-10-23 14:03:46.757593 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 45.037746ms)
2018-10-23 14:03:46.757616 W | etcdserver: server is likely overloaded
2018-10-23 14:12:47.670713 W | etcdserver: apply entries took too long [119.490314ms for 1 entries]
2018-10-23 14:12:47.670762 W | etcdserver: avoid queries with large range/delete range!
2018-10-23 14:12:54.029650 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.582398ms)
2018-10-23 14:12:54.030660 W | etcdserver: server is likely overloaded
2018-10-23 14:12:54.030747 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 44.687879ms)
2018-10-23 14:12:54.030904 W | etcdserver: server is likely overloaded
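To get a sense of how often and how badly heartbeats are being missed, the warnings above can be tallied. This is a rough sketch that assumes the message format shown in the log excerpt; `summarize_heartbeat_overruns` is a hypothetical helper, not an etcd tool.

```python
import re
from statistics import mean

# Matches the overrun reported in the "failed to send out heartbeat" warning,
# using the wording seen in the log excerpt above.
HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<timeout>[\d.]+)ms timeout for (?P<overrun>[\d.]+)ms\)"
)


def summarize_heartbeat_overruns(lines):
    """Return (count, mean_ms, max_ms) of heartbeat overruns in the given lines."""
    overruns = [
        float(match.group("overrun"))
        for match in map(HEARTBEAT_RE.search, lines)
        if match
    ]
    if not overruns:
        return 0, 0.0, 0.0
    return len(overruns), mean(overruns), max(overruns)


sample = [
    "2018-10-23 14:03:46.757352 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 44.786295ms)",
    "2018-10-23 14:03:46.757559 W | etcdserver: server is likely overloaded",
    "2018-10-23 14:12:54.029650 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.582398ms)",
]
count, avg_ms, worst_ms = summarize_heartbeat_overruns(sample)
print(count, avg_ms, worst_ms)
```

On a real node the lines would be read from /var/log/etcd.log instead of the inline sample.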
Provisional finding: kube 1.10 / kops 1.10 does not play nicely with etcd 3.2.18 (or etcd 3.2.24); etcd 3.1.12 is needed instead. This was derived from https://github.com/kopeio/etcd-manager/blob/f4db9a739e90833cb8f7152f58535e30c84d33a0/images/BUILD#L13-L18 .
and... I'm in a compaction error again.
@fejta-bot: Closing this issue.
1. What kops version are you running? The command kops version will display this information.
Version 1.10.0 (git-8b52ea6d1)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Left a functioning cluster alone for a few days. Came back to errors running kubectl apply -f foo.yaml:
Error from server: error when creating "kube_cluster_metrics_rbac.yaml": etcdserver: mvcc: database space exceeded
5. What happened after the commands executed?
apply failed
6. What did you expect to happen?
apply to succeed
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
ETCD relevant bits:
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
Highly relevant ticket #4005
This might be a duplicate.
Logs show compactions every 5 minutes, until they suddenly stopped 7 days ago:
Following a modified version of the steps in #4005, I was able to get a compaction going (that last log entry), but since then I'm not seeing any more scheduled, and I'm not seeing the compactions I'm expecting from the kube-apiserver.
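For reference, the general recovery sequence for "mvcc: database space exceeded" in the etcd maintenance documentation is: read the current revision, compact to it, defragment, then disarm the alarm. The sketch below only builds those etcdctl invocations and executes nothing. It is not necessarily the exact procedure from #4005. The JSON shape assumed for `etcdctl endpoint status --write-out=json`, the sample values, and the endpoint address are assumptions for illustration; both helper names are hypothetical.

```python
import json


def current_revision(endpoint_status_json):
    """Extract the store revision from `etcdctl endpoint status --write-out=json`.

    Assumes a JSON list of objects, each with Status.header.revision. The
    lowest revision reported is returned, so the compaction target is one
    that every listed endpoint has already reached.
    """
    statuses = json.loads(endpoint_status_json)
    return min(entry["Status"]["header"]["revision"] for entry in statuses)


def recovery_commands(revision, endpoints):
    """Build the compact / defrag / alarm disarm sequence as argv lists.

    The commands are v3 API commands, so they need ETCDCTL_API=3 in the
    environment on etcd 3.1 and 3.2.
    """
    base = ["etcdctl", f"--endpoints={endpoints}"]
    return [
        base + ["compact", str(revision)],
        base + ["defrag"],
        base + ["alarm", "disarm"],
    ]


# Assumed output shape with made-up values, for illustration only.
sample_status = json.dumps(
    [{"Endpoint": "127.0.0.1:4001", "Status": {"header": {"revision": 1516}}}]
)
commands = recovery_commands(current_revision(sample_status), "127.0.0.1:4001")
for argv in commands:
    print(" ".join(argv))
```

Building argv lists rather than running anything keeps the sketch safe to try; each list can be passed to subprocess.run once the endpoint and revision have been checked by hand.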