etcd compaction stopped. #5936

zapman449 · 2018-10-13T00:45:17Z

1. What kops version are you running? The command kops version, will display
this information.
Version 1.10.0 (git-8b52ea6d1)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T09:56:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
Left a functioning cluster alone for a few days. Came back to errors running kubectl apply -f foo.yaml:

Error from server: error when creating "kube_cluster_metrics_rbac.yaml": etcdserver: mvcc: database space exceeded

5. What happened after the commands executed?
apply failed

6. What did you expect to happen?
apply to succeed

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

ETCD relevant bits:

  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: -2a
    - instanceGroup: master-us-west-2c
      name: -2c
    - instanceGroup: master-us-west-2b
      name: -2b
    manager:
      image: kopeio/etcd-manager:latest
    name: main
    version: 3.2.18
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: -2a
    - instanceGroup: master-us-west-2c
      name: -2c
    - instanceGroup: master-us-west-2b
      name: -2b
    manager:
      image: kopeio/etcd-manager:latest
    name: events
    version: 3.2.18

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

Highly relevant ticket #4005
This might be a duplicate.

Logs show 5 minutely compactions, until suddenly stopping 7 days ago:

$ grep compact /var/log/etcd.log | tail -30
2018-10-05 13:38:41.404362 I | mvcc: store.index: compact 153466
2018-10-05 13:38:41.405919 I | mvcc: finished scheduled compaction at 153466 (took 1.127869ms)
2018-10-05 13:43:41.410039 I | mvcc: store.index: compact 153984
2018-10-05 13:43:41.411714 I | mvcc: finished scheduled compaction at 153984 (took 1.234292ms)
2018-10-05 13:48:41.415629 I | mvcc: store.index: compact 154503
2018-10-05 13:48:41.417139 I | mvcc: finished scheduled compaction at 154503 (took 1.141945ms)
2018-10-05 13:53:41.422599 I | mvcc: store.index: compact 155024
2018-10-05 13:53:41.424172 I | mvcc: finished scheduled compaction at 155024 (took 1.195964ms)
2018-10-05 13:58:41.429730 I | mvcc: store.index: compact 155542
2018-10-05 13:58:41.431275 I | mvcc: finished scheduled compaction at 155542 (took 1.143702ms)
2018-10-05 14:03:41.436501 I | mvcc: store.index: compact 156062
2018-10-05 14:03:41.438193 I | mvcc: finished scheduled compaction at 156062 (took 1.208913ms)
2018-10-05 14:08:41.442659 I | mvcc: store.index: compact 156580
2018-10-05 14:08:41.444193 I | mvcc: finished scheduled compaction at 156580 (took 1.13716ms)
2018-10-05 20:30:31.268294 I | etcdserver: compacted raft log at 195002
2018-10-06 13:38:10.697073 I | etcdserver: compacted raft log at 295003
2018-10-07 07:13:22.475704 I | etcdserver: compacted raft log at 395004
2018-10-08 01:45:50.962248 I | etcdserver: compacted raft log at 495005
2018-10-08 20:21:48.255609 I | etcdserver: compacted raft log at 595006
2018-10-09 14:56:39.884253 I | etcdserver: compacted raft log at 695007
2018-10-10 09:32:16.886954 I | etcdserver: compacted raft log at 795008
2018-10-10 21:52:54.004659 I | etcdserver: compacted raft log at 895009
2018-10-11 05:58:08.475681 I | etcdserver: compacted raft log at 995010
2018-10-11 14:03:12.917882 I | etcdserver: compacted raft log at 1095011
2018-10-11 22:06:49.750069 I | etcdserver: compacted raft log at 1195012
2018-10-12 06:12:09.806502 I | etcdserver: compacted raft log at 1295013
2018-10-12 14:17:41.242084 I | etcdserver: compacted raft log at 1395014
2018-10-12 22:21:29.007856 I | etcdserver: compacted raft log at 1495015
2018-10-12 23:43:09.039160 I | mvcc: store.index: compact 812548
2018-10-12 23:43:31.607978 I | mvcc: finished scheduled compaction at 812548 (took 22.568123354s)

Following a modified version of the steps in #4005 I was able to get a compaction going (that last log entry), but after that point, I'm not seeing more scheduled, and I'm not seeing the compactions from the kube-api I'm expecting.

The text was updated successfully, but these errors were encountered:

zapman449 · 2018-10-23T14:02:58Z

so, I've hit this again, this time with a cluster less than 24 hours old. The fix from #4005 did not work this time, it turns out the DNS entries in route53 were pointing to the placeholder IP of 203.0.113.123 .

I manually updated the 6 dns entries in route53, and the etcd cluster (and kube cluster) resumed functioning

zapman449 · 2018-10-23T14:16:23Z

~~Extra wrinkle: kube-apiserver is restarting every minute presumably due to this etcd issue:~~

the issue about kube-apiserver restarting minutely was self-inflicted, and not related to this issue.

zapman449 · 2018-10-23T14:23:43Z

and this certainly isn't helping:

$ sudo grep -v " got ping " /var/log/etcd.log | grep -v " AWS API Request:" | grep -v "updating hosts: map" | grep -v "we are not leader" | grep -v "starting controller iteration" | grep -v -- "peers.*peer.*etcd--2a" | grep -- "10-23 14:"
2018-10-23 14:03:46.757352 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 44.786295ms)
2018-10-23 14:03:46.757559 W | etcdserver: server is likely overloaded
2018-10-23 14:03:46.757593 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 45.037746ms)
2018-10-23 14:03:46.757616 W | etcdserver: server is likely overloaded
2018-10-23 14:12:47.670713 W | etcdserver: apply entries took too long [119.490314ms for 1 entries]
2018-10-23 14:12:47.670762 W | etcdserver: avoid queries with large range/delete range!
2018-10-23 14:12:54.029650 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 43.582398ms)
2018-10-23 14:12:54.030660 W | etcdserver: server is likely overloaded
2018-10-23 14:12:54.030747 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 44.687879ms)
2018-10-23 14:12:54.030904 W | etcdserver: server is likely overloaded

zapman449 · 2018-10-23T15:40:19Z

Provisional finding: kube 1.10 / kops 1.10 does not play nice with etcd 3.2.18 (or etcd 3.2.24). Need to use etcd 3.1.12 . This factoid was derived from https://github.com/kopeio/etcd-manager/blob/f4db9a739e90833cb8f7152f58535e30c84d33a0/images/BUILD#L13-L18 .

zapman449 · 2018-10-23T15:45:02Z

and... I'm in a compaction error again.

fejta-bot · 2019-01-22T12:23:44Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2019-02-21T12:41:33Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2019-03-23T13:24:16Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2019-03-23T13:24:24Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

zapman449 closed this as completed Oct 23, 2018

zapman449 reopened this Oct 24, 2018

zapman449 closed this as completed Oct 24, 2018

zapman449 reopened this Oct 24, 2018

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2019

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2019

k8s-ci-robot closed this as completed Mar 23, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcd compaction stopped. #5936

etcd compaction stopped. #5936

zapman449 commented Oct 13, 2018

zapman449 commented Oct 23, 2018

zapman449 commented Oct 23, 2018 •

edited

Loading

zapman449 commented Oct 23, 2018

zapman449 commented Oct 23, 2018

zapman449 commented Oct 23, 2018

fejta-bot commented Jan 22, 2019

fejta-bot commented Feb 21, 2019

fejta-bot commented Mar 23, 2019

k8s-ci-robot commented Mar 23, 2019

etcd compaction stopped. #5936

etcd compaction stopped. #5936

Comments

zapman449 commented Oct 13, 2018

zapman449 commented Oct 23, 2018

zapman449 commented Oct 23, 2018 • edited Loading

zapman449 commented Oct 23, 2018

zapman449 commented Oct 23, 2018

zapman449 commented Oct 23, 2018

fejta-bot commented Jan 22, 2019

fejta-bot commented Feb 21, 2019

fejta-bot commented Mar 23, 2019

k8s-ci-robot commented Mar 23, 2019

zapman449 commented Oct 23, 2018 •

edited

Loading