Severe regression in calico-3.16 with etcdv3 backend #4109

Closed
Wiston999 opened this issue Oct 20, 2020 · 12 comments · Fixed by projectcalico/libcalico-go#1337

Comments

@Wiston999

The calico-node and calico-kube-controller pods start spamming the following log lines after a full etcd cluster restart (with etcd downtime):

2020-10-20 13:42:23.762 [ERROR][1] watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-20 13:42:23.763 [INFO][1] watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/nodes"

Expected Behavior

The same scenario was tested with calico v3.15.3, and only a small burst of error messages (about 10) is logged until etcd is back online.

Current Behavior

Log messages are repeated endlessly at a rate of 6-9 per second per pod. It is not clear whether calico reconnects to etcd successfully or remains disconnected, in which case the pods need to be restarted.

Possible Solution

If these error messages are expected, they should at least be logged at a lower log level, as they are currently logged as ERROR.

Steps to Reproduce (for bugs)

  1. Deploy calico version 3.16+ using etcd backend.
  2. Stop all etcd cluster nodes.
  3. Restart the etcd cluster.

Your Environment

  • Calico version 3.16.1 and 3.16.3
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.19.2
  • Operating System and version: Debian buster
  • ETCD version: 3.4.13

ravilr commented Oct 21, 2020

@caseydavenport @neiljerram Sorry to tag you directly, but this needs your attention.

We are also seeing this issue with multiple Calico installations (multiple Kube clusters) using:

  • calico on etcdv3 backend.
  • any Calico version >= 3.16

The symptoms we are seeing: the calico-node and calico-kube-controller pods never seem to recover from the watch channel error once they lose their connection to the etcd3 cluster backend. This is easily reproducible by triggering a rolling restart of the etcd cluster servers. After that, the calico-node and controller pods remain in a 'watch channel error' loop and never recover until the pods are deleted/recreated.

Once the calico pods go into the above state, the etcd server's etcd_debugging_mvcc_watcher_total metric keeps going up and the corresponding etcd server's Go heap memory usage starts climbing.

I have yet to trace this back by stepping down to an older version of calico (3.15 or 3.14) to see where this broken behavior was introduced, but projectcalico/libcalico-go#1247 could be where this regression was introduced. Please check.


ravilr commented Oct 21, 2020

The logs below are from a calico-node pod running version node-v3.16.1:

2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [INFO][79] felix/watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/globalnetworkpolicies"
2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [INFO][79] felix/watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/globalnetworksets"

When calico-node or the singleton calico-kube-controller pod gets into this bad state, it keeps spewing these errors for watches on all paths. All of the watchers (felix/watcher, tunnel-ip-allocator/watcher, confd/watcher) log the same error in a continuous loop, generating GBs of logs in a matter of hours. Pods created by the CNI plugin are never advertised by bird or programmed by Felix, so newly created pods on such nodes end up with broken networking.


ravilr commented Oct 21, 2020

I tested the etcd cluster rolling restart with calico on v3.15.3. Both the calico-kube-controller and calico-node pods self-recover from watch channel closure/disruption without issues.

So, the regression seems to have been introduced in v3.16.0 or later. It appears to be projectcalico/libcalico-go#1247, which swallows watch errors in lib/backend/etcd3/watcher.go; this matches my observation that the number of etcd watchers created on the etcd servers keeps going up once the library gets into this state.
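
For illustration, here is a minimal Go sketch of the failure mode being described, assuming the etcd 3.4 Go client (go.etcd.io/etcd/clientv3); it is not the actual libcalico-go code and the function name is hypothetical. If the compaction error delivered on the watch channel is only logged and the watcher is recreated at the same, now-compacted revision, etcd cancels every new watch immediately, so the client loops forever creating watchers, which would explain the ever-growing watcher count on the etcd servers.

package etcdwatchsketch

import (
	"context"
	"log"

	"go.etcd.io/etcd/clientv3"
)

// brokenWatchLoop illustrates the suspected bug pattern: rev is never reset
// after a compaction error, so every recreated watch fails in the same way.
func brokenWatchLoop(ctx context.Context, cli *clientv3.Client, prefix string, rev int64) {
	for ctx.Err() == nil {
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev))
		for wresp := range wch {
			if err := wresp.Err(); err != nil {
				// e.g. "etcdserver: mvcc: required revision has been compacted"
				log.Printf("Watch channel error error=%v", err)
				continue
			}
			// handle wresp.Events in the normal case
		}
		// The server closed the channel; the watcher is recreated with the
		// same stale rev, producing the endless loop and the climbing
		// etcd_debugging_mvcc_watcher_total metric on the etcd servers.
	}
}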

@Wiston999 can you please update this issue title to 'severe regression in calico-3.16 with etcdv3 backend' ?

@Wiston999 (Author)

@Wiston999 can you please update this issue title to 'severe regression in calico-3.16 with etcdv3 backend' ?

Sure, thanks for your analysis.

Wiston999 changed the title from 'Logspam after ETCD cluster is offline' to 'Severe regression in calico-3.16 with etcdv3 backend' on Oct 21, 2020.

Sarga commented Oct 21, 2020

I have the same issue with etcd 3.4.14 and calico 3.16.3.

@nelljerram (Member)

@ravilr @Wiston999 @Sarga Please can you clarify:

  1. Is there any symptom beyond the crazy amount of logging? ("Severe regression" sounds like it would mean something more functional, as well as the crazy logging.)

  2. Is there ever recovery once the etcd cluster is fully up again? Original description says "Not sure if calico reconnects back to ETCD successfully or it remains disconnected from etcd and so the pods need to be restarted." and @ravilr 's comments seem to indicate no recovery, but I'd like to be sure.

@Wiston999 (Author)

Hi,
I can describe the following three scenarios:

  1. I have also experienced networking errors in new pods created while calico has been spamming those log messages for some hours ('No route to host'; checking with ip route, the route from the pod IP to the cali* interface on the node is missing). This is fixed when the calico pods are restarted.
  2. New pods' networking is set up properly if the pods are created just after etcd is restarted. That is, there is logspam but new pods are set up correctly, so I cannot confirm whether the logspam is 100% related to the pod network problems or whether some combination of conditions is involved.
  3. I've tested the etcd restart with calico 3.15.3 and there is no logspam, just a few error lines (10-15) until etcd is back online.


ravilr commented Oct 21, 2020

This is a severe regression because:

  • calico pods never recover from the 'Watch channel error' even after the etcd servers are back online / after an etcd cluster rolling restart. Only the manual intervention of deleting the calico pods, upon which the DaemonSet/Deployment Kube controllers create new ones, fixes the issue.

  • Also, when they hit this bug, the calico pods create a lot of new watchers against the etcd server in quick succession in a never-ending loop, resulting in massive ballooning of the etcd server's memory usage. If left unattended, this could cause etcd server downtime and so trigger a cascading failure.

@shcherbak

The same issue with etcd 3.4.13 and calico 3.16.4.


hjkatz commented Oct 28, 2020

We are also experiencing this same issue.

Versions:

  • kubernetes: 3.16.13
  • calico: 3.16.3
  • etcd: 3.4.13

We see graphs very similar to this reported issue in etcd: etcd-io/etcd#8387

They suggest that there are some client-side metrics for monitoring the unexpected creation of additional watchers.

Steps to reproduce errors:

  1. Upgrade calico-node and calico-controller to 3.16.3
  2. Restart etcd servers in succession (or otherwise create downtime for etcd)
  3. Observe the problem in the calico-node logs
  4. Restart calico-node Pods and the issue goes away


carsten-langer commented Oct 29, 2020

We have the same symptoms with Calico 3.16.1 and etcd 3.4.3 and observe the infinite loops both in calico-kube-controllers pod and calico-node pods.

Similar to what @ravilr writes, for us Calico never recovers after etcd is up again. The only remedy is to restart both the calico-kube-controllers Deployment and the calico-node DaemonSet.

Our impact is:

  • A huge memory increase in etcd and thus spikes in etcd latency, potentially due to garbage collection in etcd, even leading to k8s-apiserver restarts due to timeouts on lease renewal towards etcd.
  • A huge increase in CPU load on etcd, proportional/linear to how many calico-node pods are in this infinite-loop situation. This is clearly visible: if you do a rolling restart of the calico-node pods, the CPU load of etcd goes down again in proportion to the number of calico-node pods that have been restarted and are thus no longer in the infinite loop. To me this proves that calico is really sending all of these requests to etcd, and thus is still connected to etcd.
  • Similar to what @Wiston999 writes, some normal pods, when created during such an infinite-loop scenario, have network connection problems right after they are created, as if Calico were not able to set up the network correctly due to the overload in the watcher/etcd area. Such a pod's network will then not come up until we delete the pod and try again; then it is again a question of probability whether the network comes up.

@robbrockbank (Contributor)

I believe @ravilr has correctly pointed to a PR containing a bug in some of the watcher error handling. In particular, errors caused by revisions that are obsolete due to compaction do not trigger a full re-sync, and therefore we keep trying to watch from the same bad revision.

I have put up a fix in libcalico-go. https://github.com/projectcalico/libcalico-go/pull/1337/files

Will push to get this reviewed and merged. Watch this space.
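
For reference, here is a minimal sketch of the recovery pattern that the fix describes, again assuming the etcd 3.4 Go client (go.etcd.io/etcd/clientv3); this is illustrative only, not the libcalico-go implementation, and the function name is hypothetical. On a compaction error the client re-lists the prefix to obtain a fresh revision (a full re-sync) and resumes watching from that revision, instead of retrying the compacted one.

package etcdwatchsketch

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// watchWithResync performs a full re-sync (re-list) whenever the watched
// revision has been compacted away, then resumes watching from the revision
// of the fresh snapshot.
func watchWithResync(ctx context.Context, cli *clientv3.Client, prefix string) {
	var rev int64 // 0 means "list first to establish a revision"
	for ctx.Err() == nil {
		if rev == 0 {
			resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
			if err != nil {
				log.Printf("list failed, retrying: %v", err)
				time.Sleep(time.Second)
				continue
			}
			// process resp.Kvs as the current state here
			rev = resp.Header.Revision + 1
		}
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev))
		for wresp := range wch {
			if wresp.CompactRevision != 0 {
				// The requested revision was compacted: force a full
				// re-sync rather than re-watching the same revision.
				log.Printf("watch compacted at rev %d, re-listing", wresp.CompactRevision)
				rev = 0
				break
			}
			if err := wresp.Err(); err != nil {
				log.Printf("watch error, recreating watcher: %v", err)
				break
			}
			for _, ev := range wresp.Events {
				rev = ev.Kv.ModRevision + 1
				_ = ev // handle the event here
			}
		}
	}
}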
