Severe regression in calico-3.16 with etcdv3 backend #4109

Closed
Wiston999 opened this issue Oct 20, 2020 · 12 comments · Fixed by projectcalico/libcalico-go#1337

Comments

@Wiston999

The calico-node and calico-kube-controller pods start spamming the following log lines after a full etcd cluster restart (with etcd downtime):

2020-10-20 13:42:23.762 [ERROR][1] watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-20 13:42:23.763 [INFO][1] watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/nodes"

Expected Behavior

The same scenario was tested with calico v3.15.3, and only a small burst of error messages (about 10) is logged until etcd is back online.

Current Behavior

Log messages are repeated endlessly at a rate of 6-9 per second per pod. It is not clear whether calico reconnects to etcd successfully or remains disconnected, in which case the pods need to be restarted.

Possible Solution

If these error messages are expected, they should at least be logged at a lower log level, as they are currently logged as ERROR.

Steps to Reproduce (for bugs)

  1. Deploy calico version 3.16+ using etcd backend.
  2. Stop all etcd cluster nodes.
  3. Restart the etcd cluster.

Your Environment

  • Calico version 3.16.1 and 3.16.3
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.19.2
  • Operating System and version: Debian buster
  • ETCD version: 3.4.13

ravilr commented Oct 21, 2020

@caseydavenport @neiljerram Sorry to tag you directly, but this needs your attention.

We are also seeing this issue with multiple Calico installations (multiple Kube clusters) using:

  • calico on etcdv3 backend.
  • any Calico version >= 3.16

The symptoms we are seeing: the calico-node and calico-kube-controller pods never seem to recover from the watch channel error once they lose their connection to the etcd3 cluster backend. This is easily reproducible by triggering a rolling restart of the etcd cluster servers. After that, the calico-node and controller pods remain in a 'watch channel error' loop and never recover until the pods are deleted/recreated.

Once the calico pods go into the above state, the etcd server's etcd_debugging_mvcc_watcher_total metric keeps going up and the corresponding etcd server's Go heap memory usage starts climbing.

I have yet to trace this back by stepping down to an older version of calico (3.15 or 3.14) to see where this broken behavior was introduced, but projectcalico/libcalico-go#1247 could be where this regression was introduced. Please check.


ravilr commented Oct 21, 2020

The logs below are from a calico-node pod running version node-v3.16.1:

2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [INFO][79] felix/watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/globalnetworkpolicies"
2020-10-17 02:09:24.948 [ERROR][79] felix/watcher.go 127: Watch channel error error=etcdserver: mvcc: required revision has been compacted
2020-10-17 02:09:24.948 [INFO][79] felix/watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/globalnetworksets"

When calico-node or the singleton calico-kube-controller pod gets into this bad state, it keeps spewing these errors for watches on all paths. All of the watchers (felix/watcher, tunnel-ip-allocator/watcher, confd/watcher) log the same error in a continuous loop, generating GBs of logs in a matter of hours. Pods created by the CNI plugin are never advertised by bird or programmed by Felix, so newly created pods on such nodes end up with broken networking.


ravilr commented Oct 21, 2020

I tested the etcd cluster rolling restart with calico on v3.15.3. Both the calico-kube-controller and calico-node pods self-recover from watch channel closure/disruption without issues.

So, the regression seems to have been introduced in v3.16.0 or later. It appears to be projectcalico/libcalico-go#1247, which swallows watch errors in lib/backend/etcd3/watcher.go; this matches my observation that the number of etcd watchers created on the etcd servers keeps going up once the library gets into this state.
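
For illustration, here is a minimal Go sketch of the failure mode being described, assuming the etcd 3.4 Go client (go.etcd.io/etcd/clientv3); it is not the actual libcalico-go code and the function name is hypothetical. If the compaction error delivered on the watch channel is only logged and the watcher is recreated at the same, now-compacted revision, etcd cancels every new watch immediately, so the client loops forever creating watchers, which would explain the ever-growing watcher count on the etcd servers.

package etcdwatchsketch

import (
	"context"
	"log"

	"go.etcd.io/etcd/clientv3"
)

// brokenWatchLoop illustrates the suspected bug pattern: rev is never reset
// after a compaction error, so every recreated watch fails in the same way.
func brokenWatchLoop(ctx context.Context, cli *clientv3.Client, prefix string, rev int64) {
	for ctx.Err() == nil {
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev))
		for wresp := range wch {
			if err := wresp.Err(); err != nil {
				// e.g. "etcdserver: mvcc: required revision has been compacted"
				log.Printf("Watch channel error error=%v", err)
				continue
			}
			// handle wresp.Events in the normal case
		}
		// The server closed the channel; the watcher is recreated with the
		// same stale rev, producing the endless loop and the climbing
		// etcd_debugging_mvcc_watcher_total metric on the etcd servers.
	}
}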

@Wiston999 can you please update this issue title to 'severe regression in calico-3.16 with etcdv3 backend' ?

@Wiston999 (Author)

@Wiston999 can you please update this issue title to 'severe regression in calico-3.16 with etcdv3 backend' ?

Sure, thanks for your analysis.

Wiston999 changed the title from 'Logspam after ETCD cluster is offline' to 'Severe regression in calico-3.16 with etcdv3 backend' on Oct 21, 2020.

Sarga commented Oct 21, 2020

I have the same issue with etcd 3.4.14 and calico 3.16.3.

@nelljerram (Member)

@ravilr @Wiston999 @Sarga Please can you clarify:

  1. Is there any symptom beyond the crazy amount of logging? ("Severe regression" sounds like it would mean something more functional, as well as the crazy logging.)

  2. Is there ever recovery once the etcd cluster is fully up again? Original description says "Not sure if calico reconnects back to ETCD successfully or it remains disconnected from etcd and so the pods need to be restarted." and @ravilr 's comments seem to indicate no recovery, but I'd like to be sure.

@Wiston999 (Author)

Hi,
I can describe the following three scenarios:

  1. I have also experienced networking errors in new pods created while calico has been spamming those log messages for some hours ('No route to host'; checking with ip route, the route from the pod IP to the cali* interface on the node is missing). This is fixed when the calico pods are restarted.
  2. New pods' networking is set up properly if the pods are created just after etcd is restarted. That is, there is logspam but new pods are set up correctly, so I cannot confirm whether the logspam is 100% related to the pod network problems or whether some combination of conditions is involved.
  3. I've tested the etcd restart with calico 3.15.3 and there is no logspam, just a few error lines (10-15) until etcd is back online.


ravilr commented Oct 21, 2020

This is a severe regression because:

  • calico pods never recover from the 'Watch channel error' even after the etcd servers are back online / after an etcd cluster rolling restart. Only the manual intervention of deleting the calico pods, upon which the DaemonSet/Deployment Kube controllers create new ones, fixes the issue.

  • Also, when they hit this bug, the calico pods create a lot of new watchers against the etcd server in quick succession in a never-ending loop, resulting in massive ballooning of the etcd server's memory usage. If left unattended, this could cause etcd server downtime and so trigger a cascading failure.

@shcherbak

The same issue with etcd 3.4.13 and calico 3.16.4.


hjkatz commented Oct 28, 2020

We are also experiencing this same issue.

Versions:

  • kubernetes: 3.16.13
  • calico: 3.16.3
  • etcd: 3.4.13

We see graphs very similar to this reported issue in etcd: etcd-io/etcd#8387

They suggest that there are some client-side metrics for monitoring the unexpected creation of additional watchers.

Steps to reproduce errors:

  1. Upgrade calico-node and calico-controller to 3.16.3
  2. Restart etcd servers in succession (or otherwise create downtime for etcd)
  3. Observe the problem in the calico-node logs
  4. Restart calico-node Pods and the issue goes away


carsten-langer commented Oct 29, 2020

We have the same symptoms with Calico 3.16.1 and etcd 3.4.3 and observe the infinite loops both in calico-kube-controllers pod and calico-node pods.

Similar to what @ravilr writes, for us Calico never recovers after etcd is up again. The only remedy is to restart both the calico-kube-controllers Deployment and the calico-node DaemonSet.

Our impact is:

  • A huge memory increase in etcd and thus spikes in etcd latency, potentially due to garbage collection in etcd, even leading to k8s-apiserver restarts due to timeouts on lease renewal towards etcd.
  • A huge increase in CPU load on etcd, proportional/linear to how many calico-node pods are in this infinite-loop situation. This is clearly visible: if you do a rolling restart of the calico-node pods, the CPU load of etcd goes down again in proportion to the number of calico-node pods that have been restarted and are thus no longer in the infinite loop. To me this proves that calico is really sending all of these requests to etcd, and thus is still connected to etcd.
  • Similar to what @Wiston999 writes, some normal pods, when created during such an infinite-loop scenario, have network connection problems right after they are created, as if Calico were not able to set up the network correctly due to the overload in the watcher/etcd area. Such a pod's network will then not come up until we delete the pod and try again; then it is again a question of probability whether the network comes up.

@robbrockbank (Contributor)

I believe @ravilr has correctly pointed to a PR containing a bug in some of the watcher error handling. In particular, errors caused by revisions that are obsolete due to compaction do not trigger a full re-sync, and therefore we keep trying to watch from the same bad revision.

I have put up a fix in libcalico-go. https://github.com/projectcalico/libcalico-go/pull/1337/files

Will push to get this reviewed and merged. Watch this space.
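
For reference, here is a minimal sketch of the recovery pattern that the fix describes, again assuming the etcd 3.4 Go client (go.etcd.io/etcd/clientv3); this is illustrative only, not the libcalico-go implementation, and the function name is hypothetical. On a compaction error the client re-lists the prefix to obtain a fresh revision (a full re-sync) and resumes watching from that revision, instead of retrying the compacted one.

package etcdwatchsketch

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// watchWithResync performs a full re-sync (re-list) whenever the watched
// revision has been compacted away, then resumes watching from the revision
// of the fresh snapshot.
func watchWithResync(ctx context.Context, cli *clientv3.Client, prefix string) {
	var rev int64 // 0 means "list first to establish a revision"
	for ctx.Err() == nil {
		if rev == 0 {
			resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
			if err != nil {
				log.Printf("list failed, retrying: %v", err)
				time.Sleep(time.Second)
				continue
			}
			// process resp.Kvs as the current state here
			rev = resp.Header.Revision + 1
		}
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev))
		for wresp := range wch {
			if wresp.CompactRevision != 0 {
				// The requested revision was compacted: force a full
				// re-sync rather than re-watching the same revision.
				log.Printf("watch compacted at rev %d, re-listing", wresp.CompactRevision)
				rev = 0
				break
			}
			if err := wresp.Err(); err != nil {
				log.Printf("watch error, recreating watcher: %v", err)
				break
			}
			for _, ev := range wresp.Events {
				rev = ev.Kv.ModRevision + 1
				_ = ev // handle the event here
			}
		}
	}
}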
