Severe regression in calico-3.16 with etcdv3 backend #4109
Comments
@caseydavenport @neiljerram Sorry to tag you directly, but this needs your attention. We are also seeing this issue with multiple Calico installations (multiple Kube clusters) using the etcdv3 backend.
Symptoms we are seeing: calico-node and calico-kube-controller pods never seem to recover from the watch channel closure/disruption. Once the calico pods go into the above state, the etcd server's watcher count keeps going up. I have yet to trace this back by stepping down to an older version of calico (3.15 or 3.14) to see where this broken behavior got introduced, but projectcalico/libcalico-go#1247 could be where this regression was introduced.
Below logs are from a calico-node pod running node v3.16.1:
When the calico-node or the singleton calico-kube-controller pod gets into this bad state, it keeps spewing these errors for watches on all paths. All watchers (felix/watcher, tunnel-ip-allocator/watcher, confd/watcher) log the same error in a continuous loop, generating GBs of logs in a matter of hours. The pods created by the CNI plugin are never advertised by bird or programmed by Felix, so newly created pods on such nodes end up with broken networking.
I tested an etcd cluster rolling restart with calico v3.15.3. Both the calico-kube-controller and calico-node pods self-recover from the watch channel closure/disruption without issues. So the regression seems to have been introduced in >= v3.16.0, and it appears to be projectcalico/libcalico-go#1247, which is eating up watch errors in lib/backend/etcd3/watcher.go. That matches my observation that the number of etcd watchers created on the etcd servers keeps going up once the lib gets into this state. @Wiston999 can you please update this issue title to 'severe regression in calico-3.16 with etcdv3 backend'?
Sure, thanks for your analysis.
I have the same issue with etcd 3.4.14 and calico 3.16.3.
@ravilr @Wiston999 @Sarga Please can you clarify:
Hi,
This is a severe regression because:
The same issue with etcd 3.4.13 and calico 3.16.4.
We are also experiencing this same issue. Versions:
We see graphs very similar to this reported issue in etcd: etcd-io/etcd#8387. They suggest that there are some client-side metrics for monitoring the unexpected creation of additional watchers. Steps to reproduce errors:
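A minimal sketch of that watcher-count monitoring idea, assuming an etcd member whose /metrics endpoint is reachable over plain HTTP on 127.0.0.1:2379 and an etcd version that exposes the etcd_debugging_mvcc_* gauges; the program and endpoint details are illustrative, not from this report:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

// Scrape one etcd member's /metrics endpoint and print the watcher-related
// gauges, to check whether the server-side watcher count keeps climbing while
// calico is stuck in the error loop.
func main() {
	// Assumption: plain HTTP on localhost; adjust the URL (and add TLS client
	// certs) to match your etcd deployment.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		log.Fatalf("failed to scrape etcd metrics: %v", err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// etcd_debugging_mvcc_watcher_total: total number of watchers.
		// etcd_debugging_mvcc_watch_stream_total: total number of watch streams.
		if strings.HasPrefix(line, "etcd_debugging_mvcc_watcher_total") ||
			strings.HasPrefix(line, "etcd_debugging_mvcc_watch_stream_total") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatalf("error reading metrics: %v", err)
	}
}
```

Sampling these gauges over time during an etcd rolling restart should show whether stale watchers are being leaked rather than cleaned up.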
We have the same symptoms with Calico 3.16.1 and etcd 3.4.3 and observe the infinite loops in both the calico-kube-controllers pod and the calico-node pods. Similar to what @ravilr writes, for us Calico never recovers after etcd is up again. The only remedy is to restart both the calico-kube-controllers deploy and the calico-node daemonset. Our impact is:
I believe @ravilr has correctly pointed to a PR containing a bug in some of the watcher error handling. In particular, errors caused by revisions that are obsolete due to compaction will not trigger a full re-sync, and therefore we'll keep trying to watch from the same bad revision. I have put up a fix in libcalico-go: https://github.com/projectcalico/libcalico-go/pull/1337/files. Will push to get this reviewed and merged. Watch this space.
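To illustrate the kind of recovery behaviour described above, here is a minimal sketch; it is not libcalico-go's actual watcher, the package name and watchWithResync helper are hypothetical, and the import path assumes the etcd v3.5 Go client. When a watch fails because its resume revision has been compacted away, the loop falls back to a full re-list at the current revision instead of re-watching from the same bad revision:

```go
package etcdwatch

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchWithResync (hypothetical helper) watches a key prefix and, on a
// compaction error, performs a full re-list to obtain a valid revision
// before re-establishing the watch.
func watchWithResync(ctx context.Context, cli *clientv3.Client, prefix string) {
	var rev int64 // 0 means "do a fresh List before watching"
	for ctx.Err() == nil {
		if rev == 0 {
			// Full re-sync: list the prefix to learn the current store revision.
			// (A real implementation would also reconcile the listed keys here.)
			resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
			if err != nil {
				log.Printf("list failed, retrying: %v", err)
				time.Sleep(time.Second)
				continue
			}
			rev = resp.Header.Revision
		}

		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev+1))
		for wresp := range wch {
			if err := wresp.Err(); err != nil {
				if wresp.CompactRevision != 0 {
					// Our resume revision was compacted away: force a full
					// re-list instead of re-watching from the same bad revision.
					rev = 0
				}
				log.Printf("watch error: %v", err)
				break
			}
			for _, ev := range wresp.Events {
				rev = ev.Kv.ModRevision // remember where to resume from
			}
		}
		// Watch channel closed (e.g. during an etcd restart); the outer loop
		// re-establishes the watch from the last known-good revision.
	}
}
```

The key point matching the comment above: once the server reports a compacted revision, re-watching from that revision can never succeed, so a fresh Get is the only way to re-establish a valid baseline.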
calico-node and calico-kube-controller pods start spamming the following log lines after a full etcd cluster restart (with etcd downtime):

Expected Behavior
Tested the same scenario with calico v3.15.3: a small burst of error messages (about 10) is logged until etcd is back online.
Current Behavior
Log messages are repeated endlessly at a rate of 6-9 per second per pod. It is not clear whether calico reconnects to etcd successfully or remains disconnected, in which case the pods need to be restarted.
Possible Solution
If these error messages are expected, they should at least be logged at a lower log level; they are currently logged as ERROR.
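A minimal sketch of that suggestion, assuming logrus (which calico uses for logging); the message text and call site are hypothetical:

```go
package main

import (
	"errors"

	"github.com/sirupsen/logrus"
)

func main() {
	// If watch-channel closures are expected while etcd is down, log them at
	// Warn (or Debug) rather than Error so they do not flood the logs.
	err := errors.New("watch channel closed by remote")
	logrus.WithError(err).Warn("etcd watch interrupted; will reconnect")
}
```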
Steps to Reproduce (for bugs)
Context
Your Environment