Node 'changes IP' and causes loss of connectivity #5090
Comments
Poking around while writing this wall of text, I figured out that "tunSSM" is 'ssm-tunnel', a janky way of getting into the worker that a team member added. That's probably the trigger, since it looks like a network hack. I'll be removing ssm-tunnel, but I believe Calico shouldn't get stuck the way it does. I see someone else on Stack Overflow with the exact same issue caused by a GitLab runner, which probably does something similar. Leaving this open.
It sounds like something else is creating an interface that Calico is preferring? If that's the case, you can configure Calico's autodetection method to ignore that interface. There's also a WIP PR that might help with this, which allows configuring Calico to use the Kubernetes node IP rather than detecting it from the host itself: projectcalico/node#1242
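For reference, a sketch of what that autodetection override typically looks like in the calico-node DaemonSet; the skip-interface regex below just matches the tunSSM interface mentioned above and would need adjusting for other setups:

```yaml
# Env on the calico-node container -- a sketch, not an exact manifest.
# "skip-interface" tells Calico's IP autodetection to ignore any interface
# matching the regex ("tunSSM.*" is the ssm-tunnel interface from this thread).
- name: IP_AUTODETECTION_METHOD
  value: "skip-interface=tunSSM.*"
# Newer Calico releases (once projectcalico/node#1242 landed) can instead take
# the address straight from the Kubernetes node object:
# - name: IP_AUTODETECTION_METHOD
#   value: "kubernetes-internal-ip"
```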
Yes, that PR would be an excellent way to prevent future issues, but I would rather not focus so much on the cause.
@Timbus do you not have this section in your calico/node daemonset?
It's intended to monitor BIRD health and report not-live if BIRD is not running. This will trigger Kubernetes to restart the calico/node pod after a minute.
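For context, the BIRD liveness check in recent calico-node manifests looks roughly like the sketch below; the exact flags and thresholds vary by Calico version:

```yaml
# Rough shape of the calico-node liveness probe (version-dependent).
livenessProbe:
  exec:
    command:
      - /bin/calico-node
      - -felix-live
      - -bird-live       # reports not-live when BIRD is not running
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 6    # ~6 x 10s, i.e. the "restart after a minute" above
```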
Or, perhaps our bird liveness check is not fine-grained enough to spot this issue. |
Yep, in the template:
I'm not sure whether BIRD is actually crashing in the container or just complaining that it cannot bind; I'm not sure how the container manages its processes. I would guess it's still alive, though the container going unready is what helped me find the issue.
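That would match the readiness side of the manifest, which also checks BIRD. Again only a sketch, and version-dependent:

```yaml
# The readiness probe flags BIRD problems too, which would explain a pod that
# shows unready while the liveness check never restarts it.
readinessProbe:
  exec:
    command:
      - /bin/calico-node
      - -felix-ready
      - -bird-ready
  periodSeconds: 10
```

If BIRD is still running but cannot bind, it plausibly passes -bird-live while failing -bird-ready, which would fit the earlier guess that the liveness check is not fine-grained enough to catch this.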
Current Behavior / Issue:
Using AWS EKS + Calico + Istio. Calico is humming along with no issues until something changes(?) the node IP on one of our workers, which causes Calico to reconfigure:
This is strange enough on its own: something thinks the node is now... a pod? That's a pod IP.
But then about 15 minutes later, the IP changes back and:
The pod will then give us those BIRD logs forever, endlessly unable to bind until the pod is deleted.
Impact
The worker node essentially loses all connectivity, causing pods to start timing out and retrying until a route works. This causes random health failures and API issues. If the broken worker node is running something critical that isn't HA, things go further south.
Expected Behavior
Calico should handle strange network changes either more gracefully, or less gracefully (i.e. crash the pod).
Environment