Node 'changes IP' and causes loss of connectivity #5090

Closed
Timbus opened this issue Nov 16, 2021 · 6 comments · Fixed by projectcalico/node#1242

Timbus commented Nov 16, 2021

Current Behavior / Issue:

Using AWS EKS + Calico + Istio. Calico is humming along with no issues until something changes(?) the node IP on one of our workers, which causes Calico to reconfigure:

2021-11-16 04:12:14.544 [INFO][58] monitor-addresses/startup.go 774: Using autodetected IPv4 address on interface eth0: 10.231.134.62/24
2021-11-16 04:12:47.358 [INFO][57] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m3.1s: avg=19ms longest=61ms (resync-nat-v4)
2021-11-16 04:13:14.545 [INFO][58] monitor-addresses/startup.go 774: Using autodetected IPv4 address on interface eth0: 10.231.134.62/24
# ... oh no:
2021-11-16 04:13:16.672 [INFO][57] felix/int_dataplane.go 1039: Linux interface addrs changed. addrs=set.mapSet{"100.64.236.245":set.empty{}} ifaceName="tunSSM.236.245"
2021-11-16 04:13:16.672 [INFO][57] felix/iface_monitor.go 201: Netlink address update. addr="100.64.236.245" exists=true ifIndex=65
2021-11-16 04:13:16.672 [INFO][57] felix/int_dataplane.go 1480: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"tunSSM.236.245", Addrs:set.mapSet{"100.64.236.245":set.empty{}}}
2021-11-16 04:13:16.672 [INFO][57] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunSSM.236.245", Addrs:set.mapSet{"100.64.236.245":set.empty{}}}
2021-11-16 04:13:16.672 [INFO][57] felix/ipsets.go 126: Queueing IP set for creation family="inet" setID="this-host" setType="hash:ip"
2021-11-16 04:13:16.686 [INFO][57] felix/ipsets.go 760: Doing full IP set rewrite family="inet" numMembersInPendingReplace=5 setID="this-host"
2021-11-16 04:13:16.990 [INFO][57] felix/int_dataplane.go 1039: Linux interface addrs changed. addrs=set.mapSet{"100.64.236.245":set.empty{}, "fe80::ee1e:3c1a:1919:25ed":set.empty{}} ifaceName="tunSSM.236.245"
2021-11-16 04:13:16.990 [INFO][57] felix/iface_monitor.go 201: Netlink address update. addr="fe80::ee1e:3c1a:1919:25ed" exists=true ifIndex=65
2021-11-16 04:13:16.990 [INFO][57] felix/int_dataplane.go 1004: Linux interface state changed. ifIndex=65 ifaceName="tunSSM.236.245" state="up"
2021-11-16 04:13:16.990 [INFO][57] felix/int_dataplane.go 1480: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"tunSSM.236.245", Addrs:set.mapSet{"100.64.236.245":set.empty{}, "fe80::ee1e:3c1a:1919:25ed":set.empty{}}}
2021-11-16 04:13:16.991 [INFO][57] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunSSM.236.245", Addrs:set.mapSet{"100.64.236.245":set.empty{}, "fe80::ee1e:3c1a:1919:25ed":set.empty{}}}
2021-11-16 04:13:16.991 [INFO][57] felix/ipsets.go 126: Queueing IP set for creation family="inet" setID="this-host" setType="hash:ip"
2021-11-16 04:13:16.991 [INFO][57] felix/ipsets.go 760: Doing full IP set rewrite family="inet" numMembersInPendingReplace=5 setID="this-host"
2021-11-16 04:13:17.010 [INFO][57] felix/int_dataplane.go 1462: Received interface update msg=&intdataplane.ifaceUpdate{Name:"tunSSM.236.245", State:"up", Index:65}
2021-11-16 04:13:50.969 [INFO][57] felix/summary.go 100: Summarising 15 dataplane reconciliation loops over 1m3.6s: avg=17ms longest=52ms (resync-nat-v4)
2021-11-16 04:14:14.546 [INFO][58] monitor-addresses/startup.go 774: Using autodetected IPv4 address on interface tunSSM.236.245: 100.64.236.245/32
2021-11-16 04:14:14.546 [INFO][58] monitor-addresses/startup.go 591: Node IPv4 changed, will check for conflicts
2021-11-16 04:14:14.561 [WARNING][58] monitor-addresses/startup.go 1123: IPv4 address has changed. This could happen if there are multiple nodes with the same name. node="ip-10-231-134-62.ec2.internal" original="10.231.134.62" updated="100.64.236.245"
2021-11-16 04:14:14.581 [INFO][58] monitor-addresses/startup.go 319: Updated node IP addresses
2021-11-16 04:14:14.583 [INFO][65] confd/client.go 897: Recompute BGP peerings: HostBGPConfig(node=ip-10-231-134-62.ec2.internal; name=ip_addr_v4) updated; HostBGPConfig(node=ip-10-231-134-62.ec2.internal; name=network_v4) updated
2021-11-16 04:14:14.587 [INFO][57] felix/int_dataplane.go 1447: Received *proto.HostMetadataUpdate update from calculation graph msg=hostname:"ip-10-231-134-62.ec2.internal" ipv4_addr:"100.64.236.245" 
2021-11-16 04:14:14.588 [INFO][57] felix/ipip_mgr.go 222: All-hosts IP set out-of sync, refreshing it.
2021-11-16 04:14:14.588 [INFO][57] felix/ipsets.go 126: Queueing IP set for creation family="inet" setID="all-hosts-net" setType="hash:net"
2021-11-16 04:14:14.589 [INFO][57] felix/ipsets.go 760: Doing full IP set rewrite family="inet" numMembersInPendingReplace=4 setID="all-hosts-net"
bird: Mesh_10_231_132_23: Received: Peer de-configured
bird: Mesh_10_231_132_23: State changed to stop
2021-11-16 04:14:14.595 [INFO][65] confd/resource.go 277: Target config /etc/calico/confd/config/bird.cfg has been updated due to change in key: /calico/bgp/v1/host
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Restarting protocol Mesh_10_231_132_23
bird: Restarting protocol Mesh_10_231_133_31
bird: Mesh_10_231_133_31: Shutting down
bird: Mesh_10_231_133_31: State changed to stop
bird: Restarting protocol Mesh_10_231_134_221
bird: Mesh_10_231_134_221: Shutting down
bird: Mesh_10_231_134_221: State changed to stop
bird: Mesh_10_231_132_23: State changed to down
bird: Mesh_10_231_132_23: Initializing
bird: Mesh_10_231_132_23: Starting
bird: Mesh_10_231_132_23: State changed to start
bird: Mesh_10_231_133_31: State changed to down
bird: Mesh_10_231_133_31: Initializing
bird: Mesh_10_231_133_31: Starting
bird: Mesh_10_231_133_31: State changed to start
bird: Mesh_10_231_134_221: State changed to down
bird: Mesh_10_231_134_221: Initializing
bird: Mesh_10_231_134_221: Starting
bird: Mesh_10_231_134_221: State changed to start
bird: Reconfigured
bird: Reconfiguration requested by SIGHUP
bird: Reconfiguring
bird: device1: Reconfigured
bird: direct1: Reconfigured
bird: Reconfigured
2021-11-16 04:14:14.601 [INFO][65] confd/resource.go 277: Target config /etc/calico/confd/config/bird6.cfg has been updated due to change in key: /calico/bgp/v1/host
2021-11-16 04:14:26.525 [INFO][57] felix/watchercache.go 96: Watch channel closed by remote - recreate watcher ListRoot="/calico/resources/v3/projectcalico.org/networksets"
2021-11-16 04:14:45.445 [WARNING][57] felix/table.go 652: Detected out-of-sync inserts, marking for resync actualRuleIDs=[]string{"", "O3lYWMrLQYEMJtB5", ""} chainName="POSTROUTING" expectedRuleIDs=[]string{"O3lYWMrLQYEMJtB5", "", ""} ipVersion=0x4 table="nat"
2021-11-16 04:14:53.544 [INFO][57] felix/summary.go 100: Summarising 13 dataplane reconciliation loops over 1m2.6s: avg=25ms longest=105ms (resync-nat-v4,update-nat-v4)
2021-11-16 04:15:14.582 [INFO][58] monitor-addresses/startup.go 774: Using autodetected IPv4 address on interface tunSSM.236.245: 100.64.236.245/32

This is strange enough on its own: something thinks the node is now... a pod? That's a pod IP.
But then, about 15 minutes later, the IP changes back and:

2021-11-16 04:32:14.600 [INFO][58] monitor-addresses/startup.go 774: Using autodetected IPv4 address on interface tunSSM.236.245: 100.64.236.245/32
2021-11-16 04:32:34.738 [INFO][57] felix/summary.go 100: Summarising 12 dataplane reconciliation loops over 1m0.2s: avg=9ms longest=27ms (resync-nat-v4)
# And here we go again:
2021-11-16 04:32:44.952 [INFO][57] felix/int_dataplane.go 1004: Linux interface state changed. ifIndex=65 ifaceName="tunSSM.236.245" state="down"
2021-11-16 04:32:44.952 [INFO][57] felix/int_dataplane.go 1462: Received interface update msg=&intdataplane.ifaceUpdate{Name:"tunSSM.236.245", State:"down", Index:65}
2021-11-16 04:32:44.952 [INFO][57] felix/int_dataplane.go 1039: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="tunSSM.236.245"
2021-11-16 04:32:44.952 [INFO][57] felix/iface_monitor.go 201: Netlink address update. addr="fe80::ee1e:3c1a:1919:25ed" exists=false ifIndex=65
2021-11-16 04:32:44.952 [INFO][57] felix/int_dataplane.go 1480: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"tunSSM.236.245", Addrs:set.mapSet{}}
2021-11-16 04:32:44.952 [INFO][57] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunSSM.236.245", Addrs:set.mapSet{}}
2021-11-16 04:32:44.952 [INFO][57] felix/ipsets.go 126: Queueing IP set for creation family="inet" setID="this-host" setType="hash:ip"
2021-11-16 04:32:44.953 [INFO][57] felix/ipsets.go 760: Doing full IP set rewrite family="inet" numMembersInPendingReplace=4 setID="this-host"
2021-11-16 04:32:44.983 [INFO][57] felix/iface_monitor.go 201: Netlink address update. addr="100.64.236.245" exists=false ifIndex=65
2021-11-16 04:32:44.983 [INFO][57] felix/int_dataplane.go 1039: Linux interface addrs changed. addrs=<nil> ifaceName="tunSSM.236.245"
2021-11-16 04:32:44.983 [INFO][57] felix/int_dataplane.go 1480: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"tunSSM.236.245", Addrs:set.Set(nil)}
2021-11-16 04:32:44.983 [INFO][57] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunSSM.236.245", Addrs:set.Set(nil)}
2021-11-16 04:32:44.983 [INFO][57] felix/ipsets.go 126: Queueing IP set for creation family="inet" setID="this-host" setType="hash:ip"
2021-11-16 04:32:44.984 [INFO][57] felix/ipsets.go 760: Doing full IP set rewrite family="inet" numMembersInPendingReplace=4 setID="this-host"
bird: Mesh_10_231_134_221: Socket error: bind: Address not available
bird: Mesh_10_231_133_31: Socket error: bind: Address not available
bird: Mesh_10_231_132_23: Socket error: bind: Address not available
# --- snip ---
2021-11-16 04:33:14.601 [INFO][58] monitor-addresses/startup.go 774: Using autodetected IPv4 address on interface eth0: 10.231.134.62/24
2021-11-16 04:33:14.601 [INFO][58] monitor-addresses/startup.go 591: Node IPv4 changed, will check for conflicts
2021-11-16 04:33:14.614 [WARNING][58] monitor-addresses/startup.go 1123: IPv4 address has changed. This could happen if there are multiple nodes with the same name. node="ip-10-231-134-62.ec2.internal" original="100.64.236.245" updated="10.231.134.62"
2021-11-16 04:33:14.623 [WARNING][58] monitor-addresses/node.go 66: Operation Create is not supported on Node type
2021-11-16 04:33:14.624 [ERROR][58] monitor-addresses/startup.go 322: Unable to set node resource configuration, retrying... error=operation Create is not supported on Node(ip-10-231-134-62.ec2.internal)
2021-11-16 04:33:14.629 [WARNING][58] monitor-addresses/node.go 66: Operation Create is not supported on Node type
2021-11-16 04:33:14.629 [ERROR][58] monitor-addresses/startup.go 322: Unable to set node resource configuration, retrying... error=operation Create is not supported on Node(ip-10-231-134-62.ec2.internal)
2021-11-16 04:33:14.633 [WARNING][58] monitor-addresses/node.go 66: Operation Create is not supported on Node type
2021-11-16 04:33:14.633 [ERROR][58] monitor-addresses/startup.go 322: Unable to set node resource configuration, retrying... error=operation Create is not supported on Node(ip-10-231-134-62.ec2.internal)
bird: Mesh_10_231_132_23: Socket error: bind: Address not available
# --- snip ---

The pod will then forever give us the bird logs, endlessly unable to bind until the pod is deleted.

Impact

The worker node essentially loses all connectivity, and pods start timing out and retrying until a route works. This causes random health failures and API issues. If the broken worker node is running something critical that isn't HA, things go further south.

Expected Behavior

Calico should handle strange network changes either more gracefully or less gracefully (i.e. crash the pod).

Environment

  • Calico version: docker.io/calico/cni:v3.19.1
  • EKS with calico as the CNI

Timbus commented Nov 17, 2021

Poking around while writing this wall of text, I figured out "tunSSM" is 'ssm-tunnel', which is some janky way to get into the worker that a team member added. So that's probably the cause, since it looks like a network hack.

I'll be removing ssm-tunnel, but I believe Calico shouldn't get stuck like it does. I see someone else on Stack Overflow with the exact same issue being caused by a GitLab runner, which probably does a similar thing, so... yeah. Leaving this open.

caseydavenport (Member) commented

It sounds like something else is creating an interface that Calico is preferring? If that's the case, you can configure Calico's autodetection method to ignore that interface. There's also a WIP PR that might help with this, which allows configuring Calico to use the Kubernetes node IP rather than detecting it from the host itself: projectcalico/node#1242
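
For reference, a minimal sketch of what that could look like in the calico-node DaemonSet, assuming the standard IP_AUTODETECTION_METHOD environment variable and that the unwanted interface matches the tunSSM.* pattern seen in the logs above:

          env:
            # Skip the SSM tunnel interface when autodetecting the node IP
            # (the regex is an assumption based on the interface name in the logs).
            - name: IP_AUTODETECTION_METHOD
              value: "skip-interface=tunSSM.*"
            # Or, once the change from projectcalico/node#1242 is available,
            # use the Kubernetes node IP directly:
            # value: "kubernetes-internal-ip"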


Timbus commented Nov 19, 2021

Yes, that PR would be an excellent way to prevent future issues, but I would rather not focus so much on the cause.
I just think that the pod endlessly trying to start bird and failing should become a fatal error at some point. It's clearly in some broken state; I cut at least 20,000 lines of 'bind: Address not available' from the bottom of the logs.
The worst part is that the pod only needs to restart to fix the issue.
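
As a stopgap, deleting the affected calico-node pod so it gets recreated is the workaround; something along these lines should do it (the namespace, pod label, and node name here are assumptions based on the default manifests and the node named in the logs):

# delete the calico-node pod running on the affected worker so Kubernetes recreates it
kubectl -n kube-system delete pod -l k8s-app=calico-node \
  --field-selector spec.nodeName=ip-10-231-134-62.ec2.internal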

caseydavenport (Member) commented

@Timbus do you not have this section in your calico/node daemonset?

          livenessProbe:
            exec:
              command:
              - /bin/calico-node
              - -felix-live
              - -bird-live
            periodSeconds: 10
            initialDelaySeconds: 10
            failureThreshold: 6
            timeoutSeconds: 10

It's intended to monitor BIRD health and report not-live if BIRD is not running. This will trigger Kubernetes to restart the calico/node pod after a minute.

caseydavenport (Member) commented

Or, perhaps our bird liveness check is not fine-grained enough to spot this issue.


Timbus commented Nov 24, 2021

Yep, in the template:

Liveness:   exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness:  exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3

I'm not sure if bird is actually crashing in the container or just complaining that it cannot bind; I'm not sure how the container manages its processes. I would guess it's still alive, though the container goes unready, which is what helped me find the issue.
