[BUG] Nodes become unreachable when using Cilium #3531

maybedino · 2023-03-13T12:11:13Z

Describe the bug
I set up an AKS cluster with Azure CNI + Cilium, and 1 Standard_D4plds_v5 (ARM) node. After a few hours or up to a few days, the node becomes unreachable. It shows "not ready" in the Portal, and can't be reached over network. I can't get any logs out of it either.

To Reproduce
Create the cluster:

az group create --name test-cilium-debug --location westeurope

az aks create -n aks-ciliumdebug-westeu -g test-cilium-debug -l westeurope \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr 192.168.0.0/16 \
  --enable-cilium-dataplane \
  --node-vm-size Standard_D4plds_v5 \
  --node-count 1

Wait at least a few minutes, up to a few days. Most of the time it takes a few hours but sometimes more or less.

Check the node status, it shows "not ready" and it's no longer possible to connect to the node (pings, ssh, etc.).

So far this has been 100% reproducible for me, after trying ~5 times to to rule out other factors. Sometimes it just takes longer but the node will always fail eventually.

Expected behavior
The node should work normally.

Environment (please complete the following information):

CLI Version - 2.45.0
Kubernetes version - Tested both 1.24.x (default) and 1.25.x
CLI Extension version - AKS preview 0.5.129

rafaribe · 2023-03-13T23:31:44Z

Same happened to me in multiple configurations, either with Preview Cilium Dataplane and with BYOCNI.
Logs and kubectl describe point only to Kubelet problems, nothing with Cilium, one of the triggers to make this happen is to create a cluster on 1.25 and then upgrade to 1.26.0 (preview).

sabbour · 2023-03-14T02:23:33Z

@phealy @chasewilson can you please take a look?

mark-angelo · 2023-03-14T08:02:25Z

This issue is not present in AKS 1.23, which is being deprecated at the end of this month. We are therefore unable to upgrade to the new AKS version(s).

If this issue is not resolved promptly, we kindly ask that the deprecation timeline for AKS 1.23 be extended. Thank you!

twendt · 2023-03-14T19:33:28Z

The issue is caused by the latest systemd update. The latest node image has an older version installed. After the node comes up, it will be patched by the unattended updates that is enabled in Ubuntu. During the installation of the update, the node goes into NotReady state.
Unfortunately the PR for the new node image also includes the 249.11-0ubuntu3.6 version of systemd.

It would be great if MS could provide a new node image that already includes the latest systemd package with the 249.11-0ubuntu3.7 version.

What helps is to restart, not re-image, the node. Of course all nodes that will be added by the cluster autoscaler will also run into this issue.

The solution would be to use the new OS upgrade feature, but that is still in Public Preview.

Davee02 · 2023-03-15T09:45:05Z

We're experiencing the exact same problem, AKS 1.25.4 with node pool image AKSUbuntu-2204gen2containerd-2023.02.15 and Cilium 1.12.3

atykhyy · 2023-03-15T13:44:22Z

One can reproduce the issue deterministically, without waiting for an indefinite period. Create a one-node AKS cluster with a recent version (I used 1.25.4) and Cilium dataplane, SSH into the node, and run DEBIAN_FRONTEND=noninteractive sudo apt upgrade systemd -y. In my case this upgrades systemd from 249.11-0ubuntu3.6 to 249.11-0ubuntu3.7. At some point in the upgrade process the SSH connection freezes and the node transitions to Not Ready state. Manually rebooting a node instance stuck in Not Ready state cures it, so the problem appears to be with the upgrade process rather than with the new systemd itself.

A simple temporary fix for the issue is to run sudo apt-mark hold systemd on every fresh node. Another simple workaround is to create a daemonset which will pin the systemd package on every node.

PS: the DNS resolution issue (tracking ID 2TWN-VT0) which affected a large number of Azure VMs and AKS clusters using Ubuntu last August was also caused by a faulty systemd upgrade.

mark-angelo · 2023-03-15T21:46:37Z

@phealy @chasewilson we would be grateful if you could share an update as this is impacting our ability to upgrade from 1.23. Thank you!

phealy · 2023-03-16T13:52:55Z

Tagging @wedaly on this and raising it internally - we'll look at this right away.

ghost · 2023-03-16T13:53:15Z

@aanandr, @phealy would you be able to assist?

Issue Details

Describe the bug
I set up an AKS cluster with Azure CNI + Cilium, and 1 Standard_D4plds_v5 (ARM) node. After a few hours or up to a few days, the node becomes unreachable. It shows "not ready" in the Portal, and can't be reached over network. I can't get any logs out of it either.

To Reproduce
Create the cluster:

az group create --name test-cilium-debug --location westeurope

az aks create -n aks-ciliumdebug-westeu -g test-cilium-debug -l westeurope \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr 192.168.0.0/16 \
  --enable-cilium-dataplane \
  --node-vm-size Standard_D4plds_v5 \
  --node-count 1

Wait at least a few minutes, up to a few days. Most of the time it takes a few hours but sometimes more or less.

Check the node status, it shows "not ready" and it's no longer possible to connect to the node (pings, ssh, etc.).

So far this has been 100% reproducible for me, after trying ~5 times to to rule out other factors. Sometimes it just takes longer but the node will always fail eventually.

Expected behavior
The node should work normally.

Environment (please complete the following information):

CLI Version - 2.45.0
Kubernetes version - Tested both 1.24.x (default) and 1.25.x
CLI Extension version - AKS preview 0.5.129

Author:	maybedino
Assignees:	-
Labels:	`bug`, `networking/azcni`
Milestone:	-

phealy · 2023-03-16T15:20:00Z

OK, this is coming from a known issue with Cilium and Systemd - Cilium adds routes with proto static, and systemd 249 (in Ubuntu 22.04) has a setting that's on by default - networkd thinks it owns all routing on the system, and will remove any routes that aren't placed by it on a package restart.

The easiest temporary fix would be to use the NodeOSUpgrade preview feature to disable unattended-upgrade by setting the nodes to "none", "securitypatch", or "nodeimage" (really, anything other than "unmanaged"). This will prevent systemd-networkd restarting and removing the routes when unattended-upgrade runs.

atykhyy · 2023-03-28T09:43:38Z

Thank you @phealy! We were able to use that PR to solve our specific problem with Cilium on AKS. This issue, too, should go away once --enable-cilium-dataplane switches to the upcoming 1.13.2 planned to include that PR.

May 13: this is the correct PR to use for a private build of Cilium: cilium/cilium#25350

ghost · 2023-04-27T16:00:49Z

Action required from @Azure/aks-pm

ghost · 2023-05-12T18:00:47Z

Issue needing attention of @Azure/aks-leads

ghost · 2023-05-28T12:00:50Z

Issue needing attention of @Azure/aks-leads

atykhyy · 2023-05-31T12:34:04Z

Cilium changes which fix this issue have been merged into their main branch and will almost certainly be included in their soon-to-come 1.14.0 release. Reading the discussion in the Cilium PR thread linked above, the backport of these changes into 1.13.x is likely to take a while longer because upgrade scenarios require additional testing and possibly coding, and 1.12.x is unlikely to be attempted unless perhaps as a community PR. --enable-cilium-dataplane currently uses 1.12.8 (I've just checked this).

wedaly · 2023-05-31T14:30:30Z

1.12.x is unlikely to be attempted unless perhaps as a community PR. --enable-cilium-dataplane currently uses 1.12.8 (I've just checked this).

For Azure CNI powered by Cilium we added an init container to the Cilium daemonset to configure systemd-networkd with the mitigation suggested in cilium/cilium#18706 (comment)

[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

This restores the behavior of systemd-networkd before the unattended update that caused this issue.

maybedino added the bug label Mar 13, 2023

phealy added the networking/azcni label Mar 16, 2023

ghost added the action-required label Apr 22, 2023

ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Apr 27, 2023

ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels May 31, 2023

wedaly closed this as completed May 31, 2023

rbrtbnfgl mentioned this issue Jun 16, 2023

install script breaks networking when using cilium with systemd >= 249 k3s-io/k3s#7736

Closed

ghost locked as resolved and limited conversation to collaborators Jun 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Nodes become unreachable when using Cilium #3531

[BUG] Nodes become unreachable when using Cilium #3531

maybedino commented Mar 13, 2023

rafaribe commented Mar 13, 2023 •

edited

Loading

sabbour commented Mar 14, 2023

mark-angelo commented Mar 14, 2023 •

edited

Loading

twendt commented Mar 14, 2023

Davee02 commented Mar 15, 2023

atykhyy commented Mar 15, 2023 •

edited

Loading

mark-angelo commented Mar 15, 2023

phealy commented Mar 16, 2023

ghost commented Mar 16, 2023

phealy commented Mar 16, 2023

atykhyy commented Mar 28, 2023 •

edited

Loading

ghost commented Apr 27, 2023

ghost commented May 12, 2023

ghost commented May 28, 2023

atykhyy commented May 31, 2023

wedaly commented May 31, 2023

[BUG] Nodes become unreachable when using Cilium #3531

[BUG] Nodes become unreachable when using Cilium #3531

Comments

maybedino commented Mar 13, 2023

rafaribe commented Mar 13, 2023 • edited Loading

sabbour commented Mar 14, 2023

mark-angelo commented Mar 14, 2023 • edited Loading

twendt commented Mar 14, 2023

Davee02 commented Mar 15, 2023

atykhyy commented Mar 15, 2023 • edited Loading

mark-angelo commented Mar 15, 2023

phealy commented Mar 16, 2023

ghost commented Mar 16, 2023

phealy commented Mar 16, 2023

atykhyy commented Mar 28, 2023 • edited Loading

ghost commented Apr 27, 2023

ghost commented May 12, 2023

ghost commented May 28, 2023

atykhyy commented May 31, 2023

wedaly commented May 31, 2023

rafaribe commented Mar 13, 2023 •

edited

Loading

mark-angelo commented Mar 14, 2023 •

edited

Loading

atykhyy commented Mar 15, 2023 •

edited

Loading

atykhyy commented Mar 28, 2023 •

edited

Loading