[BUG] Nodes become unreachable when using Cilium #3531
Comments
The same happened to me in multiple configurations, both with the preview Cilium dataplane and with BYOCNI.
@phealy @chasewilson can you please take a look?
This issue is not present in AKS 1.23, which is being deprecated at the end of this month; because the issue does occur on newer versions, we are unable to upgrade. If this issue is not resolved promptly, we kindly ask that the deprecation timeline for AKS 1.23 be extended. Thank you!
The issue is caused by the latest systemd update. The current node image ships an older version; after the node comes up, it is patched by the unattended upgrades that are enabled in Ubuntu. During installation of that update, the node goes into the NotReady state. It would be great if MS could provide a new node image that already includes the latest systemd package (version 249.11-0ubuntu3.7). What helps is to restart, not re-image, the node. Of course, all nodes added by the cluster autoscaler will also run into this issue. The real solution would be to use the new OS upgrade feature, but that is still in public preview.
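Not from the original comment, but a quick way to check where a given node stands: the package name and target version come from the comment above, and the commands are standard Ubuntu tooling run on the node (e.g. over SSH).

```bash
# Run on the affected node (e.g. over SSH).
systemctl --version | head -n 1     # running systemd version
apt-cache policy systemd            # installed vs. candidate package version

# If the installed version is older than 249.11-0ubuntu3.7, unattended-upgrades
# will eventually pull in the newer systemd and restart systemd-networkd.
```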
We're experiencing the exact same problem: AKS 1.25.4 with node pool image AKSUbuntu-2204gen2containerd-2023.02.15 and Cilium 1.12.3.
One can reproduce the issue deterministically, without waiting for an indefinite period: create a one-node AKS cluster with a recent version (I used 1.25.4) and the Cilium dataplane, SSH into the node, and trigger a restart of systemd-networkd (the same restart that unattended-upgrades performs). A simple temporary fix is to configure systemd-networkd so that it stops removing routes it did not create. PS: the DNS resolution issue (tracking ID 2TWN-VT0) that affected a large number of Azure VMs and AKS clusters using Ubuntu last August was also caused by a faulty systemd upgrade.
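The commenter's exact commands did not survive in this thread, so the following is only a reconstruction based on the explanation later in the thread (Cilium installs routes with proto static, and systemd-networkd removes them when it restarts); the commands are ordinary iproute2/systemctl calls, not a quote from the comment.

```bash
# Before: list the routes Cilium installed (proto static).
ip route show proto static

# Trigger the same restart that unattended-upgrades performs when it
# installs the new systemd package.
sudo systemctl restart systemd-networkd

# After: the static routes are gone and the node loses pod/network
# connectivity, eventually going NotReady.
ip route show proto static
```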
@phealy @chasewilson we would be grateful if you could share an update, as this is impacting our ability to upgrade from 1.23. Thank you!
Tagging @wedaly on this and raising it internally - we'll look at this right away.
@aanandr, @phealy would you be able to assist?

Issue Details

To Reproduce:

```bash
az group create --name test-cilium-debug --location westeurope

az aks create -n aks-ciliumdebug-westeu -g test-cilium-debug -l westeurope \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr 192.168.0.0/16 \
  --enable-cilium-dataplane \
  --node-vm-size Standard_D4plds_v5 \
  --node-count 1
```
OK, this is coming from a known issue with Cilium and systemd - Cilium adds routes with proto static, and systemd 249 (in Ubuntu 22.04) has a setting that is on by default under which networkd thinks it owns all routing on the system and will remove any routes it did not place whenever the package restarts. The easiest temporary fix would be to use the NodeOSUpgrade preview feature to disable unattended-upgrade by setting the nodes to "none", "securitypatch", or "nodeimage" (really, anything other than "unmanaged"). This prevents systemd-networkd from restarting and removing the routes when unattended-upgrade runs.
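A sketch of what that mitigation could look like with the Azure CLI, assuming the preview-era names (the aks-preview extension, the NodeOsUpgradeChannelPreview feature flag, and the --node-os-upgrade-channel parameter on az aks update); since this was in public preview at the time, verify the current names against the AKS docs before running it.

```bash
# Preview-era setup; names may have changed since, verify before use.
az extension add --name aks-preview
az feature register --namespace Microsoft.ContainerService \
  --name NodeOsUpgradeChannelPreview
az provider register --namespace Microsoft.ContainerService

# Move the node OS upgrade channel off "Unmanaged" so unattended-upgrade no
# longer restarts systemd-networkd underneath Cilium.
az aks update \
  --resource-group test-cilium-debug \
  --name aks-ciliumdebug-westeu \
  --node-os-upgrade-channel NodeImage    # or SecurityPatch / None
```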
Thank you @phealy! We were able to use that PR to solve our specific problem with Cilium on AKS. This issue, too, should go away once that fix lands. Edit (May 13): this is the correct PR to use for a private build of Cilium: cilium/cilium#25350
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
Cilium changes that fix this issue have been merged into their main branch and will almost certainly be included in the upcoming 1.14.0 release. Reading the discussion in the Cilium PR thread linked above, the backport of these changes to 1.13.x is likely to take a while longer, because upgrade scenarios require additional testing and possibly more coding, and a 1.12.x backport is unlikely to be attempted except perhaps as a community PR.
For Azure CNI Powered by Cilium, we added an init container to the Cilium daemonset that configures systemd-networkd with the mitigation suggested in cilium/cilium#18706 (comment). This restores the behavior systemd-networkd had before the unattended update that caused this issue.
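For illustration only, a minimal sketch of the kind of systemd-networkd configuration that mitigation applies, based on the options discussed in the linked Cilium issue (ManageForeignRoutes and ManageForeignRoutingPolicyRules); the file path and name here are assumptions, not necessarily what the AKS init container writes.

```bash
# On the node (an init container with the host filesystem mounted would do the
# equivalent): tell networkd to leave routes and routing policy rules it did
# not create alone, then restart it so the drop-in takes effect.
sudo mkdir -p /etc/systemd/networkd.conf.d
sudo tee /etc/systemd/networkd.conf.d/99-keep-foreign-routes.conf >/dev/null <<'EOF'
[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no
EOF
sudo systemctl restart systemd-networkd
```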
Describe the bug
I set up an AKS cluster with Azure CNI + Cilium and one Standard_D4plds_v5 (ARM) node. After a few hours, or up to a few days, the node becomes unreachable. It shows "not ready" in the Portal and can't be reached over the network. I can't get any logs out of it either.
To Reproduce
Create the cluster (using the az group create and az aks create commands quoted earlier in the thread):
Wait at least a few minutes, up to a few days. Most of the time it takes a few hours but sometimes more or less.
Check the node status, it shows "not ready" and it's no longer possible to connect to the node (pings, ssh, etc.).
So far this has been 100% reproducible for me, after trying ~5 times to rule out other factors. Sometimes it just takes longer, but the node always fails eventually.
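Not part of the original report; a rough sketch of how one might watch for the failure using standard kubectl and ssh tooling, with <node-ip> as a placeholder and azureuser as the assumed default AKS admin user.

```bash
# Watch node readiness; the affected node flips to NotReady once the
# unattended systemd upgrade restarts systemd-networkd.
kubectl get nodes -o wide --watch

# Once NotReady, the node also stops answering on the network.
ping -c 3 <node-ip>           # placeholder IP
ssh azureuser@<node-ip>       # assumed default admin user; connection times out
```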
Expected behavior
The node should work normally.
Environment (please complete the following information):