Calico role fails migration from ipip to vxlan mode #8691

Closed
ledroide opened this issue Apr 6, 2022 · 13 comments · Fixed by #8707
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ledroide
Contributor

ledroide commented Apr 6, 2022

Effects

Networking inside Kubernetes (pods, services, etc.) no longer works after upgrading to Kubernetes 1.23.5 with kubespray at commit id 0481dd9.

Symptoms

Summary:

  • Many IP addresses from Services and pods are unreachable between nodes.
  • Restarting DaemonSet/calico-node does not solve it.
  • Restarting the kubelet on all nodes does not solve it.
  • Rebooting the nodes has no effect.
  • TCP port 179 answers normally to other nodes while the Calico process is running.
  • Calico can't see the other nodes:
# sudo /usr/local/bin/calicoctl node status
Calico process is running.
None of the BGP backend processes (BIRD or GoBGP) are running.
  • logs from calico-node report that the vxlan.calico interface does not exist:
calico-node-746kz calico-node 2022-04-06 10:20:56.345 [ERROR][65] felix/route_table.go 951: Failed to get link attributes error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
calico-node-746kz calico-node 2022-04-06 10:20:56.345 [INFO][65] felix/route_table.go 558: Interface missing, will retry if it appears. ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
calico-node-wdjb9 calico-node 2022-04-06 10:20:56.348 [INFO][67] felix/route_table.go 1116: Failed to access interface because it doesn't exist. error=Link not found ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
calico-node-wdjb9 calico-node 2022-04-06 10:20:56.348 [INFO][67] felix/route_table.go 1184: Failed to get interface; it's down/gone. error=Link not found ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
calico-node-wdjb9 calico-node 2022-04-06 10:20:56.348 [ERROR][67] felix/route_table.go 951: Failed to get link attributes error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
  • There is no interface matching "^vxlan.calico$" on any of the nodes (see the check below).
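A quick way to confirm this on a node (a minimal check, assuming standard iproute2 tooling; the interface name comes from the felix logs above):

# list all vxlan interfaces; on an affected node this prints nothing
ip -d link show type vxlan
# or query the specific interface felix is waiting for
ip link show vxlan.calico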

Versions

  • Kubespray commit id 0481dd9 (previously fetched 672e47a, which was OK, so the regression appeared somewhere in 672e47a..0481dd9)
  • quay.io/calico/node:v3.21.4
  • Kubernetes 1.23.5
  • CRI-O 1.23.2

Workaround

Set ipip mode back, as documented in docs/calico.md, and run the cluster.yml playbook.

Here is my group_vars/k8s_cluster/k8s-net-calico.yaml configuration (only the last 3 lines were added):

calico_datastore: kdd
calico_node_livenessprobe_timeout: 11
calico_node_readinessprobe_timeout: 11
calico_ipip_mode: Always
calico_vxlan_mode: Never
calico_network_backend: bird
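
Then re-run the playbook; a typical invocation (the inventory path is an example, adjust it to your layout):

ansible-playbook -i inventory/mycluster/hosts.yaml -b cluster.yml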

You can check with calicoctl that calico works again:

# sudo /usr/local/bin/calicoctl node status
Calico process is running.
IPv4 BGP status
+----------------+-------------------+-------+----------+-------------+
|  PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+----------------+-------------------+-------+----------+-------------+
| 10.150.232.209 | node-to-node mesh | up    | 11:45:36 | Established |
| 172.16.64.150  | node-to-node mesh | up    | 11:46:16 | Established |
| 10.150.233.51  | node-to-node mesh | up    | 11:45:35 | Established |
| 10.150.233.52  | node-to-node mesh | up    | 11:45:34 | Established |
| 10.150.233.53  | node-to-node mesh | up    | 11:45:34 | Established |
| 10.150.233.54  | node-to-node mesh | up    | 11:45:35 | Established |
| 10.150.233.42  | node-to-node mesh | up    | 11:45:34 | Established |
| 10.150.233.43  | node-to-node mesh | up    | 11:45:38 | Established |
+----------------+-------------------+-------+----------+-------------+

Assumption

The issue arises with this change:

$ git show dd2d95e --name-only
commit dd2d95ecdf5e25db2433e7b10132844b60dbe619
Author: Cristian Calin <[email protected]>
Date:   Fri Mar 18 03:05:39 2022 +0200
    [calico] don't enable ipip encapsulation by default and use vxlan in CI (#8434)
    * [calico] make vxlan encapsulation the default
    * don't enable ipip encapsulation by default
    * set calico_network_backend by default to vxlan
    * update sample inventory and documentation
    * [CI] pin default calico parameters for upgrade tests to ensure proper upgrade
    * [CI] improve netchecker connectivity testing
    * [CI] show logs for tests
    * [calico] tweak task name
    * [CI] Don't run the provisioner from vagrant since we run it in testcases_run.sh
    * [CI] move kube-router tests to vagrant to avoid network connectivity issues during netchecker check
    * service proxy mode still fails connectivity tests so keeping it manual mode
    * [kube-router] account for containerd use-case
docs/calico.md
docs/setting-up-your-first-cluster.md
docs/vars.md
inventory/sample/group_vars/k8s_cluster/k8s-net-calico.yml
roles/kubernetes/preinstall/tasks/0020-verify-settings.yml
roles/network_plugin/calico/defaults/main.yml
roles/network_plugin/calico/tasks/check.yml
roles/network_plugin/calico/tasks/install.yml
roles/network_plugin/calico/templates/calico-config.yml.j2
roles/network_plugin/calico/templates/calico-node.yml.j2
roles/network_plugin/kube-router/templates/kube-router.yml.j2
(...)

What is expected

Moving from one default to another should come with a migration script or guide.

If vxlan mode becomes the recommended mode, then I would like to migrate.

Unfortunately, there is nothing that:

  • warns me that this change will break my cluster network layer
  • or takes care of the migration process
  • or creates the required vxlan.calico interfaces on the nodes
  • or gives instructions for this case
ledroide added the kind/bug label on Apr 6, 2022
@cristicalin
Contributor

@ledroide note that you are deploying from an unreleased, in-development branch. The release notes will contain guidelines about this breaking change in the defaults and the flags that will need to be set in your ansible inventory to ensure existing deployments are not broken.

Moving from one encapsulation to another is not quite straightforward, but it had to be done to work around issues we detected with ipip out of the box. The new vxlan default is considered a future-proof approach for the long-term sustainability of the project, and existing deployments will have to set backwards-compatible flags to retain the old behaviour.

@ledroide
Contributor Author

ledroide commented Apr 7, 2022

@cristicalin

Moving from one encapsulation to another is not quite straightforward but it had to be done to work around issues

This is basically the purpose of this issue. I'm available to test further enhancements of the calico role, until it no longer triggers a network breakdown on existing clusters. Thanks for your answer.
Serge

@cristicalin
Contributor

Currently, a defaults-to-defaults migration should fail in the validation stage if you have not configured your encapsulation parameters to match the existing environment.

TASK [network_plugin/calico : Check if inventory match current cluster configuration] ******************* 
task path: /root/kubespray/kubespray/roles/network_plugin/calico/tasks/check.yml:52                     
fatal: [kube-1]: FAILED! => {                                                                            
    "assertion": "not calico_pool_conf.spec.ipipMode is defined or calico_pool_conf.spec.ipipMode == calico_ipip_mode",
    "changed": false,                                                                                    
    "evaluated_to": false,                                                                               
    "msg": "Your inventory doesn't match the current cluster configuration"                             
}                                                                                                        
                                                                                                         
NO MORE HOSTS LEFT ************************************************************************************** 

PLAY RECAP ********************************************************************************************** 
kube-1                     : ok=1443 changed=184  unreachable=0    failed=1    skipped=1123 rescued=0    ignored=1   
kube-2                     : ok=738  changed=104  unreachable=0    failed=0    skipped=780  rescued=0    ignored=1   
kube-3                     : ok=525  changed=78   unreachable=0    failed=0    skipped=309  rescued=0    ignored=0   
localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
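
For reference, the failing check corresponds to an assert in roles/network_plugin/calico/tasks/check.yml; a sketch of the relevant task, reconstructed from the error output above (the actual task may differ):

- name: Check if inventory match current cluster configuration
  assert:
    that:
      - not calico_pool_conf.spec.ipipMode is defined or calico_pool_conf.spec.ipipMode == calico_ipip_mode
    msg: "Your inventory doesn't match the current cluster configuration"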

Does this not happen in your environment?

The migration procedure is manual at the moment and not covered by kubespray code.

  1. Perform the migration while running kubespray 2.18.x:
calicoctl patch felixconfig default -p '{"spec":{"vxlanEnabled":true}}'
calicoctl patch ippool default-pool -p '{"spec":{"ipipMode":"Never", "vxlanMode":"Always"}}'   ## wait for the vxlan.calico interface to be created and traffic to be routed through it
calicoctl patch felixconfig default -p '{"spec":{"ipipEnabled":false}}'
  2. Run the cluster upgrade, at which point you should no longer experience a traffic interruption.
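
In between these steps, you can verify that the switch took effect (a quick sketch, assuming calicoctl is installed at /usr/local/bin as above):

# the pool should now show ipipMode: Never and vxlanMode: Always
calicoctl get ippool default-pool -o yaml | grep -E 'ipipMode|vxlanMode'
# the vxlan.calico interface should now exist on every node
ip -d link show vxlan.calico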

@cristicalin
Contributor

cristicalin commented Apr 8, 2022

It seems our check actually runs very late in the process, so I can see why this would break existing clusters. A simple fix would be to move this check much earlier, into the validation phase, and stop the upgrade before it breaks anything.

Looking at what the playbook does before running roles/network_plugin/calico/tasks/check.yml, it doesn't look like any changes are actually made to the cluster.

@ledroide can you share an ansible log with -vvv?

@cristicalin
Contributor

Some more info: commenting out the validation task in roles/network_plugin/calico/tasks/check.yml allows the playbook to complete, but our logic does not modify the default-pool calico ippool to change the encapsulation, which causes the traffic outage.

@cristicalin
Contributor

While we could update felixconfig and the ippool during the upgrade, there is still an issue with recycling the calico-node pods when calico_network_backend changes. Right now this is mapped from the kube-system/calico-config configmap; the playbook does update the configmap, but it does not recycle the calico-node pods. This should not be an issue when moving from ipip to vxlan, since vxlan can continue to use the bird backend, but it would be an issue if we made this kind of change generic. If we end up allowing encapsulation changes through kubespray parameters, then folks will start using and relying on it, and might end up with broken environments if we don't cover all possible transition scenarios.
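
For reference, recycling the pods by hand would be something along these lines (a manual step, not something the playbook currently does):

kubectl -n kube-system rollout restart daemonset/calico-node
kubectl -n kube-system rollout status daemonset/calico-node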

Considering the implications here, I'm strongly inclined towards moving the sanity check early in the playbook, just stopping the execution, and documenting the manual steps to be performed pre-upgrade.

/cc @floryut @oomichi what do you think?

@floryut
Member

floryut commented Apr 12, 2022

Sorry about the lag; moving the sanity check earlier is a no-brainer for me 👍

@ledroide
Contributor Author

Hello @cristicalin.
Here is how I tested:

  1. Set back the inventory specs in group_vars/k8s_cluster/k8s-net-calico.yaml; I removed these 3 lines from my previous workaround:
calico_ipip_mode: Always
calico_vxlan_mode: Never
calico_network_backend: bird
  2. From the HEAD of the kubespray repo (master branch, commit_id aef5f1e), I cherry-picked your changes:
git fetch [email protected]:cristicalin/kubespray.git check_calico_encapsulation_early
git cherry-pick f38cf5581ba7dbd235e5ef6fb78b95808531c223
git log --oneline
3b2e5a17 [calico] call calico checks early on to prevent altering the cluster with bad configuration
aef5f1e1 Add tz to kubespray image
3d4baea0 Add tag to AWS VPC subnets for automatic subnet discovery by load balancers or ingress controllers (#8705)
  3. Ran cluster.yml as usual.
  4. Checked calico status and interfaces on a random node:
  • ip addr ls | grep vxlan -> none
  • sudo /usr/local/bin/calicoctl node status -> None of the BGP backend processes (BIRD or GoBGP) are running
  5. Rollback:
  • revert:
$ git revert 3b2e5a173bc2241377d5bdd24f3bd1312c2e2e73
[master 37cf9a74] Revert "[calico] call calico checks early on to prevent altering the cluster with bad configuration"
 5 files changed, 99 insertions(+), 102 deletions(-)
  • set ipip mode back in my group_vars/k8s_cluster/k8s-net-calico.yaml file
  • run cluster.yml
  • sudo /usr/local/bin/calicoctl node status -> all nodes are up

I wanted to test without any manual steps, in order to check what happens to a random user who applies the default cluster.yml as usual.
Tell me if you want me to try another way, or to check again after some change in your fork.

@cristicalin
Contributor

@ledroide could you share the log from point 3? The playbook should have stopped with an assertion error like this:

fatal: [kube-1]: FAILED! => {                                                                            
    "assertion": "not calico_pool_conf.spec.ipipMode is defined or calico_pool_conf.spec.ipipMode == calico_ipip_mode",
    "changed": false,                                                                                    
    "evaluated_to": false,                                                                               
    "msg": "Your inventory doesn't match the current cluster configuration"                             
} 

@cristicalin
Contributor

I just re-tested in a vagrant environment, upgrading from the release-2.18 branch to my PR branch, and I reliably get the assertion failure as expected, with no interruption in traffic.

@ledroide Are you setting ignore_assert_errors=True in your ansible inventory vars?

@ledroide
Contributor Author

Are you setting ignore_assert_errors=True in your ansible inventory vars?

@cristicalin This value is not set at all in my inventory, and there was no assertion failure when I ran the cluster.yml playbook.

@cristicalin
Contributor

Could you share the execution logs of ansible-playbook -vvv?

@ledroide
Contributor Author

Hello @cristicalin
I have just tested, following your instructions "Migrating from IP in IP to VXLAN" in calico.md.
Pulled kubespray at commit 3f06591.
The migration to vxlan mode looks good. However, I was surprised that calicoctl node status no longer shows the node list, while calicoctl get nodes -o wide does show them. I guess this is normal behavior: with the vxlan backend, BIRD is no longer running, so there are no BGP sessions for node status to report.

Suggestion: in calico.md, right after "IP in IP mode" and before "BGP mode":

VXLAN mode

To configure VXLAN mode:

calico_ipip_mode: Never
calico_vxlan_mode: Always
calico_network_backend: vxlan
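
And perhaps a quick post-migration check right after it (my suggestion; assumes calicoctl is installed at /usr/local/bin as above):

# nodes are still listed, even though node status no longer shows BGP peers
sudo /usr/local/bin/calicoctl get nodes -o wide
# the vxlan.calico interface should exist on every node
ip -d link show vxlan.calico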
