Calico role fails migration from ipip to vxlan mode #8691

Closed
ledroide opened this issue Apr 6, 2022 · 13 comments · Fixed by #8707
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ledroide
Contributor

ledroide commented Apr 6, 2022

Effects

Networking inside Kubernetes (pods, services, etc.) no longer works after upgrading to Kubernetes 1.23.5 with kubespray at commit id 0481dd9.

Symptoms

Summary:

  • Many IP addresses from Services and pods are unreachable between nodes.
  • Restarting DaemonSet/calico-node does not solve it.
  • Restarting the kubelet on all nodes does not solve it.
  • Rebooting the nodes has no effect.
  • TCP port 179 answers normally to other nodes while the Calico process is running.
  • Calico can't see the other nodes:
# sudo /usr/local/bin/calicoctl node status
Calico process is running.
None of the BGP backend processes (BIRD or GoBGP) are running.
  • logs from calico-node report that the vxlan.calico interface does not exist:
calico-node-746kz calico-node 2022-04-06 10:20:56.345 [ERROR][65] felix/route_table.go 951: Failed to get link attributes error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
calico-node-746kz calico-node 2022-04-06 10:20:56.345 [INFO][65] felix/route_table.go 558: Interface missing, will retry if it appears. ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
calico-node-wdjb9 calico-node 2022-04-06 10:20:56.348 [INFO][67] felix/route_table.go 1116: Failed to access interface because it doesn't exist. error=Link not found ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
calico-node-wdjb9 calico-node 2022-04-06 10:20:56.348 [INFO][67] felix/route_table.go 1184: Failed to get interface; it's down/gone. error=Link not found ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
calico-node-wdjb9 calico-node 2022-04-06 10:20:56.348 [ERROR][67] felix/route_table.go 951: Failed to get link attributes error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4 tableIndex=0
  • There is no interface matching "^vxlan.calico$" on any of the nodes (see the check below).
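A quick way to confirm this on a node (a minimal check, assuming standard iproute2 tooling; the interface name comes from the felix logs above):

# list all vxlan interfaces; on an affected node this prints nothing
ip -d link show type vxlan
# or query the specific interface felix is waiting for
ip link show vxlan.calico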

Versions

  • Kubespray commit id 0481dd9 (previously fetched 672e47a, which was OK, so the regression appeared somewhere in 672e47a..0481dd9)
  • quay.io/calico/node:v3.21.4
  • Kubernetes 1.23.5
  • CRI-O 1.23.2

Workaround

Set ipip mode back, as documented in docs/calico.md, and run the cluster.yml playbook.

Here is my group_vars/k8s_cluster/k8s-net-calico.yaml configuration (only the last 3 lines were added):

calico_datastore: kdd
calico_node_livenessprobe_timeout: 11
calico_node_readinessprobe_timeout: 11
calico_ipip_mode: Always
calico_vxlan_mode: Never
calico_network_backend: bird
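
Then re-run the playbook; a typical invocation (the inventory path is an example, adjust it to your layout):

ansible-playbook -i inventory/mycluster/hosts.yaml -b cluster.yml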

You can check with calicoctl that calico works again:

# sudo /usr/local/bin/calicoctl node status
Calico process is running.
IPv4 BGP status
+----------------+-------------------+-------+----------+-------------+
|  PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+----------------+-------------------+-------+----------+-------------+
| 10.150.232.209 | node-to-node mesh | up    | 11:45:36 | Established |
| 172.16.64.150  | node-to-node mesh | up    | 11:46:16 | Established |
| 10.150.233.51  | node-to-node mesh | up    | 11:45:35 | Established |
| 10.150.233.52  | node-to-node mesh | up    | 11:45:34 | Established |
| 10.150.233.53  | node-to-node mesh | up    | 11:45:34 | Established |
| 10.150.233.54  | node-to-node mesh | up    | 11:45:35 | Established |
| 10.150.233.42  | node-to-node mesh | up    | 11:45:34 | Established |
| 10.150.233.43  | node-to-node mesh | up    | 11:45:38 | Established |
+----------------+-------------------+-------+----------+-------------+

Assumption

The issue arises with this change:

$ git show dd2d95e --name-only
commit dd2d95ecdf5e25db2433e7b10132844b60dbe619
Author: Cristian Calin <[email protected]>
Date:   Fri Mar 18 03:05:39 2022 +0200
    [calico] don't enable ipip encapsulation by default and use vxlan in CI (#8434)
    * [calico] make vxlan encapsulation the default
    * don't enable ipip encapsulation by default
    * set calico_network_backend by default to vxlan
    * update sample inventory and documentation
    * [CI] pin default calico parameters for upgrade tests to ensure proper upgrade
    * [CI] improve netchecker connectivity testing
    * [CI] show logs for tests
    * [calico] tweak task name
    * [CI] Don't run the provisioner from vagrant since we run it in testcases_run.sh
    * [CI] move kube-router tests to vagrant to avoid network connectivity issues during netchecker check
    * service proxy mode still fails connectivity tests so keeping it manual mode
    * [kube-router] account for containerd use-case
docs/calico.md
docs/setting-up-your-first-cluster.md
docs/vars.md
inventory/sample/group_vars/k8s_cluster/k8s-net-calico.yml
roles/kubernetes/preinstall/tasks/0020-verify-settings.yml
roles/network_plugin/calico/defaults/main.yml
roles/network_plugin/calico/tasks/check.yml
roles/network_plugin/calico/tasks/install.yml
roles/network_plugin/calico/templates/calico-config.yml.j2
roles/network_plugin/calico/templates/calico-node.yml.j2
roles/network_plugin/kube-router/templates/kube-router.yml.j2
(...)

What is expected

Moving from one default to another should come with a migration script or guide.

If vxlan mode becomes the recommended mode, then I would like to migrate.

Unfortunately, there is nothing that:

  • warns me that this change will break my cluster network layer
  • or takes care of the migration process
  • or creates the required vxlan.calico interfaces on the nodes
  • or gives instructions for this case
ledroide added the kind/bug label on Apr 6, 2022
@cristicalin
Contributor

@ledroide note that you are deploying from an unreleased, in-development branch. The release notes will contain guidelines about this breaking change in the defaults and the flags that will need to be set in your ansible inventory to ensure existing deployments are not broken.

Moving from one encapsulation to another is not quite straightforward, but it had to be done to work around issues we detected with ipip out of the box. The new vxlan default is considered a future-proof approach for the long-term sustainability of the project, and existing deployments will have to set backwards-compatible flags to retain the old behaviour.

@ledroide
Contributor Author

ledroide commented Apr 7, 2022

@cristicalin

Moving from one encapsulation to another is not quite straightforward but it had to be done to work around issues

This is basically the purpose of this issue. I'm available to test further enhancements of the calico role, until it no longer triggers a network breakdown on existing clusters. Thanks for your answer.
Serge

@cristicalin
Contributor

Currently, a defaults-to-defaults migration should fail in the validation stage if you have not configured your encapsulation parameters to match the existing environment.

TASK [network_plugin/calico : Check if inventory match current cluster configuration] ******************* 
task path: /root/kubespray/kubespray/roles/network_plugin/calico/tasks/check.yml:52                     
fatal: [kube-1]: FAILED! => {                                                                            
    "assertion": "not calico_pool_conf.spec.ipipMode is defined or calico_pool_conf.spec.ipipMode == calico_ipip_mode",
    "changed": false,                                                                                    
    "evaluated_to": false,                                                                               
    "msg": "Your inventory doesn't match the current cluster configuration"                             
}                                                                                                        
                                                                                                         
NO MORE HOSTS LEFT ************************************************************************************** 

PLAY RECAP ********************************************************************************************** 
kube-1                     : ok=1443 changed=184  unreachable=0    failed=1    skipped=1123 rescued=0    ignored=1   
kube-2                     : ok=738  changed=104  unreachable=0    failed=0    skipped=780  rescued=0    ignored=1   
kube-3                     : ok=525  changed=78   unreachable=0    failed=0    skipped=309  rescued=0    ignored=0   
localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
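
For reference, the failing check corresponds to an assert in roles/network_plugin/calico/tasks/check.yml; a sketch of the relevant task, reconstructed from the error output above (the actual task may differ):

- name: Check if inventory match current cluster configuration
  assert:
    that:
      - not calico_pool_conf.spec.ipipMode is defined or calico_pool_conf.spec.ipipMode == calico_ipip_mode
    msg: "Your inventory doesn't match the current cluster configuration"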

Does this not happen in your environment?

The migration procedure is manual at the moment and not covered by kubespray code.

  1. Perform the migration while running kubespray 2.18.x:
calicoctl patch felixconfig default -p '{"spec":{"vxlanEnabled":true}}'
calicoctl patch ippool default-pool -p '{"spec":{"ipipMode":"Never", "vxlanMode":"Always"}}'   ## wait for the vxlan.calico interface to be created and traffic to be routed through it
calicoctl patch felixconfig default -p '{"spec":{"ipipEnabled":false}}'
  2. Run the cluster upgrade, at which point you should no longer experience a traffic interruption.
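
In between these steps, you can verify that the switch took effect (a quick sketch, assuming calicoctl is installed at /usr/local/bin as above):

# the pool should now show ipipMode: Never and vxlanMode: Always
calicoctl get ippool default-pool -o yaml | grep -E 'ipipMode|vxlanMode'
# the vxlan.calico interface should now exist on every node
ip -d link show vxlan.calico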

@cristicalin
Contributor

cristicalin commented Apr 8, 2022

It seems our check actually runs very late in the process, so I can see why this would break existing clusters. A simple fix would be to move this check much earlier, into the validation phase, and stop the upgrade before it breaks anything.

Looking at what the playbook does before running roles/network_plugin/calico/tasks/check.yml, it doesn't look like any changes are actually made to the cluster.

@ledroide can you share an ansible log with -vvv?

@cristicalin
Contributor

Some more info: commenting out the validation task in roles/network_plugin/calico/tasks/check.yml allows the playbook to complete, but our logic does not modify the default-pool calico ippool to change the encapsulation, which causes the traffic outage.

@cristicalin
Contributor

While we could update felixconfig and the ippool during the upgrade, there is still an issue with recycling the calico-node pods when calico_network_backend changes. Right now this is mapped from the kube-system/calico-config configmap; the playbook does update the configmap, but it does not recycle the calico-node pods. This should not be an issue when moving from ipip to vxlan, since vxlan can continue to use the bird backend, but it would be an issue if we made this kind of change generic. If we end up allowing encapsulation changes through kubespray parameters, then folks will start using and relying on it, and might end up with broken environments if we don't cover all possible transition scenarios.
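
For reference, recycling the pods by hand would be something along these lines (a manual step, not something the playbook currently does):

kubectl -n kube-system rollout restart daemonset/calico-node
kubectl -n kube-system rollout status daemonset/calico-node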

Considering the implications here, I'm strongly inclined towards moving the sanity check early in the playbook, just stopping the execution, and documenting the manual steps to be performed pre-upgrade.

/cc @floryut @oomichi what do you think?

@floryut
Member

floryut commented Apr 12, 2022

Sorry about the lag; moving the sanity check earlier is a no-brainer for me 👍

@ledroide
Contributor Author

Hello @cristicalin.
Here is how I tested:

  1. Set back the inventory specs in group_vars/k8s_cluster/k8s-net-calico.yaml; I removed these 3 lines from my previous workaround:
calico_ipip_mode: Always
calico_vxlan_mode: Never
calico_network_backend: bird
  2. From the HEAD of the kubespray repo (master branch, commit_id aef5f1e), I cherry-picked your changes:
git fetch [email protected]:cristicalin/kubespray.git check_calico_encapsulation_early
git cherry-pick f38cf5581ba7dbd235e5ef6fb78b95808531c223
git log --oneline
3b2e5a17 [calico] call calico checks early on to prevent altering the cluster with bad configuration
aef5f1e1 Add tz to kubespray image
3d4baea0 Add tag to AWS VPC subnets for automatic subnet discovery by load balancers or ingress controllers (#8705)
  3. Ran cluster.yml as usual.
  4. Checked calico status and interfaces on a random node:
  • ip addr ls | grep vxlan -> none
  • sudo /usr/local/bin/calicoctl node status -> None of the BGP backend processes (BIRD or GoBGP) are running
  5. Rollback:
  • revert:
$ git revert 3b2e5a173bc2241377d5bdd24f3bd1312c2e2e73
[master 37cf9a74] Revert "[calico] call calico checks early on to prevent altering the cluster with bad configuration"
 5 files changed, 99 insertions(+), 102 deletions(-)
  • set ipip mode back in my group_vars/k8s_cluster/k8s-net-calico.yaml file
  • run cluster.yml
  • sudo /usr/local/bin/calicoctl node status -> all nodes are up

I wanted to test without any manual steps, in order to check what happens to a random user who applies the default cluster.yml as usual.
Tell me if you want me to try another way, or to check again after some change in your fork.

@cristicalin
Contributor

@ledroide could you share the log from point 3? The playbook should have stopped with an assertion error like this:

fatal: [kube-1]: FAILED! => {                                                                            
    "assertion": "not calico_pool_conf.spec.ipipMode is defined or calico_pool_conf.spec.ipipMode == calico_ipip_mode",
    "changed": false,                                                                                    
    "evaluated_to": false,                                                                               
    "msg": "Your inventory doesn't match the current cluster configuration"                             
} 

@cristicalin
Contributor

I just re-tested in a vagrant environment, upgrading from the release-2.18 branch to my PR branch, and I reliably get the assertion failure as expected, with no interruption in traffic.

@ledroide Are you setting ignore_assert_errors=True in your ansible inventory vars?

@ledroide
Contributor Author

Are you setting ignore_assert_errors=True in your ansible inventory vars?

@cristicalin This value is not set at all in my inventory, and there was no assertion failure when I ran the cluster.yml playbook.

@cristicalin
Contributor

Could you share the execution logs of ansible-playbook -vvv?

@ledroide
Contributor Author

Hello @cristicalin
I have just tested, following your instructions "Migrating from IP in IP to VXLAN" in calico.md.
Pulled kubespray at commit 3f06591.
The migration to vxlan mode looks good. However, I was surprised that calicoctl node status no longer shows the node list, while calicoctl get nodes -o wide does show them. I guess this is normal behavior: with the vxlan backend, BIRD is no longer running, so there are no BGP sessions for node status to report.

Suggestion: in calico.md, right after "IP in IP mode" and before "BGP mode":

VXLAN mode

To configure VXLAN mode:

calico_ipip_mode: Never
calico_vxlan_mode: Always
calico_network_backend: vxlan
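
And perhaps a quick post-migration check right after it (my suggestion; assumes calicoctl is installed at /usr/local/bin as above):

# nodes are still listed, even though node status no longer shows BGP peers
sudo /usr/local/bin/calicoctl get nodes -o wide
# the vxlan.calico interface should exist on every node
ip -d link show vxlan.calico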
