Calico role fails migration from ipip to vxlan mode #8691
@ledroide, note that you are deploying from an unreleased, in-development branch. The release notes will contain guidelines about this breaking change in the defaults and which flags you will need to set in your Ansible inventory so that existing deployments are not broken. Moving from one encapsulation to another is not quite straightforward, but the change had to be made to work around issues we detected with IPIP out of the box. The new
This is basically the purpose of this issue. I'm available to test further enhancements to the calico role until it no longer breaks networking on existing clusters. Thanks for your answer.
Currently, a defaults-to-defaults migration should fail in the validation stage if you have not configured your encapsulation parameters to match the existing environment.
Does this not happen in your environment? The migration procedure is manual at the moment and is not covered by kubespray code.
calicoctl patch felixconfig default -p '{"spec":{"vxlanEnabled":true}}'
calicoctl patch ippool default-pool -p '{"spec":{"ipipMode":"Never", "vxlanMode":"Always"}}'
# wait for the vxlan.calico interface to be created and traffic to be routed through it
calicoctl patch felixconfig default -p '{"spec":{"ipipEnabled":false}}'
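The three commands above can be wrapped so that IPIP is only disabled once the VXLAN interface shows up. This is a minimal bash sketch, not part of kubespray: the function names `wait_for_iface` and `migrate_ipip_to_vxlan` are hypothetical, it assumes `calicoctl` is already configured for the cluster's datastore, and it assumes the pool is named `default-pool` as in the commands above. The interface check only looks at the node where the script runs, so it does not prove that traffic is actually routed over VXLAN across the whole cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helper library: source it, then call migrate_ipip_to_vxlan.

# Poll until a network interface exists on this node, or give up.
# Args: interface name, max attempts (default 30), delay in seconds (default 5).
wait_for_iface() {
  local iface="$1" attempts="${2:-30}" delay="${3:-5}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    if ip link show "$iface" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  echo "timed out waiting for interface $iface" >&2
  return 1
}

# Run the manual IPIP -> VXLAN steps in order, stopping at the first failure.
# Arg: IP pool name (default "default-pool", an assumption).
migrate_ipip_to_vxlan() {
  local pool="${1:-default-pool}"
  # 1. Let Felix program VXLAN while IPIP is still enabled.
  calicoctl patch felixconfig default -p '{"spec":{"vxlanEnabled":true}}' || return 1
  # 2. Switch the pool's encapsulation from IPIP to VXLAN.
  calicoctl patch ippool "$pool" -p '{"spec":{"ipipMode":"Never","vxlanMode":"Always"}}' || return 1
  # 3. Do not disable IPIP until vxlan.calico exists on this node.
  wait_for_iface vxlan.calico || return 1
  # 4. Turn IPIP off in Felix.
  calicoctl patch felixconfig default -p '{"spec":{"ipipEnabled":false}}'
}
```

Usage, assuming the file is saved under the hypothetical name `migrate.sh`: `source migrate.sh && migrate_ipip_to_vxlan default-pool`. Failing before step 4 leaves both encapsulations enabled in Felix, which is the safer state to be stuck in.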
It seems our check actually runs very late in the process, so I can see why this would break existing clusters. A simple fix would be to move this check much earlier, into the validation phase, and stop the upgrade before it breaks anything. Looking at the changes the playbook makes before running
@ledroide, can you share an Ansible log with
Some more info: commenting out the validation task in
While we could update felixconfig and the ippool during the upgrade, there is still an issue with recycling the calico-node pods when
Considering the implications here, I'm strongly inclined to move the sanity check early in the playbook, simply stop execution, and document the manual steps to perform before the upgrade.
Sorry about the lag. Moving the sanity check earlier is a no-brainer for me 👍
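An early sanity check essentially compares the inventory's encapsulation settings with what the running cluster already uses, before any task changes anything. This is a minimal bash sketch of that comparison, not kubespray's actual task: `check_encapsulation`, `pool_mode` and `preflight` are hypothetical names, and reading the pool through `kubectl` assumes the Kubernetes (KDD) datastore, where Calico stores IP pools as `ippools.crd.projectcalico.org` objects. An unset `ipipMode` or `vxlanMode` is treated as `Never`, which is Calico's default for IP pools.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check, meant to run before cluster.yml.

# Compare wanted vs. current encapsulation modes.
# Args: inventory ipip mode, inventory vxlan mode, pool ipipMode, pool vxlanMode.
# Empty pool modes default to "Never", matching Calico's IPPool defaults.
check_encapsulation() {
  local want_ipip="$1" want_vxlan="$2"
  local have_ipip="${3:-Never}" have_vxlan="${4:-Never}"
  if [[ "$want_ipip" != "$have_ipip" || "$want_vxlan" != "$have_vxlan" ]]; then
    echo "inventory (ipip=$want_ipip, vxlan=$want_vxlan) does not match" \
      "existing pool (ipip=$have_ipip, vxlan=$have_vxlan); migrate manually first" >&2
    return 1
  fi
  return 0
}

# Print one field of the pool's spec (empty output if the field is unset).
# Assumes the KDD datastore and kubectl access to the cluster.
pool_mode() {
  local field="$1" pool="${2:-default-pool}"
  kubectl get ippools.crd.projectcalico.org "$pool" -o "jsonpath={.spec.$field}"
}

# Fail before any change if the inventory disagrees with the running cluster.
# Args: inventory ipip mode, inventory vxlan mode, pool name (default "default-pool").
preflight() {
  local want_ipip="$1" want_vxlan="$2" pool="${3:-default-pool}"
  # No pool yet (fresh install): nothing to compare against.
  if ! kubectl get ippools.crd.projectcalico.org "$pool" >/dev/null 2>&1; then
    return 0
  fi
  check_encapsulation "$want_ipip" "$want_vxlan" \
    "$(pool_mode ipipMode "$pool")" "$(pool_mode vxlanMode "$pool")"
}
```

On a cluster still running IPIP, `preflight Always Never` passes, while the new VXLAN defaults (`preflight Never Always`) fail with an error before anything has been touched, which is the behaviour the thread asks for.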
Hello @cristicalin.
calico_ipip_mode: Always
calico_vxlan_mode: Never
calico_network_backend: bird
git fetch git@github.com:cristicalin/kubespray.git check_calico_encapsulation_early
git cherry-pick f38cf5581ba7dbd235e5ef6fb78b95808531c223
git log --oneline
$ git revert 3b2e5a173bc2241377d5bdd24f3bd1312c2e2e73
[master 37cf9a74] Revert "[calico] call calico checks early on to prevent altering the cluster with bad configuration"
5 files changed, 99 insertions(+), 102 deletions(-)
I wanted to test without manual steps, to see what happens to an ordinary user who runs the default cluster.yml as usual.
@ledroide, could you share the log from point 3? The playbook should have stopped with an assertion error like this:
I just re-tested on a Vagrant environment upgrading from
@ledroide, are you setting
@cristicalin, this value is not set at all in my inventory, and no assertion was raised when I ran the cluster.yml playbook.
Could you share the execution logs of
Hello @cristicalin,
Suggestion: in calico.md, right after "IP in IP mode" and before "BGP mode", add:
VXLAN mode
To configure VXLAN mode:
calico_ipip_mode: Never
calico_vxlan_mode: Always
calico_network_backend: vxlan
Effects
Networking inside Kubernetes (pods, services, etc.) does not work anymore after upgrading to Kubernetes 1.23.5 with kubespray at commit id 0481dd9.
Symptoms
Summary:
Versions
Workaround
Switch back to IPIP mode as documented in docs/calico.md, then run the cluster.yml playbook.
Here is my group_vars/k8s_cluster/k8s-net-calico.yaml configuration (I just added the last 3 lines):
You can check with calicoctl that Calico works again:
Assumption
The issue arises from this change:
What is expected
Moving from one default to another should come with a migration script or guide.
If VXLAN mode becomes the recommended mode, then I would like to migrate.
Unfortunately, there is nothing that: