
[Calico] Fix delay setting up ip routes in new nodes #4589

Merged
3 commits merged into kubernetes:master on Mar 17, 2018

Conversation

felipejfc
Contributor

Same as PR #4588, but with the latest changes from master.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 6, 2018
@mikesplain
Contributor

This is awesome, thanks for your work on this @felipejfc.

/ok-to-test
/assign

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 6, 2018
@mikesplain
Contributor

We've been dealing with this for a while too, so I'll give this a test as well.

@@ -469,8 +469,8 @@ func (b *BootstrapChannelBuilder) buildManifest() (*channelsapi.Addons, map[stri
 	key := "networking.projectcalico.org"
 	versions := map[string]string{
 		"pre-k8s-1.6": "2.4.2-kops.1",
-		"k8s-1.6":     "2.6.7-kops.1",
-		"k8s-1.7":     "2.6.7-kops.1",
+		"k8s-1.6":     "2.6.8-kops.1",
Contributor

We just need to update the suffix here (i.e., 2.6.7-kops.2).

Contributor Author

The reason I bumped the version to 2.6.8 is that, if I didn't, existing clusters would not get the current deployment and daemonset updated.

Contributor

Right, but 2.6.8 would imply a different image, which is not the case; it's just a config change, right? I agree with @robinpercy: we shouldn't need to bump the version, only the kops.1 -> kops.2 suffix.

Contributor Author

oh, got it!

@mikesplain
Contributor

/assign robinpercy

@felipejfc
Contributor Author

felipejfc commented Mar 6, 2018

Version is now fixed @robinpercy @mikesplain

@mikesplain
Contributor

Awesome! Thanks @felipejfc, I'm going to test this locally now!

@robinpercy
Contributor

Thanks @felipejfc!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 7, 2018
@mikesplain
Contributor

I'm having issues testing this at the moment, will circle back tomorrow, but I think this should solve my issue too. Thanks so much for the contribution!

/lgtm

@mikesplain
Contributor

I was able to get a test of this through and it worked great! Thanks so much @felipejfc and @caseydavenport!

@robinpercy
Contributor

/assign @chrislovecnm

Contributor

@chrislovecnm left a comment

Looks great. Nitpicks in the docs. Appreciate the help!

#### Calico troubleshooting
##### New nodes are taking minutes to sync IP routes and new pods on them can't reach kubedns
This is caused by nodes in the Calico etcd nodestore no longer existing. Due to the ephemeral nature of AWS EC2 instances, new nodes are brought up with different hostnames, and nodes that are taken offline remain in the Calico nodestore. This is unlike most datacentre deployments where the hostnames are mostly static in a cluster. Read more about this issue at https://github.com/kubernetes/kops/issues/3224
This has been solved in kops 1.8.2. When creating a new cluster no action is needed, but if the cluster was created with a prior kops version, the following actions should be taken:
Contributor

This will need to be updated to kops 1.9.0.

Contributor Author

ok!

This is caused by nodes in the Calico etcd nodestore no longer existing. Due to the ephemeral nature of AWS EC2 instances, new nodes are brought up with different hostnames, and nodes that are taken offline remain in the Calico nodestore. This is unlike most datacentre deployments where the hostnames are mostly static in a cluster. Read more about this issue at https://github.com/kubernetes/kops/issues/3224
This has been solved in kops 1.8.2. When creating a new cluster no action is needed, but if the cluster was created with a prior kops version, the following actions should be taken:
* Use kops to update the cluster ```kops update cluster <name> --yes``` and wait for the calico-kube-controllers deployment and calico-node daemonset to be updated
* Delete all calico-node pods in the kube-system namespace, so that they are recreated with the new CALICO_K8S_NODE_REF env var and update the current nodes in etcd
Contributor

Can we give a kubectl command?

Contributor Author

@felipejfc Mar 9, 2018

I did it manually, and right now I have no time to think of a command that would do everything :/ Maybe @mikesplain or @caseydavenport have a script for that?

Contributor

@mikesplain Mar 9, 2018

Actually, I would suggest just rolling over the whole cluster... Deleting the pods could create temporary connectivity loss, couldn't it? I drain and roll the cluster anytime I need to change core daemonsets like this.

Contributor Author

@felipejfc Mar 9, 2018

Actually not; I always do `kubectl delete pods -n kube-system -l "k8s-app=calico-node" --force --grace-period=0`.
Deleting calico-node pods does not delete the iptables rules and interfaces that were already created, so no problem there.
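
For reference, the full sequence discussed above would look roughly like this (a sketch only; `<cluster-name>` is a placeholder and the label selector assumes the stock kops Calico manifests):

```sh
# 1. Push the updated addon manifests and wait for the calico-kube-controllers
#    deployment and the calico-node daemonset to pick up the change.
kops update cluster <cluster-name> --yes

# 2. Recreate the calico-node pods so they come back up with CALICO_K8S_NODE_REF set.
#    Existing iptables rules and interfaces are left in place, so this should not
#    interrupt traffic on the node.
kubectl delete pods -n kube-system -l "k8s-app=calico-node" --force --grace-period=0
```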

Contributor Author

@chrislovecnm do you mean a kubectl command to delete the calico-node pods, or one to delete the invalid nodes in etcd storage? The latter is the one I have no time to provide; the former is in my comment above.

Contributor

@felipejfc Fair enough! I wasn't certain. I have an open-sourced script/container that we could use: https://github.com/mikesplain/calico-clean/blob/master/calico-clean.sh

We would just need to convert https://github.com/mikesplain/calico-clean/blob/master/CronJob.yaml into a one-time job.
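
As a rough sketch of that conversion (names are placeholders; `kubectl create job --from` needs kubectl 1.10 or newer, otherwise the Job spec would have to be pulled out of the CronJob manifest by hand):

```sh
# Apply the CronJob manifest from the calico-clean repo (downloaded locally),
# then trigger it once as a plain Job instead of waiting for the schedule.
kubectl apply -f CronJob.yaml
kubectl create job calico-clean-once --from=cronjob/<cronjob-name> -n <namespace>
```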

Member

@caseydavenport Mar 9, 2018

Do we need to delete the calico-node pods in a second step? Won't upgrading the cluster trigger a rolling update due to the env change?

If not, couldn't we enable a rolling update strategy on the daemonset so that it would?

Deleting all workload pods shouldn't be necessary (nor should rolling the whole cluster).

Contributor Author

I agree, it should be OK to add a rolling update strategy @caseydavenport. What do you guys think @mikesplain @chrislovecnm @robinpercy?

Contributor

Personally, I've never used a rolling update strategy on a daemonset. I could see cases where upgrading an entire cluster would trigger a daemonset upgrade that may not be intended on older nodes... We could put it behind a config value or something. I have mixed feelings... Thoughts @chrislovecnm @robinpercy?

Member

It should only trigger when changes are made to the DaemonSet (not other things in the cluster) so I suspect it's OK and desirable.

We enable it by default in upstream Calico manifests.

This has been solved in kops 1.8.2. When creating a new cluster no action is needed, but if the cluster was created with a prior kops version, the following actions should be taken:
* Use kops to update the cluster ```kops update cluster <name> --yes``` and wait for the calico-kube-controllers deployment and calico-node daemonset to be updated
* Delete all calico-node pods in the kube-system namespace, so that they are recreated with the new CALICO_K8S_NODE_REF env var and update the current nodes in etcd
* Decommission all invalid nodes, [see here](https://docs.projectcalico.org/v2.6/usage/decommissioning-a-node)
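
For the decommissioning step, a minimal sketch with calicoctl (assuming calicoctl is configured against the cluster's Calico etcd datastore; the node name is a placeholder):

```sh
# List the nodes Calico knows about and compare against `kubectl get nodes`.
calicoctl get nodes

# Delete each node that no longer exists in the cluster.
calicoctl delete node <stale-node-name>
```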
Contributor

Should we just recommend rolling the cluster?

Contributor Author

Rolling the whole cluster has a higher overhead; what do you think about just pointing to it as an alternative?

@ese
Contributor

ese commented Mar 9, 2018

@felipejfc
Contributor Author

felipejfc commented Mar 9, 2018

I guess the damage this does to clusters using Canal is smaller @ese, due to flannel doing the routing part (right?); still, I don't see why not to port it...

@mikesplain
Contributor

@felipejfc If you want to open the canal change, I'd make it a separate PR so we can get this one through, and likely that one will move quicker once we figure out a few of these details :)

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2018
@felipejfc
Contributor Author

So, I've applied an updateStrategy to the calico-node daemonset (as per @caseydavenport's recommendation), removed the "delete calico-node pods" step from the docs as it's no longer necessary, and updated the "fixed on" version to 1.9.0.
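
For reference, the hand-applied equivalent of that manifest change would look roughly like this (a sketch only; the PR itself makes the change in the kops addon template, and the RollingUpdate strategy for DaemonSets requires Kubernetes 1.6+):

```sh
# Switch the calico-node daemonset to a RollingUpdate strategy so that changes
# to its pod template roll out one node at a time.
kubectl -n kube-system patch daemonset calico-node --type merge \
  -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":1}}}}'
```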

@mikesplain
Contributor

Great! Thanks so much @felipejfc and @caseydavenport!

/lgtm

Mind taking a look again, @chrislovecnm?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 13, 2018
@chrislovecnm
Contributor

/lgtm

We need to reference the doc in the release notes.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrislovecnm, felipejfc, mikesplain, robinpercy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2018
@k8s-ci-robot k8s-ci-robot merged commit 98ba08f into kubernetes:master Mar 17, 2018