Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flannel stripping node taints #667

Closed
justinsb opened this issue Apr 10, 2017 · 15 comments
Closed

flannel stripping node taints #667

justinsb opened this issue Apr 10, 2017 · 15 comments

Comments

@justinsb
Copy link

justinsb commented Apr 10, 2017

Seeing some odd behaviour: kubelet should be registering a node with 1.6 taints (in a field) - I believe - but then the taints are not there when I look. I think this only happens with flannel.

I hypothesize that the following is happening:

  • kubelet registers the node with taints
  • flannel is still using the 1.5 APIs, which don't know about the taint field, so the taint is "gone" when it reads them
  • when flannels updates the node it writes back the version without the taint

Is this possible? If not, how do we protect against this scenario (cc @davidopp for that one I think)

@davidopp
Copy link

So IIUC the situation is that flannel is effectively a 1.5 client, running against a 1.6 master, that reads the Node object, makes some changes in memory, and then writes it back (does the Update() function you linked to use PATCH or PUT/POST? May not matter though) and the taints field gets lost because the field exists in 1.6 but not 1.5.

I think what you described is definitely possible. I'm actually not sure how to fix it (other than upgrading the client, of course).

cc/ @lavalamp or @xuchao to verify and hopefully provide some suggestion.

@justinsb
Copy link
Author

justinsb commented Apr 11, 2017

That is my hypothesis for what I have otherwise not yet been able to explain, so I thought I would ask here in flannel if it makes sense and if it has been observed.

There's also a general API issue which is how do we deal with this... protobuf has "unknown fields" but I don't think we have something similar for JSON?

And I think Update == PUT:


// Update takes the representation of a node and updates it. Returns the server's representation of the node, and an error, if there is any.
func (c *nodes) Update(node *v1.Node) (result *v1.Node, err error) {
        result = &v1.Node{}
        err = c.client.Put().
                Resource("nodes").
                Name(node.Name).
                Body(node).
                Do().
                Into(result)
        return
}

@justinsb
Copy link
Author

Probable confirmation (I believe):

  • I readded the taint to the node
  • I deleted the backendDataAnnotation (flannel.alpha.coreos.com/backend-data) annotation from the node
  • I restarted (all) the canal pods.
  • I manually polled the node for the taint and watched the pods restart.
  • As soon as the pods had restarted (and not before), I observed the taint was removed from the node.

This was actually with canal, not flannel (the original report was observed with flannel though); if there is still doubt let me know and I can repeat experiments or try to do something more scientific.

@davidopp
Copy link

Here's an example of using Patch() instead of Put():
https://github.com/kubernetes/kubernetes/blob/3b1d2343a84d3b8e3fd21554568e06b8594cdf26/pkg/controller/controller_utils.go#L935

But it looks like the diff is done on the client (I was hoping it was done on the server), so I don't think it would solve this problem.

@davidopp
Copy link

And I wonder if 1.6 master with 1.5 kubelet has the same problem when Kubelet updates the Node object to update status (even though taints are in spec, not status, the granularity is the whole object).

@wojtek-t
Copy link

I think that using PATCH would solve the problem.
The diff on the client-side is done to compute the actual patch (that is the send to apiserver). If the client doesn't understand some field, it shouldn't have it neither in the "original" object not in "patched" one. So the patch should actually be correct.
And when it is applied in apiserver, apiserver understands all fields, so the patched object should be exactly as expected.

That should also answer the question that 1.5 kubelet + 1.6 master should work (because kubelet is using PATCH).

So I guess the medium term solution is to migrate flannel (and canal whatever that is) to use PATCH. Though I don't have any good solution for what to do now (other than upgrading it to 1.6 too).

@davidopp
Copy link

Great - thanks @wojtek-t !

@lavalamp
Copy link

This is a longstanding problem in our client, that we delete fields the we don't understand. (Although even if we fix it by default there, users will still be able to get it wrong. But we're currently broken by default.)

Using PATCH is a pretty thorough fix.

@tomdee
Copy link
Contributor

tomdee commented Apr 12, 2017

So if I'm understanding this correctly, we should fairly urgently switch flannel to use the v1.6 client, and before 1.7 comes out, switch it to using patch?

@tomdee tomdee changed the title Is it possible that flannel strips node taints? flannel stripping node taints Apr 12, 2017
@tomdee
Copy link
Contributor

tomdee commented Apr 12, 2017

The flannel kube subnet manager uses the "k8s.io/kubernetes/pkg/client/cache" package which doesn't exist in 1.6. Is there anyone with more kubernetes client experience than me who could help fix this? cc @mikedanese

(Also this could be a chance to resolve/improve #674)

@tomdee
Copy link
Contributor

tomdee commented Apr 12, 2017

Urghh - it also imports "k8s.io/kubernetes/pkg/controller/framework" which was removed in v1.4 AFAICT

@davidopp
Copy link

I believe the idea is that you can stick to the 1.5 client if you switch it to use patch, OR you can switch to the 1.6 client. But yeah, you should definitely switch to using patch before 1.7 either way.

@euank
Copy link

euank commented Apr 12, 2017

@tomdee I believe it ended up in tools/cache in client-go 3.

See https://github.com/kubernetes/client-go/tree/release-3.0/tools/cache

Moving to client-go rather than using kubernetes/pkg directly should be the easiest update path IMO.

@diegs
Copy link

diegs commented Apr 18, 2017

I believe this is fixed by #681.

There's also #677 to move to client-go.

(Big ups to @mikedanese for putting both of those together)

@tomdee
Copy link
Contributor

tomdee commented Apr 20, 2017

Yep - the 0.7.1 release should have the PATCH fix in and the lates flannel-git (https://quay.io/repository/coreos/flannel-git?tab=tags) releases should have the client-go fixes in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants