Healthcheck endpoint #616

Closed
rosenhouse opened this issue Feb 22, 2017 · 3 comments

rosenhouse commented Feb 22, 2017

As an operator, I would like to be alerted if my flanneld is unable to connect to etcd.

Currently, if flannel loses its connection to etcd, it prints error messages to stderr but does not fail.

I could write some tooling to process those logs and alert me if certain strings are printed.

But I'd rather have some kind of healthcheck, e.g. I set up a script to periodically curl a special endpoint on localhost. If I get back a 200 OK then I know flanneld is healthy. Otherwise, I can raise an alert, maybe restart the VM, etc.

I could imagine the healthcheck returning some basic information, like when it last renewed its lease with etcd. But the important thing is a simple status code that could be easily interpreted and acted on.

A CNI plugin could even probe this healthcheck during the ADD action. That way it could ensure that flanneld is alive, and that the subnet.env file on disk is up-to-date.

Would you be open to a PR like this?

cc: @rusha19 @mcwumbly @jaydunk

rosenhouse (Author) commented

Ping @lxpollitt. We're probably going to implement this in some form on a fork in order to support integration with Cloud Foundry. Any input you have would be appreciated.

tomdee (Contributor) commented Mar 9, 2017

Sounds useful to me.

genevieve pushed a commit to cf-container-networking/flannel that referenced this issue Mar 9, 2017
Begins to address flannel-io#616

Signed-off-by: Gabe Rosenhouse <[email protected]>

andyxning mentioned this issue May 17, 2017

jsravn commented May 3, 2018

@tomdee Wonder why you closed this? It's a pretty big gap. We recently had an outage because of this and there is no way to monitor flanneld health properly at the moment, as far as I can tell.
