Add metrics to identify audit logging failures #2863

Closed
edjackson-wf opened this issue Jun 14, 2017 · 6 comments

@edjackson-wf

edjackson-wf commented Jun 14, 2017

It would be very helpful to be able to alert an operator when there are audit logging failures. Because this information isn't available from the /sys/health endpoint, we need some other means.

I would suggest the addition of some appropriate metrics in the telemetry, so alerting can be done from statsd/statsite.

I don't have strong feelings about exactly what the metrics should be. Being able to monitor 500 response codes would certainly help, or maybe the audit logging backends should provide more specific error metrics.
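
As a rough illustration of the "monitor 500 response codes" option, here is a minimal Go sketch of an HTTP middleware that counts 5xx responses. It is not Vault code: `errorCounter`, `countServerErrors`, `simulate`, and the request path are hypothetical names chosen for the example, and a real setup would forward the counter to statsd/statsite rather than keep it in memory.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// errorCounter is a hypothetical stand-in for a telemetry counter.
type errorCounter struct{ n int64 }

func (c *errorCounter) Incr()        { atomic.AddInt64(&c.n, 1) }
func (c *errorCounter) Value() int64 { return atomic.LoadInt64(&c.n) }

// statusRecorder remembers the status code the wrapped handler wrote.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// countServerErrors wraps next and increments c for every 5xx response.
func countServerErrors(c *errorCounter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		if rec.status >= 500 {
			c.Incr()
		}
	})
}

// simulate replays the given status codes through the middleware and
// returns how many of them were counted as server errors.
func simulate(statuses []int) int64 {
	c := &errorCounter{}
	i := 0
	h := countServerErrors(c, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(statuses[i])
		i++
	}))
	for range statuses {
		// The path is illustrative only.
		req := httptest.NewRequest(http.MethodGet, "/v1/secret/example", nil)
		h.ServeHTTP(httptest.NewRecorder(), req)
	}
	return c.Value()
}

func main() {
	fmt.Println(simulate([]int{200, 500, 204, 503})) // prints 2
}
```

A counter like this cannot tell an audit failure apart from any other server error, which is one reason more specific metrics from the audit backends may be preferable.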

See also this conversation.

@jefferai jefferai added this to the 0.7.4 milestone Jun 15, 2017
@csawyerYumaed
Contributor

I just had this issue in production this morning.

We had a failure in our audit backend (the file was rotated, but the HUP signal was not sent for some reason).

vault log:

2017/06/19 05:41:59.431861 [ERROR] audit: backend failed to log response: backend=file/ error=write vault_audit.log: bad file descriptor
2017/06/19 05:41:59.431874 [ERROR] core: failed to audit response: request_path=auth/token/renew-self error=no audit backend succeeded in logging the response
2017/06/19 05:42:26.046577 [ERROR] audit: backend failed to log request: backend=file/ error=write vault_audit.log: bad file descriptor
2017/06/19 05:42:26.046605 [ERROR] core: failed to audit request: path=auth/token/renew-self error=1 error occurred:

  • no audit backend succeeded in logging the request

Yet the health checks continued to pass, despite 500 errors on every read/write call to Vault. This seems... wrong somehow :)

Ideally, I'd love it if Vault, on hitting a bad FD, would just try to re-open the file as if a HUP signal had arrived, especially since auditing is crucial to a properly operating Vault system.
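
To make the re-open idea concrete, here is a minimal Go sketch of a writer that re-opens its path once when a write fails. This is not how Vault's file audit backend is implemented; `reopeningWriter`, `openAppend`, and `demo` are hypothetical names, and the sketch assumes that retrying the whole entry after a failed write is acceptable.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// reopeningWriter is a hypothetical file writer that re-opens its path
// once when a write fails, roughly what a HUP-triggered reload would do.
type reopeningWriter struct {
	mu   sync.Mutex
	path string
	f    *os.File
}

func openAppend(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
}

func newReopeningWriter(path string) (*reopeningWriter, error) {
	f, err := openAppend(path)
	if err != nil {
		return nil, err
	}
	return &reopeningWriter{path: path, f: f}, nil
}

// Write tries the current descriptor first. On any error it re-opens the
// path and retries exactly once; a partially written entry would be
// repeated in full, which this sketch accepts.
func (w *reopeningWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if n, err := w.f.Write(p); err == nil {
		return n, nil
	}
	w.f.Close() // best effort; the descriptor is already unusable
	f, err := openAppend(w.path)
	if err != nil {
		return 0, err
	}
	w.f = f
	return w.f.Write(p)
}

func (w *reopeningWriter) Close() error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.f.Close()
}

// demo closes the descriptor behind the writer's back and returns the
// file contents after one more write.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "audit-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "vault_audit.log")
	w, err := newReopeningWriter(path)
	if err != nil {
		return "", err
	}
	defer w.Close()
	if _, err := w.Write([]byte("first\n")); err != nil {
		return "", err
	}
	w.f.Close() // simulate the bad file descriptor
	if _, err := w.Write([]byte("second\n")); err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	return string(data), err
}

func main() {
	out, err := demo()
	fmt.Printf("%q %v\n", out, err)
}
```

Note that this only helps when the write itself returns an error, as in the log above. A file that is simply renamed by a rotation tool keeps accepting writes on the old descriptor, so a retry like this would never trigger in that case.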

Also, the /sys/health check should probably at least WARN, if not outright FAIL, if it can't write the audit log for whatever reason.

@jefferai
Member

@edjackson-wf Any chance you can tell me if you think https://github.com/hashicorp/vault/pull/3001/files meets your needs? I figured that an incrementing counter is probably the right way to go.

@edjackson-wf
Author

@jefferai Yes, I think that would work for me.

I suppose it might be worth considering the case where multiple audit backends are enabled and one fails. Is it worth distinguishing between audit failures that cause requests to fail and those that don't? It's not my use case, but it seems plausible.
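
To illustrate the distinction, here is a minimal Go sketch that keeps one counter per failing backend and a separate counter for requests that are rejected because no backend succeeded. It is not taken from the linked pull request: `backend`, `stubBackend`, `broker`, and `run` are hypothetical names, and the rule that a request fails only when every backend fails is an assumption based on the "no audit backend succeeded" error in the log above.

```go
package main

import (
	"errors"
	"fmt"
)

// backend is a hypothetical audit sink; the name is illustrative only.
type backend interface {
	LogRequest(entry string) error
}

// stubBackend either always fails or always succeeds.
type stubBackend struct{ fail bool }

func (s stubBackend) LogRequest(string) error {
	if s.fail {
		return errors.New("bad file descriptor")
	}
	return nil
}

// broker fans a request out to every backend and keeps two kinds of
// counters: backendFailures counts each individual failure, while
// requestFailures counts requests that had to be rejected because no
// backend succeeded. It is not safe for concurrent use.
type broker struct {
	backends        map[string]backend
	backendFailures map[string]int
	requestFailures int
}

func (b *broker) LogRequest(entry string) error {
	failed, succeeded := 0, 0
	for name, be := range b.backends {
		if err := be.LogRequest(entry); err != nil {
			b.backendFailures[name]++
			failed++
			continue
		}
		succeeded++
	}
	// Assumption: the request fails only if at least one backend was
	// tried and none of them logged the entry.
	if failed > 0 && succeeded == 0 {
		b.requestFailures++
		return errors.New("no audit backend succeeded in logging the request")
	}
	return nil
}

// run sends n requests through a broker with two stub backends and
// returns the failure count for "file/" and the number of rejected requests.
func run(fileFails, syslogFails bool, n int) (fileFailures, rejected int) {
	b := &broker{
		backends: map[string]backend{
			"file/":   stubBackend{fail: fileFails},
			"syslog/": stubBackend{fail: syslogFails},
		},
		backendFailures: map[string]int{},
	}
	for i := 0; i < n; i++ {
		_ = b.LogRequest("auth/token/renew-self")
	}
	return b.backendFailures["file/"], b.requestFailures
}

func main() {
	fmt.Println(run(true, false, 3)) // one backend down, requests still succeed: prints "3 0"
	fmt.Println(run(true, true, 2))  // every backend down, requests are rejected: prints "2 2"
}
```

In the first case only the per-backend counter moves, which is the "bad situation waiting to happen"; in the second both counters move and clients see errors.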

@jefferai
Member

@edjackson-wf We can add more specific metrics later if needed, but I'd argue that any time that counter is going up continually there's a bad situation just waiting to happen, regardless of which backend is currently experiencing the problem. At that point logs will tell the rest.

@edjackson-wf
Author

@jefferai Fair enough. Thanks a bunch for adding this.

@jefferai
Member

No problem!
