
Vault considers itself healthy when can't write to audit log #11949

Open
DrMagPie opened this issue Jun 28, 2021 · 8 comments

Comments

@DrMagPie

Describe the bug
Due to some misconfiguration on our side, logrotate created the audit log file with the wrong permissions.
After that, Vault did not serve any requests because it could not write to the audit log (intended behavior).
But the health check reported that Vault was healthy.

I believe that if Vault cannot respond to a request, it should be marked as unhealthy.

To Reproduce
Steps to reproduce the behavior:

  1. touch /tmp/vault-audit.log
  2. sudo chmod 0640 /tmp/vault-audit.log
  3. vault audit enable file file_path=/tmp/vault-audit.log
  4. vault status

Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false

  5. vault secrets list

Path          Type         Accessor              Description
----          ----         --------              -----------
cubbyhole/    cubbyhole    cubbyhole_ca631048    per-token private secret storage
identity/     identity     identity_36ea49cf     identity store
secret/       kv           kv_2f024240           key/value secret storage
sys/          system       system_00aaccc1       system endpoints used for control, policy and debugging

  6. sudo chown root:root /tmp/vault-audit.log
  7. kill -1 $VAULT_PID
  8. vault status

Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false

  9. vault secrets list

URL: GET http://127.0.0.1:8200/v1/sys/mounts
Code: 500. Errors:

* internal error

backend log

==> Vault reload triggered
2021-06-28T10:49:28.564Z [ERROR] no config found at reload time
2021-06-28T10:49:28.564Z [INFO]  audit: reloading file audit backend: path=file/
Error(s) were encountered during reload: 1 error occurred:
        * error encountered reloading file audit device at path "file/": open /tmp/vault-audit.log: permission denied


2021-06-28T10:49:48.321Z [ERROR] audit: backend failed to log request: backend=file/ error="open /tmp/vault-audit.log: permission denied"
2021-06-28T10:49:48.321Z [ERROR] core: failed to audit request: path=sys/mounts
  error=
  | 1 error occurred:
  |     * no audit backend succeeded in logging the request
  |

2021-06-28T10:49:48.321Z [ERROR] audit: backend failed to log response: backend=file/ error="open /tmp/vault-audit.log: permission denied"
2021-06-28T10:49:48.321Z [ERROR] core: failed to audit response: request_path=sys/mounts
  error=
  | 1 error occurred:
  |     * no audit backend succeeded in logging the response
  |

Expected behavior
Some indication that the server is unable to serve requests, and an HTTP status code in the 5XX range when accessing the health endpoint.

@DrMagPie DrMagPie changed the title Vault considers it self healthy when can't write to audit log Vault considers itself healthy when can't write to audit log Jun 28, 2021
@vishalnayak
Contributor

I am not sure if this qualifies as a reason for marking the server as unhealthy. I'll take this up internally to discuss with the team and get back to you if I have an update.

@ncabatoff
Collaborator

We've had some internal discussions and agree that this is a worthwhile and doable enhancement. There are still some details that would need to be ironed out, such as:

  • how do we determine when it's healthy again - should we retry failed audit requests until they succeed, for example?
  • how do we avoid flapping when there are intermittent write failures, e.g. a near-full disk where log rotation is periodically making a little bit of space that then fills up again?

@DrMagPie
Author

DrMagPie commented Jul 8, 2021

Hi @ncabatoff

I don't really know how the health check works internally,
but I think the easiest solution to this issue would be to simply audit-log /v1/sys/health requests.
In that case Vault would respond to the health check the same way it responds to all other requests.

As for flapping, I'm not sure it needs to be addressed separately, since it impacts all requests the same way, so it would be handled like any other request.
As an added bonus, this would surface underlying issues such as a lack of storage or poorly configured log rotation.

@avoidik
Contributor

avoidik commented Jul 8, 2021

I assume something is missing here: https://github.com/hashicorp/vault/blob/main/http/sys_health.go#L73. The health check gives misleading information about cluster health (200 while audited requests return 500) when the audit device is blocked; maybe a call to generateAuditTestProbe would help.

$ curl -i -H "X-Vault-Token: $(cat ~/.vault-token)" $VAULT_ADDR/v1/sys/mounts
HTTP/1.1 500 Internal Server Error
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:28 GMT
Content-Length: 30

{"errors":["internal error"]}
$ curl -i $VAULT_ADDR/v1/sys/health?standbyok=true
HTTP/1.1 200 OK
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:38 GMT
Content-Length: 294

{"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":1625746718,"version":"1.7.3","cluster_name":"vault-cluster-a9f6d609","cluster_id":"1268406e-9df0-d74b-43a0-a2c8354afbe6"}

any reason not to make these two consistent?
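
In the meantime, the inconsistency can at least be detected from outside. Below is a minimal Go probe sketch (assuming VAULT_ADDR and VAULT_TOKEN are set, and that hitting an audited endpoint like /v1/sys/mounts from a health check is acceptable for your cluster) that flags the node when sys/health says 200 while audited requests return 5xx:

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	addr := os.Getenv("VAULT_ADDR")
	token := os.Getenv("VAULT_TOKEN")

	// /v1/sys/health is unauthenticated and not audited, so it can keep
	// returning 200 even while every audited request fails with 500.
	health, err := http.Get(addr + "/v1/sys/health?standbyok=true")
	if err != nil {
		fmt.Fprintln(os.Stderr, "health request failed:", err)
		os.Exit(2)
	}
	health.Body.Close()

	// An audited, authenticated endpoint such as /v1/sys/mounts reveals
	// the real state: 500 when no audit device can log the request.
	req, err := http.NewRequest(http.MethodGet, addr+"/v1/sys/mounts", nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, "building request failed:", err)
		os.Exit(2)
	}
	req.Header.Set("X-Vault-Token", token)
	mounts, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, "mounts request failed:", err)
		os.Exit(2)
	}
	mounts.Body.Close()

	fmt.Printf("sys/health: %d, sys/mounts: %d\n", health.StatusCode, mounts.StatusCode)
	if health.StatusCode == http.StatusOK && mounts.StatusCode >= 500 {
		fmt.Println("inconsistent: health reports OK while audited requests fail")
		os.Exit(1)
	}
}

An exit code of 1 then signals "unhealthy" to whatever supervises the probe.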

@ncabatoff
Collaborator

ncabatoff commented Jul 8, 2021

But I think that easiest solution to this issue would be just to audit log /v1/sys/health requests.

That would certainly be an easy solution. Two problems I have with that approach: first, some systems may be configured to hit this endpoint quite often, and if we turn on auditing we may fill up the disks of those clusters. Picture a system that normally sees 1 audited request per minute, but whose status endpoint is hit 4 times per minute, and that suddenly needs 5 times the audit storage it did before. Second problem: the purpose of audit records is primarily to track authenticated requests, because unauthenticated requests generally don't matter from a secrets perspective. Users may rightly object to an audit log filled with irrelevant status requests that they don't care about.

@ncabatoff
Collaborator

Since we don't like the idea of doing "probes" (fake audit records) to discover the state of the audit device, here's what we're left with in terms of a plan:

  1. Use a circuit breaker around audit writes: after enough failures have occurred over a given period, make the Vault health endpoint return a non-healthy state. Using a circuit breaker will help reduce flapping due to occasional failures (see the sketch after this list).
  2. Periodically, reset the health status of vault nodes that are in an audit-unhealthy state. In other words, we're guessing/hoping that the situation may have been resolved, and we want any load balancer that might be avoiding sending traffic to us to try again. This is the alternative to writing fake audit records to discover when auditing starts working again.
  3. Expose metrics relating to audit failures so that additional insight is available without having to look in logs.
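
To make the shape of this concrete, here is a rough sketch of such a breaker; this is not Vault code and every name in it is hypothetical. Audit-write failures within a window trip it, and a periodic reset implements the optimistic recovery from step 2:

// Sketch only, not Vault's implementation. Core request handling would
// call RecordFailure/RecordSuccess around audit writes, and the
// sys/health handler would consult Healthy.
package audit

import (
	"sync"
	"time"
)

type Breaker struct {
	mu        sync.Mutex
	threshold int           // failures within window before tripping
	window    time.Duration // sliding window for counting failures
	resetWait time.Duration // how long to stay unhealthy before retrying
	failures  int
	firstFail time.Time
	trippedAt time.Time
}

func NewBreaker(threshold int, window, resetWait time.Duration) *Breaker {
	return &Breaker{threshold: threshold, window: window, resetWait: resetWait}
}

// RecordFailure is called whenever no audit device managed to log a request.
func (b *Breaker) RecordFailure(now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.failures == 0 || now.Sub(b.firstFail) > b.window {
		b.firstFail, b.failures = now, 0 // start a new counting window
	}
	b.failures++
	if b.failures >= b.threshold && b.trippedAt.IsZero() {
		b.trippedAt = now // trip: sys/health should now return a 5xx code
	}
}

// RecordSuccess is called when an audit write succeeds; it closes the breaker.
func (b *Breaker) RecordSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures, b.trippedAt = 0, time.Time{}
}

// Healthy reports whether the health endpoint should claim the node is
// healthy. Once resetWait has elapsed, the breaker resets and the node
// reports healthy again so load balancers will retry it (step 2 above).
func (b *Breaker) Healthy(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.trippedAt.IsZero() {
		return true
	}
	if now.Sub(b.trippedAt) >= b.resetWait {
		b.trippedAt, b.failures = time.Time{}, 0 // periodic optimistic reset
		return true
	}
	return false
}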

I don't know if we're going to work on this anytime soon. Most users who've been bitten by audit-related failures in the past now use two different audit devices to protect against this, since Vault will allow requests to proceed provided at least one audit device succeeds (example below). We're willing to accept a PR for the above plan, but since this is a critical part of Vault request handling we'll be very picky about implementation details.
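
For completeness, here is a sketch of that two-device mitigation using the official Go client (github.com/hashicorp/vault/api); the mount paths and destinations are illustrative, and the second device should live on independent infrastructure for the redundancy to help:

package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// DefaultConfig honors VAULT_ADDR; NewClient picks up VAULT_TOKEN.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// First device: file-based audit log.
	if err := client.Sys().EnableAuditWithOptions("file", &vault.EnableAuditOptions{
		Type:    "file",
		Options: map[string]string{"file_path": "/var/log/vault-audit.log"},
	}); err != nil {
		log.Fatal(err)
	}

	// Second device: syslog. With two devices enabled, requests proceed
	// as long as at least one of them can log.
	if err := client.Sys().EnableAuditWithOptions("syslog", &vault.EnableAuditOptions{
		Type:    "syslog",
		Options: map[string]string{"tag": "vault", "facility": "AUTH"},
	}); err != nil {
		log.Fatal(err)
	}
}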

@robertdebock
Contributor

We ran into the exact same issue:

  • No audit device left (Vault is broken)
  • The health endpoint returns "okay".

I realise that relating the failed audit device(s) to the health check is difficult, but in my opinion: Vault is broken, and the only thing that still works is the health endpoint. A bit ironic, even. ;-)

This issue is also described here and here.

@abarclay

We also ran into this issue just last week: the disk filled up on the active node in a cluster.
It really would have been nice for the health checks to fail when this happens.
