Vault considers itself healthy when can't write to audit log #11949
Comments
I am not sure if this qualifies as a reason for marking the server as unhealthy. I'll take this up internally to discuss with the team and get back to you if I have an update.
We've had some internal discussions and agree that this is a worthwhile and doable enhancement. There are still some details that would need to be ironed out, such as:
Hi @ncabatoff, I don't really know how the health check works internally. As for flapping, I'm not sure it needs to be addressed separately, since it impacts all requests the same way, so it should be handled like any other request.
I assume something is missing here https://github.com/hashicorp/vault/blob/main/http/sys_health.go#L73, health-check is giving misleading information about cluster health (500 vs. 200) when audit device is blocked, maybe a call to

$ curl -i -H "X-Vault-Token: $(cat ~/.vault-token)" $VAULT_ADDR/v1/sys/mounts
HTTP/1.1 500 Internal Server Error
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:28 GMT
Content-Length: 30

{"errors":["internal error"]}

$ curl -i $VAULT_ADDR/v1/sys/health?standbyok=true
HTTP/1.1 200 OK
Cache-Control: no-store
Content-Type: application/json
Date: Thu, 08 Jul 2021 12:18:38 GMT
Content-Length: 294

{"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":1625746718,"version":"1.7.3","cluster_name":"vault-cluster-a9f6d609","cluster_id":"1268406e-9df0-d74b-43a0-a2c8354afbe6"}

Any reason not to make these two consistent?
That would certainly be an easy solution. Two problems I have with that approach: first, some systems may be configured to hit this endpoint quite often, and if we turn on auditing we may fill up the disks of those clusters. Picture a system that normally sees 1 audited request per minute, but whose status endpoint is hit 4 times per minute, and that suddenly needs 5 times the audit storage it did before. Second problem: the purpose of audit records is primarily to track authenticated requests, because unauthenticated requests generally don't matter from a secrets perspective. Users may rightly object to an audit log filled with irrelevant status requests that they don't care about.
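The arithmetic behind the first objection can be written out as a small sketch, using the rates from the example (1 audited request and 4 health checks per minute). The function name is hypothetical, and the sketch assumes each audited health check would take roughly as much storage as an ordinary audit record.

```shell
# Hypothetical helper: factor by which the number of audit records grows
# if health checks were audited too. Uses integer division, so the result
# is rounded down when the rates do not divide evenly.
audit_growth_factor() {
    requests_per_min=$1
    health_checks_per_min=$2
    echo $(( (requests_per_min + health_checks_per_min) / requests_per_min ))
}

audit_growth_factor 1 4   # prints 5
```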
Since we don't like the idea of doing "probes" (fake audit records) to discover the state of the audit device, here's what we're left with in terms of a plan:
I don't know if we're going to work on this anytime soon. Most users who've been bitten by audit-related failures in the past now use two different audit devices to protect against this, since Vault will allow requests to proceed provided at least one audit device succeeds. We're willing to accept a PR for the above plan, but since this is a critical part of Vault request handling we'll be very picky about implementation details.
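As an illustration of the two-device workaround, here is a minimal sketch. The function name, the mount paths `file_primary` and `file_secondary`, and the log locations are assumptions made for illustration only; `VAULT_BIN` is a hypothetical variable that lets the commands be previewed without a running server.

```shell
# VAULT_BIN is a hypothetical override; set it to "echo" for a dry run.
VAULT_BIN=${VAULT_BIN:-vault}

# Enable two file audit devices at different mount paths. Placing the two
# log files on different disks is an assumption of this sketch, so that a
# full disk or a permissions problem on one does not affect the other.
enable_redundant_audit() {
    primary_log=$1
    secondary_log=$2
    "$VAULT_BIN" audit enable -path=file_primary file file_path="$primary_log" &&
    "$VAULT_BIN" audit enable -path=file_secondary file file_path="$secondary_log"
}

# Usage (paths are placeholders):
#   enable_redundant_audit /var/log/vault/audit.log /mnt/audit2/vault-audit.log
```

Running it with `VAULT_BIN=echo` prints the two `audit enable` command lines instead of executing them, which is a cheap way to review the arguments first.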
We ran into the exact same issue;
I realise that relating the failed audit device(s) and health-check is difficult, but in my opinion: Vault is broken, the only thing that still works is the health endpoint. A bit ironic even. ;-)
We also ran into this issue just last week. The disk filled up on the active node in a cluster.
Describe the bug
Due to a misconfiguration on our side, logrotate created the audit log file with the wrong permissions.
After that, Vault did not serve any requests because it could not write to the audit log (intended behavior).
But the health check reported that Vault was healthy.
I believe that if Vault cannot respond to requests, it should be marked as unhealthy.
To Reproduce
Steps to reproduce the behavior:
touch /tmp/vault-audit.log
sudo chmod 0640 /tmp/vault-audit.log
vault audit enable file file_path=/tmp/vault-audit.log
vault status
vault secrets list
sudo chown root:root /tmp/vault-audit.log
kill -1 $VAULT_PID
vault status
vault secrets list
backend log
Expected behavior
Some indication that the server is unable to serve requests, and a 5xx HTTP status code when accessing the health endpoint.