Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option 'elide_list_responses' to audit backends #18128

Merged
merged 7 commits into from
Jan 11, 2023

Conversation

maxb
Copy link
Contributor

@maxb maxb commented Nov 28, 2022

This PR relates to a feature request logged through HashiCorp commercial support: https://app.asana.com/0/112957393038545/1201850823939444/f

Vault lacks pagination in its APIs. As a result, certain list operations can return very large responses. The user's chosen audit sinks may experience difficulty consuming audit records that swell to tens of megabytes of JSON.

In our case, one of the systems consuming audit log data could not cope, and failed.

The responses of list operations are typically not very interesting, as they are mostly lists of keys, or, even when they include a "key_info" field, are not returning confidential information. They become even less interesting once HMAC-ed by the audit system.

Some example Vault "list" operations that are prone to becoming very large in an active Vault installation are:

auth/token/accessors/
identity/entity/id/
identity/entity-alias/id/
pki/certs/

In response, I've coded a new option that can be applied to audit backends, elide_list_responses. When enabled, response data is elided from audit logs, only when the operation type is "list".

For added safety, the elision only applies to the "keys" and "key_info" fields within the response data - these are conventionally the only fields present in a list response - see logical.ListResponse, and logical.ListResponseWithInfo. However, other fields are technically possible if a plugin author writes unusual code, and these will be preserved in the audit log even with this option enabled.

The elision replaces the values of the "keys" and "key_info" fields with an integer count of the number of entries. This allows even the elided audit logs to still be useful for answering questions like "Was any data returned?" or "How many records were listed?".


There are obviously a few things missing in this PR - changelog, documentation, tests.

For now, the intent is to start a conversation with HashiCorp about the proposed implementation, and get that validated before filling in the rest.

maxb added 2 commits November 28, 2022 16:09
This PR relates to a feature request logged through HashiCorp commercial
support.

Vault lacks pagination in its APIs. As a result, certain list operations
can return **very** large responses.  The user's chosen audit sinks may
experience difficulty consuming audit records that swell to tens of
megabytes of JSON.

In our case, one of the systems consuming audit log data could not cope,
and failed.

The responses of list operations are typically not very interesting, as
they are mostly lists of keys, or, even when they include a "key_info"
field, are not returning confidential information. They become even less
interesting once HMAC-ed by the audit system.

Some example Vault "list" operations that are prone to becoming very
large in an active Vault installation are:

    auth/token/accessors/
    identity/entity/id/
    identity/entity-alias/id/
    pki/certs/

In response, I've coded a new option that can be applied to audit
backends, `elide_list_responses`. When enabled, response data is elided
from audit logs, only when the operation type is "list".

For added safety, the elision only applies to the "keys" and "key_info"
fields within the response data - these are conventionally the only
fields present in a list response - see logical.ListResponse, and
logical.ListResponseWithInfo. However, other fields are technically
possible if a plugin author writes unusual code, and these will be
preserved in the audit log even with this option enabled.

The elision replaces the values of the "keys" and "key_info" fields with
an integer count of the number of entries. This allows even the elided
audit logs to still be useful for answering questions like "Was any data
returned?" or "How many records were listed?".
@ncabatoff
Copy link
Collaborator

Hi Max,

This makes sense, and indeed I was in a discussion just last week re this issue. I wonder if rather than doing it globally on a per audit device basis, it would be better to make it configurable per request path? So just as you can tune a mount to specify keys in the request/response object that shouldn't be hmac'd, maybe you could tune a mount to specify LIST endpoints whose responses should be suppressed.

I'm not asking you to make this change at this point, this is just me thinking out loud. It's not mutually exclusive with your global approach anyway, maybe we want both.

@maxb
Copy link
Contributor Author

maxb commented Nov 30, 2022

That's an interesting question...

In one way, it would make things easier for me, in that I'm already dreading how we manage to deploy this in production, even if it's merged ... because currently Vault has no way to reconfigure an existing audit device. That's a problem, since toggling this new option would require fully removing the existing audit device and re-adding it. This causes two bad things to happen: 1) loss of audit data whilst the removal and addition is ongoing; 2) loss of the previous audit salt, so old audit logs cannot have values looked up via the sys/audit-hash/ API.

(I have been wondering whether it would be practical to add live reconfiguration to audit devices...)

However, there are other reasons I think setting this at the audit device level is the better:

  • Some users may want to turn this on for some audit devices, but off for others - for example an audit stream being transmitted to something that can't handle giant data records might make use of this, but the user might still want to capture unelided records to a file on disk via another audit device.

  • I expect the usual motivation for this feature to be because a particular audit device handles giant records poorly, or breaks entirely. In such a case, you really want to be confident you've fully mitigated the problem, rather than having to reactively respond to particular cases of giant list responses, which may newly arise as usage patterns of Vault change. In particular, if delegated access to create new mounts - e.g. via Vault namespaces - is in use, it may be difficult to ensure that suitable mount tuning options are used for all mounts.

  • Not all mounts even allow tuning at all - e.g. today, even in Vault 1.12, identity/ and sys/ can't be tuned at all. I do remember seeing a PR where someone was working on fixing this, so maybe it'll no longer be an issue in future, but in the short term, this is also another issue with the mount tuning approach.

I do agree that more flexibility is often a good thing - people do tend to come up with use-cases that beyond what designers originally imagined - but if this was to be able to be set in multiple places, we'd have to figure out overriding/inheritance semantics, which could make it complex to understand how to use.

Ultimately, I think the reason that a simple boolean per audit device could be sufficient in this case, is because list operations are by design constrained to work in a fairly similar way throughout Vault, regardless of the particular backend involved - they contain only a keys (and in select cases key_info) field in data, and I can't think of any case where a list would contain strongly confidential data.

@ncabatoff
Copy link
Collaborator

I'm persuaded. We can leave finer-grained configuration for a future iteration if needed.

(I have been wondering whether it would be practical to add live reconfiguration to audit devices...)

We're open to the idea of this, i.e. I don't think there's any reason why it hasn't been done until now other than no one's asked for it. Should be its own PR of course, and since it will require changes on the enterprise repo side, we'd have to do some work even if you got the ball rolling, but I don't think that'll be a problem, it just may not happen as quickly as one might like.

@cipherboy
Copy link
Contributor

Just a note, but in the PKI plugin, not only would this be a good idea (as cert lists can get long), but finer grained config would be good too: we recommend un-HMACing the certificate field to allow requests for cert issuance to be audited including the full body of the certificate (which is generally public information anyways).

However, various endpoints like /cert/crl return non-certificate data in that parameter as well, and the CRL in particular can grow large enough to panic syslog-only audit setups. The CRL is global and isn't per-user, so there's no real need to un-HMAC the certificate field on this response. My 2c.

@maxb
Copy link
Contributor Author

maxb commented Dec 20, 2022

@cipherboy Perhaps we should deprecate those endpoints in the PKI secrets engine and add replacements that are separate from the /cert/+ APIs? I had just assumed, that given how the audit_non_hmac_* options were designed, someone had at some point made a concious decision that plugin authors were required to exercise consistency in the naming of their request/response fields across different endpoints, so that the audit configuration complexity could be limited. I've not seen this written down anywhere, it's just a feeling I get from looking at the shape of the implementation. I fear much finer-grained configuration would start getting too cumbersome to use, especially as the precedent is to configure this via the API, rather than a configuration file.

@cipherboy
Copy link
Contributor

cipherboy commented Dec 20, 2022

@maxb There are already replacements, under the newer /issuer/:ref endpoints, that do not write the CRL to the certificate field.

I fear that any automation which pulls from cert/crl's JSON endpoint would need to be migrated if we deprecate it, which would be less than ideal and prone to breaking (likely a large number of) customers.

@vercel
Copy link

vercel bot commented Dec 21, 2022

@maxb is attempting to deploy a commit to the HashiCorp Team on Vercel.

A member of the Team first needs to authorize it.

@maxb
Copy link
Contributor Author

maxb commented Dec 21, 2022

I fear that any automation which pulls from cert/crl's JSON endpoint would need to be migrated if we deprecate it, which would be less than ideal and prone to breaking (likely a large number of) customers.

Well, I meant, deprecate and never remove, similar to how there's no plan that I know of to ever remove sys/renew even though it is canonically now at sys/leases/renew, or the undocumented obsolete identity/alias and identity/persona endpoints... but, this is beyond the scope of this PR, which deliberately avoids getting entangled in such wider complex configuration concerns! I'd be happy to continue the conversation in a separate issue if you like, though.


Back on topic for this PR, I've written tests... and, ahem, fixed the bug they revealed.

I'll work on the documentation update next.

maxb added 2 commits December 21, 2022 10:57
Also fix existing documentation which erroneously claimed the `mode`
option was usable with the `socket` and `syslog` backends.
@maxb maxb requested a review from tjperry07 as a code owner December 21, 2022 14:34
@maxb
Copy link
Contributor Author

maxb commented Dec 21, 2022

Documentation added.

I believe this PR is now ready for review and merge.

Copy link
Collaborator

@ncabatoff ncabatoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, I have no issues with the code, and I appreciate the nice tests and docs. We can merge as is, but I have one suggestion for you to consider first: maybe rather than replace the values for keys and key_info, it would be better to suppress them and introduce a new field, e.g. keys_count. Rationale: some folks might prefer if the type of these fields didn't change from list to int when the new config option is toggled, potentially breaking audit log parsing or ingestion.

@maxb
Copy link
Contributor Author

maxb commented Jan 10, 2023

Thanks for the feedback! Let me explain why I did it this way, and then you can tell me which trade-off you would prefer to prioritise.

Currently, there is a loose convention that every list response will contain a keys list, and a select few will also include a key_info map. I know of no list responses that contain other data fields, but there is no guarantee that this will always be the case - I wanted to have simple defined behaviour if other fields were present, and chose to just leave them entirely intact.

If I was to write a new field keys_count, I'd need to decide what should happen if the original response already contained a keys_count - overwrite it? Skip the elision process? I could dodge around this a bit by writing an elided_data_keys_count field into the root of the response (outside of the data map), I guess.

I also quite like the appeal of still being able to tell whether a particular response contained a key_info, or only a keys field. That is accomplished in quite a natural way in the current PR, but would need some kind of other additional field, if not done that way.

I do like the algorithmic simplicity of the current approach, but I'm OK with swapping it out for something a bit more complex and verbose, if you feel it's going to confuse some users too much.

@ncabatoff
Copy link
Collaborator

Currently, there is a loose convention that every list response will contain a keys list, and a select few will also include a key_info map. I know of no list responses that contain other data fields, but there is no guarantee that this will always be the case - I wanted to have simple defined behaviour if other fields were present, and chose to just leave them entirely intact.

That's fair, and it's certainly a concern I had too in proposing this. In my mind it seemed unlikely, but the future is always uncertain.

If I was to write a new field keys_count, I'd need to decide what should happen if the original response already contained a keys_count - overwrite it? Skip the elision process? I could dodge around this a bit by writing an elided_data_keys_count field into the root of the response (outside of the data map), I guess.

If we go this route, I'd say overwriting keys_count is a nonstarter, skipping elision is an option, and putting it outside the data map would be my first choice.

I also quite like the appeal of still being able to tell whether a particular response contained a key_info, or only a keys field. That is accomplished in quite a natural way in the current PR, but would need some kind of other additional field, if not done that way.

I do like the algorithmic simplicity of the current approach, but I'm OK with swapping it out for something a bit more complex and verbose, if you feel it's going to confuse some users too much.

No, I'm persuaded. The concern I have may not arise (people may not care enough about list responses to be digesting them in their pipelines), the new behaviour you're introducing is opt-in, so people using it can always adapt their ingestion code/schemas, and I don't want to impose extra complexity before we know that it's warranted.

@ncabatoff ncabatoff added this to the 1.13.0-rc1 milestone Jan 11, 2023
@ncabatoff ncabatoff merged commit aeb1b1e into hashicorp:main Jan 11, 2023
@maxb maxb deleted the elide_list_responses branch January 11, 2023 21:50
AnPucel pushed a commit that referenced this pull request Feb 3, 2023
This PR relates to a feature request logged through HashiCorp commercial
support.

Vault lacks pagination in its APIs. As a result, certain list operations
can return **very** large responses.  The user's chosen audit sinks may
experience difficulty consuming audit records that swell to tens of
megabytes of JSON.

In our case, one of the systems consuming audit log data could not cope,
and failed.

The responses of list operations are typically not very interesting, as
they are mostly lists of keys, or, even when they include a "key_info"
field, are not returning confidential information. They become even less
interesting once HMAC-ed by the audit system.

Some example Vault "list" operations that are prone to becoming very
large in an active Vault installation are:

    auth/token/accessors/
    identity/entity/id/
    identity/entity-alias/id/
    pki/certs/

In response, I've coded a new option that can be applied to audit
backends, `elide_list_responses`. When enabled, response data is elided
from audit logs, only when the operation type is "list".

For added safety, the elision only applies to the "keys" and "key_info"
fields within the response data - these are conventionally the only
fields present in a list response - see logical.ListResponse, and
logical.ListResponseWithInfo. However, other fields are technically
possible if a plugin author writes unusual code, and these will be
preserved in the audit log even with this option enabled.

The elision replaces the values of the "keys" and "key_info" fields with
an integer count of the number of entries. This allows even the elided
audit logs to still be useful for answering questions like "Was any data
returned?" or "How many records were listed?".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants