CRL config entries cannot be saved due to large size #22980
Comments
Hi @jbrandhorst, I just re-opened a related issue on the Consul repo that your post suggests was not fully closed. For context, how many issuers were within Vault's
Thanks @jkirschner-hashicorp - there were 1850 issuers
When an issuer is removed, the space used by its CRL was not freed, either in the CRL config mapping issuer IDs to CRL IDs or in the CRL storage entry. We thus implement a two-step cleanup, wherein orphaned CRL IDs are removed from the config and any remaining full CRL entries are removed from disk.

This relates to a Consul<->Vault interop issue (#22980), wherein Consul creates a new issuer on every leadership election, causing this config to grow. Deleting issuers manually does not entirely solve this problem, as the config does not fully reclaim the space used in this entry.

Notably, we observed that when issuers were deleted, the CRL was rebuilt on secondary clusters (because the invalidation does not care about the type of operation); for consistency, and to clean up the unified CRLs, we also need to run the rebuild on the active primary cluster that deleted the issuer.

This approach also allows cleanup on existing impacted clusters by simply rebuilding the CRL.

Co-authored-by: Steven Clark <[email protected]>
Signed-off-by: Alexander Scheel <[email protected]>
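For readers following along, here is a minimal sketch of the two-step cleanup this commit message describes, run on a toy in-memory model. It is not Vault's actual code (which is written in Go): the function name `cleanup_crl_state` and the set/dict shapes are assumptions made purely for illustration.

```python
def cleanup_crl_state(issuer_ids, issuer_to_crl, stored_crls):
    """Toy model of a two-step CRL cleanup (hypothetical, not Vault's code).

    issuer_ids:    set of issuer IDs that still exist
    issuer_to_crl: dict mapping issuer ID -> CRL ID (the "config" mapping)
    stored_crls:   dict mapping CRL ID -> stored CRL bytes (the "disk" entries)
    """
    # Step 1: drop config mappings that point from deleted issuers.
    live_mapping = {
        issuer: crl for issuer, crl in issuer_to_crl.items() if issuer in issuer_ids
    }

    # Step 2: drop stored CRLs that no remaining mapping references.
    referenced = set(live_mapping.values())
    live_crls = {crl: data for crl, data in stored_crls.items() if crl in referenced}

    return live_mapping, live_crls


if __name__ == "__main__":
    mapping, crls = cleanup_crl_state(
        issuer_ids={"issuer-a"},
        issuer_to_crl={"issuer-a": "crl-1", "issuer-b": "crl-2"},
        stored_crls={"crl-1": b"...", "crl-2": b"..."},
    )
    print(mapping, sorted(crls))
```

In this model, deleting `issuer-b` leaves both its mapping and its stored CRL orphaned until the cleanup runs, which is the kind of leftover space the commit message says was not being reclaimed.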
@jbrandhorst Many thanks for filing this. We definitely agree that (1) above is a bug we want to address on the Vault side; that will be fixed in #23007.
* Clean up unused CRL entries when issuer is removed

When an issuer is removed, the space used by its CRL was not freed, either in the CRL config mapping issuer IDs to CRL IDs or in the CRL storage entry. We thus implement a two-step cleanup, wherein orphaned CRL IDs are removed from the config and any remaining full CRL entries are removed from disk.

This relates to a Consul<->Vault interop issue (#22980), wherein Consul creates a new issuer on every leadership election, causing this config to grow. Deleting issuers manually does not entirely solve this problem, as the config does not fully reclaim the space used in this entry.

Notably, we observed that when issuers were deleted, the CRL was rebuilt on secondary clusters (because the invalidation does not care about the type of operation); for consistency, and to clean up the unified CRLs, we also need to run the rebuild on the active primary cluster that deleted the issuer.

This approach also allows cleanup on existing impacted clusters by simply rebuilding the CRL.

Co-authored-by: Steven Clark <[email protected]>
Signed-off-by: Alexander Scheel <[email protected]>

* Add test case on CRL removal

Signed-off-by: Alexander Scheel <[email protected]>

* Add changelog entry

Signed-off-by: Alexander Scheel <[email protected]>

---------

Signed-off-by: Alexander Scheel <[email protected]>
Co-authored-by: Steven Clark <[email protected]>
PR in progress on the Consul side: hashicorp/consul#18773
I noticed there's an
\o hey @ksmiley
Interesting thought... In general, simple managed rotation like this, without cross-signing, is the easy case for triggering this. As the number of issuers grows, the automated chain building that occurs when issuers are added will become a problem well before storage limitations do... I think. (Also, as an aside, the permissions model gets a bit wacky IMO... doable, but hard to manage.)
So you're absolutely right that it could help with this case, but I think you might not want a thousand or more issuers in the mount anyway :-)
This seems to have been addressed on both sides and will be in the next set of releases. Closing, thanks all! :-)
* VAULT-19237 Add mount_type to secret response
* VAULT-19237 changelog
* VAULT-19237 make MountType generic
* VAULT-19237 clean up comment
* VAULT-19237 update changelog
* VAULT-19237 update test, remove mounttype from wrapped responses
* VAULT-19237 fix a lot of tests
* VAULT-19237 standby test
* ensure -log-level is added to core config (#23017)
* Feature/document tls servername (#22714)
* Add Raft TLS Helm examples

Co-authored-by: Pascal Reeb <[email protected]>

---------

* Clean up unused CRL entries when issuer is removed (#23007)
* Clean up unused CRL entries when issuer is removed

When an issuer is removed, the space used by its CRL was not freed, either in the CRL config mapping issuer IDs to CRL IDs or in the CRL storage entry. We thus implement a two-step cleanup, wherein orphaned CRL IDs are removed from the config and any remaining full CRL entries are removed from disk.

This relates to a Consul<->Vault interop issue (#22980), wherein Consul creates a new issuer on every leadership election, causing this config to grow. Deleting issuers manually does not entirely solve this problem, as the config does not fully reclaim the space used in this entry.

Notably, we observed that when issuers were deleted, the CRL was rebuilt on secondary clusters (because the invalidation does not care about the type of operation); for consistency, and to clean up the unified CRLs, we also need to run the rebuild on the active primary cluster that deleted the issuer.

This approach also allows cleanup on existing impacted clusters by simply rebuilding the CRL.

Co-authored-by: Steven Clark <[email protected]>
Signed-off-by: Alexander Scheel <[email protected]>

* Add test case on CRL removal

Signed-off-by: Alexander Scheel <[email protected]>

* Add changelog entry

Signed-off-by: Alexander Scheel <[email protected]>

---------

Signed-off-by: Alexander Scheel <[email protected]>
Co-authored-by: Steven Clark <[email protected]>

* UI: Handle control group error on SSH (#23025)
* Handle control group error on SSH
* Add changelog
* Fix enterprise failure of TestCRLIssuerRemoval (#23038)

This fixes the enterprise failure of the test

```
=== FAIL: builtin/logical/pki TestCRLIssuerRemoval (0.00s)
    crl_test.go:1456:
        Error Trace: /home/runner/actions-runner/_work/vault-enterprise/vault-enterprise/builtin/logical/pki/crl_test.go:1456
        Error: Received unexpected error: Global, cross-cluster revocation queue cannot be enabled when auto rebuilding is disabled as the local cluster may not have the certificate entry!
        Test: TestCRLIssuerRemoval
        Messages: failed enabling unified CRLs on enterprise
```

* fix LDAP auto auth changelog (#23027)
* VAULT-19233 First part of caching static secrets work
* VAULT-19233 update godoc
* VAULT-19233 invalidate cache on non-GET
* VAULT-19233 add locking to proxy cache writes
* VAULT-19233 add caching of capabilities map, and some additional test coverage
* VAULT-19233 Additional testing
* VAULT-19233 namespaces for cache ids
* VAULT-19233 cache-clear testing and implementation
* VAULT-19233 adjust format, add more tests
* VAULT-19233 some more docs
* VAULT-19233 Add RLock holding for map access
* VAULT-19233 PR comments
* VAULT-19233 Different table for capabilities indexes
* VAULT-19233 keep unique for request path
* VAULT-19233 passthrough for non-v1 requests
* VAULT-19233 some renames/PR comment updates
* VAULT-19233 remove type from capabilities index
* VAULT-19233 remove obsolete capabilities
* VAULT-19233 remove erroneous capabilities
* VAULT-19233 woops, missed a test
* VAULT-19233 typo
* VAULT-19233 add custom error for cachememdb
* VAULT-19233 fix cachememdb test

---------

Signed-off-by: Alexander Scheel <[email protected]>
Co-authored-by: Chris Capurso <[email protected]>
Co-authored-by: Andreas Gruhler <[email protected]>
Co-authored-by: Alexander Scheel <[email protected]>
Co-authored-by: Steven Clark <[email protected]>
Co-authored-by: Chelsea Shaw <[email protected]>
Describe the bug
We are running Consul Connect, with Vault in use as the Connect CA. DynamoDB is used for Vault storage. When Consul switches leadership it does a "CA initialization" routine, which generates a new PKI issuer each time. Over time the number of issuers grows, and eventually it reaches a point where the CA initialization fails, because the CRL rebuild step tries saving a CRL config entry so large that it exceeds the DynamoDB max record size, and Consul CA initialization does not complete on the Consul side.
This is the error that Consul reports, which includes the embedded error strings from Vault:
Based on the specific error strings included in the log, the problem is occurring when an internalCRLConfigEntry is written to DynamoDB:
vault/builtin/logical/pki/storage.go
Lines 189 to 197 in 2e30ad5
After analysis of the entries in the DynamoDB table, at the path logical/<secret engine UUID>/crls there is an item with the key config, where this config entry is stored. When the CRL build occurs, it cannot save the entry to DynamoDB because the entry is now larger than DynamoDB's maximum item size of 400 KB.
To Reproduce
Steps to reproduce the behavior:
/v1/<pki name>/issuer/<UUID>/
grows
We manually deleted a number of the issuers from DynamoDB in order to get past the condition and get Consul functioning again, as there did not seem to be any operator commands that would help (tidy did not help because the issuers were not expired yet).
Expected behavior
We would like to know:
Environment:
Vault Server Version (retrieve with vault status): 1.14.1
Vault CLI Version (retrieve with vault version): v1.12.4
Vault server configuration file(s):
Additional context
It seems that when manually deleting old issuers, the amount of space freed in the CRL config entry (~78 bytes) is less than the amount of space consumed in the storage when a new issuer is added (~240 bytes). So even if you delete old issuers regularly the size of the CRL config will still continue to grow and may eventually run into this problem.
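To put rough numbers on this, here is a small worked calculation that uses only the approximate figures reported above (~240 bytes added per new issuer, ~78 bytes freed per deleted issuer). It is a sketch under stated assumptions: it treats the 400KB limit as 400 × 1024 bytes, ignores any fixed overhead in the entry, and the function names are hypothetical.

```python
ADDED_BYTES_PER_ISSUER = 240   # approximate growth per new issuer (from the report)
FREED_BYTES_PER_DELETE = 78    # approximate space reclaimed per deletion (from the report)
LIMIT_BYTES = 400 * 1024       # assumption: "400KB" read as 400 KiB


def issuers_to_exceed(limit=LIMIT_BYTES, added=ADDED_BYTES_PER_ISSUER):
    """Smallest issuer count whose entry exceeds the limit, with no deletions."""
    return limit // added + 1


def net_growth_per_rotation(added=ADDED_BYTES_PER_ISSUER, freed=FREED_BYTES_PER_DELETE):
    """Net bytes gained when one issuer is added and one old issuer is deleted."""
    return added - freed


def rotations_to_exceed(start_bytes=0, limit=LIMIT_BYTES):
    """Add-one/delete-one rotations until the entry exceeds the limit."""
    headroom = limit - start_bytes
    if headroom < 0:
        return 0  # already over the limit
    return headroom // net_growth_per_rotation() + 1


if __name__ == "__main__":
    print("issuers to exceed limit (no deletions):", issuers_to_exceed())
    print("net growth per rotation (bytes):", net_growth_per_rotation())
    print("rotations to exceed limit (with deletions):", rotations_to_exceed())
    print("estimated bytes for 1850 issuers:", 1850 * ADDED_BYTES_PER_ISSUER)
```

Under these assumptions the entry outgrows the limit after roughly 1,707 issuers with no deletions, or after roughly 2,529 add-one/delete-one rotations starting from an empty entry, so regular manual deletion only delays the failure. The 1850 issuers mentioned in the comments would come to about 444,000 bytes at this rate, which is in the same range as the limit.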
When Consul Connect is unable to complete its "CA initialization" because of the CRL max record size issue, it tries to reinitialize repeatedly. Each of these attempts generates a new issuer in Vault, while Consul Connect itself does not function as long as it encounters the error. The repeated creation of issuers that Consul never uses compounds the problem and requires much more manual cleanup of the issuers that were created while Connect was in a non-functioning state.