Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRL config entries cannot be saved due to large size #22980

Closed
jbrandhorst opened this issue Sep 11, 2023 · 7 comments
Closed

CRL config entries cannot be saved due to large size #22980

jbrandhorst opened this issue Sep 11, 2023 · 7 comments
Labels
bug Used to indicate a potential bug secret/pki

Comments

@jbrandhorst
Copy link

Describe the bug
We are running Consul Connect, with Vault in use as the Connect CA. DynamoDB is used for Vault storage. When Consul switches leadership it does a "CA initialization" routine, which generates a new PKI issuer each time. Over time the number of issuers grows, and eventually it reaches a point where the CA initialization fails, because the CRL rebuild step tries saving a CRL config entry so large that it exceeds the DynamoDB max record size, and Consul CA initialization does not complete on the Consul side.

This is the error that Consul reports, which includes the embedded error strings from Vault:

"level":"error"

"message":"Failed to initialize Connect CA","module":"connect.ca","error":"error generating intermediate cert: Error making API request.URL: PUT https://vault/v1/<pki name>/intermediate/set-signed Code: 500. 
Errors:* 1 error occurred:* Rebuilding the CRL failed. While this is indicative of a problem with the imported issuers (perhaps because of their revocation_signature_algorithm), they did import successfully and are now usable. 
It is strongly suggested to fix the CRL building errors before continuing. 
The original error is reproduced below:error building CRLs: unable to persist updated cluster-local CRL config: ValidationException: Item size has exceeded the maximum allowed size status code: 400"

"routine":"CA initialization"

Based on the specific error strings included in the log, the problem is occurring when an internalCRLConfigEntry is written to DynamoDB:

type internalCRLConfigEntry struct {
IssuerIDCRLMap map[issuerID]crlID `json:"issuer_id_crl_map"`
CRLNumberMap map[crlID]int64 `json:"crl_number_map"`
LastCompleteNumberMap map[crlID]int64 `json:"last_complete_number_map"`
CRLExpirationMap map[crlID]time.Time `json:"crl_expiration_map"`
LastModified time.Time `json:"last_modified"`
DeltaLastModified time.Time `json:"delta_last_modified"`
UseGlobalQueue bool `json:"cross_cluster_revocation"`
}

After analysis of the entries in the DynamoDB table, at the path logical/<secret engine UUID>/crls there is an item with a key config where this config entry is stored. When the CRL build occurs it cannot save the entry to DynamoDB because the size is now larger than DynamoDB's max field size of 400KB.

To Reproduce
Steps to reproduce the behavior:

  1. Set up Consul Connect with Vault as the CA
  2. Let a number of Consul leadership transitions occur over time
  3. The number of issuers in Vault under the path /v1/<pki name>/issuer/<UUID>/ grows
  4. Eventually a leadership transition in Consul occurs and the error above will occur because the CRL config cannot be saved when Consul Connect does its ca initialization.

We manually deleted a number of the issuers from DynamoDB in order to get past the condition and get Consul functioning again, as there did not seem to be any operator commands that would help (tidy did not help because the issuers were not expired yet)

Expected behavior
We would like to know:

  1. Our CRLs are actually empty, can we simply disable CRL builds to avoid running into this problem?
  2. Manually deleting issuer records from DynamoDB seems like a less-than-ideal remediation step, are there any vault operator commands that can address this type of issue?
  3. We intend to modify the TTL of the Consul Connect CA intermediate cert to hopefully cut down the number of active issuers, it is currently still set as the Consul default of 1 year (8760h). Are there any other configuration settings we should look into that can reduce the size of the CRL config entry that is stored in DynamoDB?

Environment:

Vault server configuration file(s):

(let me know what specific hcl is helpful here and I can edit the original post)

Additional context

  1. It seems that when manually deleting old issuers, the amount of space freed in the CRL config entry (~78 bytes) is less than the amount of space consumed in the storage when a new issuer is added (~240 bytes). So even if you delete old issuers regularly the size of the CRL config will still continue to grow and may eventually run into this problem.

  2. When Consul Connect is unable complete it's "CA Initialization" when encountering the CRL max record size issue it tries to reinitialize repeatedly. These attempts to initialize actually generates new issuers in Vault each time, but meanwhile Consul Connect does not actually function when it encounters the error. The repeated creation of more issuers that are unused by Consul compounds the problem, and then requires a lot more manual cleanup of the issuers that were created by Consul when Connect was in a non-functioning state.

@jkirschner-hashicorp
Copy link
Contributor

Hi @jbrandhorst,

I just re-opened a related issue on the Consul repo that your post suggests was not fully closed.

For context, how many issuers were within Vault's /v1/<pki name> when you hit the reported error from Consul?

@jbrandhorst
Copy link
Author

Hi @jbrandhorst,

I just re-opened a related issue on the Consul repo that your post suggests was not fully closed.

For context, how many issuers were within Vault's /v1/<pki name> when you hit the reported error from Consul?

Thanks @jkirschner-hashicorp - there were 1850 issuers

@cipherboy cipherboy added secret/pki bug Used to indicate a potential bug labels Sep 12, 2023
cipherboy added a commit that referenced this issue Sep 12, 2023
When a issuer is removed, the space utilized by its CRL was not freed,
both from the CRL config mapping issuer IDs to CRL IDs and from the
CRL storage entry. We thus implement a two step cleanup, wherein
orphaned CRL IDs are removed from the config and any remaining full
CRL entries are removed from disk.

This relates to a Consul<->Vault interop issue (#22980), wherein Consul
creates a new issuer on every leadership election, causing this config
to grow. Deleting issuers manually does not entirely solve this problem
as the config does not fully reclaim space used in this entry.

Notably, an observation that when deleting issuers, the CRL was rebuilt
on secondary clusters (due to the invalidation not caring about type of
the operation); for consistency and to clean up the unified CRLs, we
also need to run the rebuild on the active primary cluster that deleted
the issuer as well.

This approach does allow cleanup on existing impacted clusters by simply
rebuilding the CRL.

Co-authored-by: Steven Clark <[email protected]>
Signed-off-by: Alexander Scheel <[email protected]>
@cipherboy
Copy link
Contributor

@jbrandhorst Many thanks for filing this, definitely agree that (1) above is a bug we want to address on the Vault side; that will be fixed in #23007.

cipherboy added a commit that referenced this issue Sep 12, 2023
* Clean up unused CRL entries when issuer is removed

When a issuer is removed, the space utilized by its CRL was not freed,
both from the CRL config mapping issuer IDs to CRL IDs and from the
CRL storage entry. We thus implement a two step cleanup, wherein
orphaned CRL IDs are removed from the config and any remaining full
CRL entries are removed from disk.

This relates to a Consul<->Vault interop issue (#22980), wherein Consul
creates a new issuer on every leadership election, causing this config
to grow. Deleting issuers manually does not entirely solve this problem
as the config does not fully reclaim space used in this entry.

Notably, an observation that when deleting issuers, the CRL was rebuilt
on secondary clusters (due to the invalidation not caring about type of
the operation); for consistency and to clean up the unified CRLs, we
also need to run the rebuild on the active primary cluster that deleted
the issuer as well.

This approach does allow cleanup on existing impacted clusters by simply
rebuilding the CRL.

Co-authored-by: Steven Clark <[email protected]>
Signed-off-by: Alexander Scheel <[email protected]>

* Add test case on CRL removal

Signed-off-by: Alexander Scheel <[email protected]>

* Add changelog entry

Signed-off-by: Alexander Scheel <[email protected]>

---------

Signed-off-by: Alexander Scheel <[email protected]>
Co-authored-by: Steven Clark <[email protected]>
@jkirschner-hashicorp
Copy link
Contributor

PR in progress on the Consul side: hashicorp/consul#18773

@ksmiley
Copy link

ksmiley commented Sep 13, 2023

I noticed there's an EncodeJSONAndCompress that seems to be a drop-in replacement for for the uncompressed EncodeJSON that this record uses currently. Could that be used here? There's a decent amount of repetition in the CRL config JSON blob (e.g. each CRL identifier is repeated four times), so it should compress well. It's not exactly a fix, but it would at least increase the number of issuers that it takes to hit the storage limit.

@cipherboy
Copy link
Contributor

cipherboy commented Sep 13, 2023

\o hey @ksmiley

Interesting thought... In general, simple managed rotation like this without cross signing is the easy case to trigger this. As the number of issuers grow, the automated chain building that occurs when issuers are added will be much slower than storage limitations... I think O(n log n) in the average case but IIRC O(n^3 log n) in the worst case for issuers with lots of small-ish cliques? If Consul instead switched to cross-signing, this would also slow down chain building and likely run into context deadline exceeded errors, rather than failure to write to storage due to size limitations... So I'd probably avoid having hundres or thousands of issuers within a single mount and stick to tens or at most low hundreds of issuers anyways (also, CRL building & certificate->issuer attribution gets slower with this many issuers as well, except if there are no revocations but even that requires writing & updating empty CRLs).

(Also, as an aside, the permissions model also gets a bit whacky IMO... Doable, but hard to manage).

So you're absolutely right that it could help with this case, but I think you might not want a thousand or more in the mount anyways :-)

VioletHynes pushed a commit that referenced this issue Sep 19, 2023
* Clean up unused CRL entries when issuer is removed

When a issuer is removed, the space utilized by its CRL was not freed,
both from the CRL config mapping issuer IDs to CRL IDs and from the
CRL storage entry. We thus implement a two step cleanup, wherein
orphaned CRL IDs are removed from the config and any remaining full
CRL entries are removed from disk.

This relates to a Consul<->Vault interop issue (#22980), wherein Consul
creates a new issuer on every leadership election, causing this config
to grow. Deleting issuers manually does not entirely solve this problem
as the config does not fully reclaim space used in this entry.

Notably, an observation that when deleting issuers, the CRL was rebuilt
on secondary clusters (due to the invalidation not caring about type of
the operation); for consistency and to clean up the unified CRLs, we
also need to run the rebuild on the active primary cluster that deleted
the issuer as well.

This approach does allow cleanup on existing impacted clusters by simply
rebuilding the CRL.

Co-authored-by: Steven Clark <[email protected]>
Signed-off-by: Alexander Scheel <[email protected]>

* Add test case on CRL removal

Signed-off-by: Alexander Scheel <[email protected]>

* Add changelog entry

Signed-off-by: Alexander Scheel <[email protected]>

---------

Signed-off-by: Alexander Scheel <[email protected]>
Co-authored-by: Steven Clark <[email protected]>
@cipherboy
Copy link
Contributor

This seems to have been addressed on both sides and will be in the next set of releases. Closing, thanks all! :-)

VioletHynes added a commit that referenced this issue Oct 6, 2023
* VAULT-19237 Add mount_type to secret response

* VAULT-19237 changelog

* VAULT-19237 make MountType generic

* VAULT-19237 clean up comment

* VAULT-19237 update changelog

* VAULT-19237 update test, remove mounttype from wrapped responses

* VAULT-19237 fix a lot of tests

* VAULT-19237 standby test

* ensure -log-level is added to core config (#23017)

* Feature/document tls servername (#22714)

* Add Raft TLS Helm examples

Co-authored-by: Pascal Reeb <[email protected]>
---------

* Clean up unused CRL entries when issuer is removed (#23007)

* Clean up unused CRL entries when issuer is removed

When a issuer is removed, the space utilized by its CRL was not freed,
both from the CRL config mapping issuer IDs to CRL IDs and from the
CRL storage entry. We thus implement a two step cleanup, wherein
orphaned CRL IDs are removed from the config and any remaining full
CRL entries are removed from disk.

This relates to a Consul<->Vault interop issue (#22980), wherein Consul
creates a new issuer on every leadership election, causing this config
to grow. Deleting issuers manually does not entirely solve this problem
as the config does not fully reclaim space used in this entry.

Notably, an observation that when deleting issuers, the CRL was rebuilt
on secondary clusters (due to the invalidation not caring about type of
the operation); for consistency and to clean up the unified CRLs, we
also need to run the rebuild on the active primary cluster that deleted
the issuer as well.

This approach does allow cleanup on existing impacted clusters by simply
rebuilding the CRL.

Co-authored-by: Steven Clark <[email protected]>
Signed-off-by: Alexander Scheel <[email protected]>

* Add test case on CRL removal

Signed-off-by: Alexander Scheel <[email protected]>

* Add changelog entry

Signed-off-by: Alexander Scheel <[email protected]>

---------

Signed-off-by: Alexander Scheel <[email protected]>
Co-authored-by: Steven Clark <[email protected]>

* UI: Handle control group error on SSH (#23025)

* Handle control group error on SSH

* Add changelog

* Fix enterprise failure of TestCRLIssuerRemoval (#23038)

This fixes the enterprise failure of the test
 ```
  === FAIL: builtin/logical/pki TestCRLIssuerRemoval (0.00s)
     crl_test.go:1456:
         	Error Trace:	/home/runner/actions-runner/_work/vault-enterprise/vault-enterprise/builtin/logical/pki/crl_test.go:1456
         	Error:      	Received unexpected error:
         	            	Global, cross-cluster revocation queue cannot be enabled when auto rebuilding is disabled as the local cluster may not have the certificate entry!
         	Test:       	TestCRLIssuerRemoval
         	Messages:   	failed enabling unified CRLs on enterprise

 ```

* fix LDAP auto auth changelog (#23027)

* VAULT-19233 First part of caching static secrets work

* VAULT-19233 update godoc

* VAULT-19233 invalidate cache on non-GET

* VAULT-19233 add locking to proxy cache writes

* VAULT-19233 add caching of capabilities map, and some additional test coverage

* VAULT-19233 Additional testing

* VAULT-19233 namespaces for cache ids

* VAULT-19233 cache-clear testing and implementation

* VAULT-19233 adjust format, add more tests

* VAULT-19233 some more docs

* VAULT-19233 Add RLock holding for map access

* VAULT-19233 PR comments

* VAULT-19233 Different table for capabilities indexes

* VAULT-19233 keep unique for request path

* VAULT-19233 passthrough for non-v1 requests

* VAULT-19233 some renames/PR comment updates

* VAULT-19233 remove type from capabilities index

* VAULT-19233 remove obsolete capabilities

* VAULT-19233 remove erroneous capabilities

* VAULT-19233 woops, missed a test

* VAULT-19233 typo

* VAULT-19233 add custom error for cachememdb

* VAULT-19233 fix cachememdb test

---------

Signed-off-by: Alexander Scheel <[email protected]>
Co-authored-by: Chris Capurso <[email protected]>
Co-authored-by: Andreas Gruhler <[email protected]>
Co-authored-by: Alexander Scheel <[email protected]>
Co-authored-by: Steven Clark <[email protected]>
Co-authored-by: Chelsea Shaw <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Used to indicate a potential bug secret/pki
Projects
None yet
Development

No branches or pull requests

4 participants