tls: Fix background certificate renewals that use TLS-SNI challenge #1366
The loop which performs renewals in the background obtains a read lock
on the certificate cache map so that it can be safely iterated. Before
this fix, it performed the renewals while still holding that read lock.
That has been fine, except that the TLS-SNI challenge, when invoked after
Caddy has already started, requires adding a certificate to the cache,
which requires an exclusive write lock. The write lock can never be
obtained, because the read lock is still held higher in the stack while
the loop iterates. In other words, it's a deadlock.
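Here is a minimal sketch of the pattern (not Caddy's actual code; `certCache`, `cacheMu`, `cacheCertificate`, and `renewManagedCertificates` are hypothetical names used only to illustrate the lock ordering):

```go
package main

import "sync"

type certificate struct{ needsRenewal bool }

var (
	cacheMu   sync.RWMutex
	certCache = map[string]certificate{}
)

// cacheCertificate mimics what the TLS-SNI solver must do after Caddy has
// started: insert the challenge certificate, which needs the write lock.
func cacheCertificate(name string, cert certificate) {
	cacheMu.Lock() // blocks forever: the renewal loop still holds the read lock
	defer cacheMu.Unlock()
	certCache[name] = cert
}

// renewManagedCertificates shows the pre-fix pattern: renewing while the
// read lock is held. If the renewal path ends up calling cacheCertificate
// (as the TLS-SNI challenge does), the write lock can never be acquired.
func renewManagedCertificates() {
	cacheMu.RLock()
	defer cacheMu.RUnlock()
	for name, cert := range certCache {
		if cert.needsRenewal {
			// renewal -> TLS-SNI challenge -> cacheCertificate(...)
			cacheCertificate(name+".challenge", certificate{})
		}
	}
}

func main() {
	certCache["example.com"] = certificate{needsRenewal: true}
	// Never returns; in this minimal program the Go runtime reports
	// "fatal error: all goroutines are asleep - deadlock!"
	renewManagedCertificates()
}
```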
I was able to reproduce this issue consistently locally, after jumping
through many hoops to force a renewal in a short time that bypasses
Let's Encrypt's authz caching. I was also able to verify that by queuing
renewals (as we already do for deletions and OCSP updates), lock
contention is relieved and the deadlock is avoided.
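Continuing the sketch above (same hypothetical names), the queued approach only uses the read lock to decide which certificates need renewal, releases it, and then performs the renewals with no lock held, so the TLS-SNI solver is free to take the write lock:

```go
// renewManagedCertificatesQueued sketches the fix: collect the names of
// certificates needing renewal under the read lock, release the lock, then
// process the queue. cacheCertificate's Lock() can now succeed.
func renewManagedCertificatesQueued() {
	var queue []string

	cacheMu.RLock()
	for name, cert := range certCache {
		if cert.needsRenewal {
			queue = append(queue, name)
		}
	}
	cacheMu.RUnlock()

	// Renewals happen outside the lock, so the TLS-SNI challenge can
	// safely add its challenge certificate to the cache.
	for _, name := range queue {
		cacheCertificate(name+".challenge", certificate{})
	}
}
```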
This only affects background renewals where the TLS-SNI(-01) challenge
is used. Users report seeing strange errors in the logs after this
happens ("tls: client offered an unsupported, maximum protocol version
of 301"), but I was not able to reproduce these locally. I was also not
able to reproduce the leak of sockets left in CLOSE_WAIT. I am not sure
whether those symptoms only show up when running in production on Linux,
or whether they are related to this bug at all.
Either way, this is an important fix. I do not yet know the ripple
effects this will have on other symptoms we've been chasing. But it
definitely resolves a deadlock during renewals.