-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add MustRevalidate flag to connect_ca_leaf cache type; always use on non-blocking queries #11693
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Some thoughts/suggestions
agent/agent_endpoint.go
Outdated
// We don't want non-blocking queries to return expired leaf certs | ||
// or leaf certs not valid under the current CA. So always revalidate | ||
// the leaf cert on non-blocking queries | ||
if args.MinQueryIndex == 0 { | ||
args.MustRevalidate = true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this logic might be better in ConnectCALeafRequest.CacheInfo()
. I believe that method should have access to all the same data.
By moving it there we can be sure that all users of the cache do the right thing, instead of only this one HTTP API endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm going to walk this back.
Here's what I ran into -- there's a non trivial amount of ConnectCALeaf testing that would need to change as a result of always passing in MustRevalidate
assignment logic in the CacheInfo(). It kinda feels like this behavior is "breaking" with previous one. It seems, ergonomically, this func was designed more as a "getter".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, that is unfortunate. I think those tests must not be accurately testing the production behaviour very well, but that is cleanup we can leave for another time I think. Thanks for trying.
Maybe leave a TODO here instead? Not critical
IPSAN []net.IP | ||
MinQueryIndex uint64 | ||
MaxQueryTime time.Duration | ||
MustRevalidate bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we move the logic into CacheInfo
(suggested in the other comment) then we may be able to avoid adding this field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed walking back this change here --> #11693 (comment)
agent/cache-types/connect_ca_leaf.go
Outdated
// If we called Fetch() with MustRevalidate then this call came from a non-blocking query. | ||
// To pickup any CA rotations that may have happened prior to this call, we will reduce the timeout | ||
// to ~ 1 sec. Otherwise, we risk non-blocking queries taking ~ 10 minutes to wait on a CA rotation | ||
// that could never happen. | ||
// see https://github.com/hashicorp/consul/issues/10871, https://github.com/hashicorp/consul/issues/9862 | ||
// This is a hack, so, ideally, we can revisit the caching/ fetching logic in this case | ||
if req.CacheInfo().MustRevalidate { | ||
timeoutCh = time.After(time.Second) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works, but I also wonder if we could make the non-blocking case a bit more obvious by avoiding the select entirely.
What if we were to:
- move this
if
above where we createrootUpdateCh
around line 382 - remove the second
if
below (we'll handle it in this same if block) - instad of
timeoutCh = time.After(time.Second)
we do something like this:
roots, err := c.rootsFromCache()
if err != nil {
return lastResultWithNewState(), err
}
if activeRootHasKey(roots, state.authorityKeyID) {
return lastResultWithNewState(), err
}
... // some shared logic here to calculate expiresAt
if state.forceExpireAfter.Before(expiresAt) {
return c.generateNewLeaf(reqReal, lastResultWithNewState())
}
return lastResultWithNewState(), err
The ...
about calculating expiresAt
could be extracted from the 461-470 into a new function and called from both places.
I haven't fully gone through that to be sure that works, but my impression is that something like that would work and allow us to avoid the select
which makes it a bit more obvious that we are handling a non-blocking request, without having to set a short timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey thanks for bringing this up again -- I recall we chatted about this approach and dropped the ball as I was debugging the "vault issue". 💯
I think it makes a lot of sense (readability, maintainability) to treat the MustRevalidate
/ non-blocking case as "separate" from the select stmt.
Immediately after we check :
consul/agent/cache-types/connect_ca_leaf.go
Lines 378 to 381 in ed8a901
if expiresAt.Equal(now) || expiresAt.Before(now) { | |
// Already expired, just make a new one right away | |
return c.generateNewLeaf(reqReal, lastResultWithNewState()) | |
} |
If we add something like ~:
if req.CacheInfo().MustRevalidate {
roots, err := c.rootsFromCache()
if err != nil {
return lastResultWithNewState(), err
}
if activeRootHasKey(roots, state.authorityKeyID) {
return lastResultWithNewState(), nil
}
// if we reach here then the current leaf was not signed by the same CAs, just regen
return c.generateNewLeaf(reqReal, lastResultWithNewState())
}
That will catch most occurrences of #10871. Sometimes, the CA may take a bit to be rotated but that's ok. "Certainty" / " strong consistency" is what the blocking queries are for.
Also note, that we don't need to "calculate" when to regen the cert. If for whatever reason we don't have a CA that signed the leaf, we need to regen the leaf cert.
Wdyt?
c1c8f4a
to
089fb61
Compare
089fb61
to
c2090e4
Compare
Signed-off-by: FFMMM <[email protected]>
c2090e4
to
60b7dcb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, LGTM!
One small suggestion for the loop in the test. I think it might be worth to look into moving the MustRevalidate
into the CacheInfo stuff, but that is not urgent, so would be fine as a TODO I think.
Co-authored-by: Daniel Nephin <[email protected]>
Signed-off-by: FFMMM <[email protected]>
215b7c7
to
f7081c1
Compare
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/513605. |
…non-blocking queries (#11693) * always use MustRevalidate on non-blocking queries for connect ca leaf Signed-off-by: FFMMM <[email protected]> * Update agent/agent_endpoint_test.go Co-authored-by: Daniel Nephin <[email protected]> * pr feedback Signed-off-by: FFMMM <[email protected]> Co-authored-by: Daniel Nephin <[email protected]>
…non-blocking queries (#11693) * always use MustRevalidate on non-blocking queries for connect ca leaf Signed-off-by: FFMMM <[email protected]> * Update agent/agent_endpoint_test.go Co-authored-by: Daniel Nephin <[email protected]> * pr feedback Signed-off-by: FFMMM <[email protected]> Co-authored-by: Daniel Nephin <[email protected]>
…non-blocking queries (#11693) (#11714) * always use MustRevalidate on non-blocking queries for connect ca leaf Signed-off-by: FFMMM <[email protected]> * Update agent/agent_endpoint_test.go Co-authored-by: Daniel Nephin <[email protected]> * pr feedback Signed-off-by: FFMMM <[email protected]> Co-authored-by: Daniel Nephin <[email protected]> Co-authored-by: Daniel Nephin <[email protected]>
…non-blocking queries (#11693) (#11715) * always use MustRevalidate on non-blocking queries for connect ca leaf Signed-off-by: FFMMM <[email protected]> * Update agent/agent_endpoint_test.go Co-authored-by: Daniel Nephin <[email protected]> * pr feedback Signed-off-by: FFMMM <[email protected]> Co-authored-by: Daniel Nephin <[email protected]> Co-authored-by: Daniel Nephin <[email protected]>
…king leaf cert query to block (#12820) Fixes #12048 Fixes #12319 Regression introduced in #11693 Local reproduction steps: 1. `consul agent -dev` 2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'` 3. make note of the `X-Consul-Index` header returned 4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'` 5. Kill the above curl when it hangs with Ctrl-C 6. Repeat (2) and it should not hang.
…king leaf cert query to block (#12820) Fixes #12048 Fixes #12319 Regression introduced in #11693 Local reproduction steps: 1. `consul agent -dev` 2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'` 3. make note of the `X-Consul-Index` header returned 4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'` 5. Kill the above curl when it hangs with Ctrl-C 6. Repeat (2) and it should not hang.
…king leaf cert query to block (#12820) Fixes #12048 Fixes #12319 Regression introduced in #11693 Local reproduction steps: 1. `consul agent -dev` 2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'` 3. make note of the `X-Consul-Index` header returned 4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'` 5. Kill the above curl when it hangs with Ctrl-C 6. Repeat (2) and it should not hang.
…king leaf cert query to block (#12820) Fixes #12048 Fixes #12319 Regression introduced in #11693 Local reproduction steps: 1. `consul agent -dev` 2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'` 3. make note of the `X-Consul-Index` header returned 4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'` 5. Kill the above curl when it hangs with Ctrl-C 6. Repeat (2) and it should not hang.
Overview
PR addresses #10871 / #9862 .
Today, once we HIT/ load up a leaf cert in the agent cache, we can return expired certs or certs signed by a previously rotated CA.
These changes introduces a
MustRevalidate
flag to theconnect_ca_leaf
cache type that we always use on non-blocking queries.Notes
Always setting
MustRevalidate
on non blocking queries means we always "MISS" the cache on non blocking queries. In effect, the headers are now redundant.PR checklist
gofmt
[ ] docs update--> I want to pull this into a separate PR bc I want to change some of the docs wording