
Fixes #4421: General solution to stop blocking queries with index 0 #4437

Merged: 7 commits from connect-disabled-cpu-burn into master, Jul 25, 2018

Conversation

banks (Member) commented Jul 24, 2018

When developing Connect we hit a few edge cases where blocking queries would return an X-Consul-Index of 0 along with a blank result.

The root cause is that some of our state functions (for example, fetching the CARoots) have no state to load at all in some cases, so they have no index value to read and end up returning meta.Index = 0.

In general, our blocking query code assumes that no well-behaved blocking query ever behaves this way, and in some places where it's possible (e.g. KV fetches for non-existent keys) we have code to explicitly account for it. Some places account for it as above in the RPC endpoint code, some account for it in the actual state store transactions, but many still don't, relying on semantics or on the plain unlikeliness of a given situation to never hit it.

During Connect development we discussed fixing this once and for all within the Server.blockingQuery helper that all blocking queries use. We hypothesised this fix and were relatively sure it couldn't cause any damage, since it addresses a fundamental assumption of the blocking query system and the only change in behaviour seems to be that it prevents a tight busy loop that eats 100% CPU. But ultimately we didn't want to take the risk of the more general change that would affect non-Connect endpoints.

Our solution, which we thought covered all Connect APIs, was to add the logic of using at least 1 as the index in the agent's cache fetcher for Connect data.

However, #4421 is a subtle edge case we missed even in Connect endpoints.

The cause there is that the CA RPC response returns an empty result with a 0 index, and the client dutifully upgrades that index to 1 for the next request. But the subsequent blocking request still gets a 0 index back from state.CARoots, because there is still no CA config with Connect disabled. The blockingQuery helper used to check that the returned meta index was at least 1 before blocking, as a proxy for "should blocking be enabled", so it never blocked in this case and caused the client agent to consume 100% CPU again.

General fix

The actual fix here is just to switch the zero check from queryMeta.Index > 0 (the value returned from the state method) to queryOpts.MinQueryIndex > 0 (the value the client requested). As far as I understand, this is better semantics, since we already explicitly check for a nil err from the method (a sketch of the resulting behaviour follows the rationale list below).

My rationale is:

  • we should block if the client requested it and the method's index is lower, even if the state method returns 0
  • yes, it's arguably buggy for a state method to ever return a 0 index with a nil err, but it happens and probably will again the next time we add state methods or use existing ones in new ways where it starts to matter
  • it's never correct for a non-error response to have a 0 index, since it will cause clients to busy-loop without blocking
  • it's therefore reasonable to ensure systematically that no (non-error) RPC response is ever sent with a 0 index, and that even if the state method responds with a zero index, we still block if requested to until some state is written to deliver back
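
A minimal, runnable sketch of both halves of this, using simplified stand-ins for Consul's QueryOptions and QueryMeta; the real Server.blockingQuery helper wires this into watch-set and timeout plumbing, so treat the names below as illustrative rather than the PR's literal diff:

package main

import "fmt"

// Simplified stand-ins for the real structs.QueryOptions / structs.QueryMeta.
type QueryOptions struct {
	MinQueryIndex uint64 // index the client asked to block past; 0 means no blocking
}

type QueryMeta struct {
	Index uint64 // index the state method reported alongside the result
}

// shouldBlock captures the core of the fix: the decision to block keys off
// what the client requested, not off the index the state method returned.
func shouldBlock(opts QueryOptions, meta QueryMeta) bool {
	// Old check: meta.Index > 0 && meta.Index <= opts.MinQueryIndex
	// A state method with nothing to load returns Index == 0, so the old
	// check never blocked and the client busy-looped at 100% CPU.
	return opts.MinQueryIndex > 0 && meta.Index <= opts.MinQueryIndex
}

// finalizeIndex is the systematic guarantee from the last bullet: never send
// a non-error response with a 0 index, since clients treat 0 as "do not block".
func finalizeIndex(meta *QueryMeta) {
	if meta.Index < 1 {
		meta.Index = 1
	}
}

func main() {
	// The #4421 case: the client upgraded a 0 index to 1 and sent it back,
	// but state.CARoots still reports Index == 0 with Connect disabled.
	opts := QueryOptions{MinQueryIndex: 1}
	meta := QueryMeta{Index: 0}
	fmt.Println(shouldBlock(opts, meta)) // true: we now block instead of spinning
	finalizeIndex(&meta)
	fmt.Println(meta.Index) // 1: the response never carries a 0 index
}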

Tests

This is actually pretty easy to test and I think the tests added are sufficient. I was expecting to break a bunch of other tests in the repo that expected zero indexes just because that's what they used to get, but all tests pass locally, including the whole watch package and all blocking-related stuff!

UX Error message

The original poster of #4421 didn't find the message about 'CA is not bootstrapped' clear enough. By adding an explicit enabled check to the CARoots RPC, the message is much improved:

$ curl -s http://127.0.0.1:8500/v1/agent/connect/ca/leaf/foo
Connect must be enabled in order to use this endpoint

@mitchellh @kyhavlov the Roots RPC was the only one that didn't have this check already. I wonder if that was intentional for some reason; a few tests used that fact to force empty results, but with no rationale for why that behaviour was desirable. I can't think of a good reason, though, to return an empty set of roots as if it were a valid request when Connect is explicitly disabled. Returning empty if the CA is not bootstrapped yet is important for races during bootstrap, especially in dev mode, but if Connect is disabled it seems more helpful to be explicit about that.

@banks banks requested review from mitchellh, preetapan and mkeeler July 24, 2018 14:52
@@ -108,7 +108,8 @@ func (c *ConnectCALeaf) Fetch(opts cache.FetchOptions, req cache.Request) (cache
 		return result, errors.New("invalid RootCA response type")
 	}
 	if roots.TrustDomain == "" {
-		return result, errors.New("cluster has no CA bootstrapped")
+		return result, errors.New("cluster has no CA bootstrapped: ensure Connect" +
+			" is enabled in server config")
banks (Member Author):

This is my attempt to make it more obvious to users what the likely cause is. We can't be more explicit, since it's possible to hit this just due to timing before the CA bootstrap happens, especially in dev setups where CA bootstrap happens on every server restart and proxies might already be hitting the APIs before it does.

-func (c *Cache) entryKey(r *RequestInfo) string {
-	return fmt.Sprintf("%s/%s/%s", r.Datacenter, r.Token, r.Key)
+func (c *Cache) entryKey(t string, r *RequestInfo) string {
+	return fmt.Sprintf("%s/%s/%s/%s", t, r.Datacenter, r.Token, r.Key)
banks (Member Author):

This is the fix for another hypothetical bug I found along the way. It was well discussed with Mitchell on Slack, and the test case below highlights why it might be needed one day, but it can be ignored for the purposes of reviewing the blockingQuery fix. A hedged illustration of the collision the new type segment prevents follows.
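
A self-contained illustration of that hypothetical collision (the type names and request values here are invented for the example; the real keys are built inside the agent cache):

package main

import "fmt"

// RequestInfo mirrors the fields used by the cache's entryKey.
type RequestInfo struct {
	Datacenter, Token, Key string
}

// oldKey ignores the cache type, so two different cache types fetching with
// identical request info share one entry and could serve each other's results.
func oldKey(r RequestInfo) string {
	return fmt.Sprintf("%s/%s/%s", r.Datacenter, r.Token, r.Key)
}

// newKey prefixes the cache type t as the first, coarsest segment.
func newKey(t string, r RequestInfo) string {
	return fmt.Sprintf("%s/%s/%s/%s", t, r.Datacenter, r.Token, r.Key)
}

func main() {
	r := RequestInfo{Datacenter: "dc1", Token: "token123", Key: "web"}
	fmt.Println(oldKey(r))           // dc1/token123/web (same key regardless of type)
	fmt.Println(newKey("type-a", r)) // type-a/dc1/token123/web
	fmt.Println(newKey("type-b", r)) // type-b/dc1/token123/web (no collision now)
}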

mitchellh (Contributor):

Total nitpick but should the type be the prefix here? We don't do any cache key searching or lookups or anything fancy but least specificity => most usually has benefits. Super hypothetical but feels better.

banks (Member Author) replied Jul 24, 2018:

Hmm, the type is the prefix here. It seemed to me that type is the most coarse-grained partitioning we'd want (cache results from different operation types are entirely separate data sets, whereas splits by DC, token, or key are finer grained).

So I'd agree with your high-level point, but either I'm not understanding why you think the cache type is less specific than the other elements, or there is confusion caused by the diff rendering highlighting the last /%s when really the new t value is being added as the first segment.

mitchellh (Contributor):

Oh right, the git diff just threw me off. Hah! Ignore me.

@@ -750,7 +798,8 @@ func TestCacheGet_partitionToken(t *testing.T) {
 // testPartitionType implements Type for testing that simply returns a value
 // comprised of the request DC and ACL token, used for testing cache
 // partitioning.
-type testPartitionType struct{}
+type testPartitionType struct {
+}
banks (Member Author):

Huh, I thought go fmt would clean that up, but it didn't...

@@ -108,7 +108,7 @@ func (c *ConnectCALeaf) Fetch(opts cache.FetchOptions, req cache.Request) (cache
 		return result, errors.New("invalid RootCA response type")
 	}
 	if roots.TrustDomain == "" {
-		return result, errors.New("cluster has no CA bootstrapped")
+		return result, errors.New("cluster has no CA bootstrapped yet")
banks (Member Author):

This used to cover both the case where Connect was enabled but the request raced with initial CA bootstrap, and the case where Connect is not enabled at all. Now that I've added an explicit, more descriptive error for not-enabled, the only way (tm) this can be hit is during a race, so I think adding "yet" helps to hint that the request was just a bit early and not that anything is misconfigured.

banks (Member Author) commented Jul 24, 2018:

The last change broke a few tests that explicitly expected the empty-CA-when-Connect-disabled behaviour. I'm fixing them, since I can't think of a useful reason for that behaviour, as mentioned.

mitchellh (Contributor) left a comment:

Looks great! Really minor changes requested.

The core logic in blockingQuery looks safe to me. Checking the client-sent min index is so obviously the right thing to do that it's a 🤦‍♂️ moment.


// Exit early if Connect hasn't been enabled.
if !s.srv.config.ConnectEnabled {
	return ErrConnectNotEnabled
}
mitchellh (Contributor):

This seems important enough that it should have a unit test, also easy to test.

banks (Member Author):

It does - they are all the ones that are broken right now as they asserted the old behaviour! Will be fixed soon.
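
For context, a hedged sketch of the shape such a test can take; the harness helpers and imports follow Consul's existing server test conventions, but treat the exact names and signatures here as assumptions rather than the PR's actual test code:

package consul

import (
	"os"
	"strings"
	"testing"

	"github.com/hashicorp/consul/agent/structs"
	msgpackrpc "github.com/hashicorp/net-rpc-msgpackrpc"
)

func TestConnectCARoots_disabled(t *testing.T) {
	t.Parallel()

	// Start a test server with Connect explicitly disabled; testServerWithConfig
	// and rpcClient are assumed from the existing test harness.
	dir, srv := testServerWithConfig(t, func(c *Config) {
		c.ConnectEnabled = false
	})
	defer os.RemoveAll(dir)
	defer srv.Shutdown()

	codec := rpcClient(t, srv)
	defer codec.Close()

	// The RPC should now fail loudly instead of returning an empty result
	// with a 0 index.
	args := structs.DCSpecificRequest{Datacenter: "dc1"}
	var reply structs.IndexedCARoots
	err := msgpackrpc.CallWithCodec(codec, "ConnectCA.Roots", &args, &reply)
	if err == nil || !strings.Contains(err.Error(), "Connect must be enabled") {
		t.Fatalf("expected ErrConnectNotEnabled, got %v", err)
	}
}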

banks (Member Author):

An interesting side effect of the incoming commit converting all the empty-response tests into expecting errors: there is no easy way to test that empty response any more.

It's still legitimate during a race to bootstrap the cluster, but we previously relied on disabling Connect (and it not erroring) as a side effect to provoke that case.

I think it's OK, and I'd rather not spend a lot of effort constructing a mechanism to have Connect enabled but arbitrarily delay the actual bootstrapping just so we can get an empty response, especially since we know from before this PR that it works. It feels odd that that expected mode isn't covered by a test any more, though...

Contributor:

I ran into this same thing and couldn't come up with a good test without making it really fragile. I think that's okay (avoiding the test, that is, not making it fragile).

preetapan (Contributor) left a comment:

LGTM, thanks for writing down all the details in the issue description.

banks (Member Author) commented Jul 25, 2018:

@mitchellh The tests are committed and pass locally - Travis is having a lot of trouble with this and seems to be failing on different transient issues despite me hitting retry a few times. They all pass locally though.

@pearkes pearkes added this to the 1.2.2 milestone Jul 25, 2018
@banks banks force-pushed the connect-disabled-cpu-burn branch from 09d6f33 to 31c0855 Compare July 25, 2018 19:17
@banks banks merged commit 8cbeb29 into master Jul 25, 2018
@banks banks deleted the connect-disabled-cpu-burn branch July 25, 2018 19:26