
Fixes #4421: General solution to stop blocking queries with index 0 #4437

Merged: 7 commits from connect-disabled-cpu-burn into master, Jul 25, 2018

Conversation

banks (Member) commented Jul 24, 2018

When developing Connect we hit a few edge cases where blocking queries would return an X-Consul-Index of 0 along with a blank result.

The root cause is that some of our state functions (for example, fetching the CARoots) have no state to load at all in some cases, so they have no index value to read and end up returning meta.Index = 0.

In general, our blocking query code assumes that no well-behaved blocking query ever behaves this way, and in some places where it's possible (e.g. KV fetches for non-existent keys) we have code to explicitly account for it. Some places account for it as above in the RPC endpoint code, some account for it in the actual state store transactions, but many still don't, relying on semantics or on the plain unlikeliness of a given situation to never hit it.

During Connect development we discussed fixing this once and for all within the Server.blockingQuery helper that all blocking queries use. We hypothesised this fix and were relatively sure it couldn't cause any damage, since it addresses a fundamental assumption of the blocking query system and the only change in behaviour seems to be that it prevents a tight busy loop that eats 100% CPU. But ultimately we didn't want to take the risk of the more general change that would affect non-Connect endpoints.

Our solution, which we thought covered all Connect APIs, was to add the logic of using at least 1 as the index in the agent's cache fetcher for Connect data.

However, #4421 is a subtle edge case we missed even in Connect endpoints.

The cause there is that the CA RPC response returns an empty result with a 0 index, and the client dutifully upgrades that index to 1 for the next request. But the subsequent blocking request still gets a 0 index back from state.CARoots, because there is still no CA config with Connect disabled. The blockingQuery helper used to check that the returned meta index was at least 1 before blocking, as a proxy for "should blocking be enabled", so it never blocked in this case and caused the client agent to consume 100% CPU again.

General fix

The actual fix here is just to switch the zero check from queryMeta.Index > 0 (the value returned from the state method) to queryOpts.MinQueryIndex > 0 (the value the client requested). As far as I understand, this is better semantics, since we already explicitly check for a nil err from the method (a sketch of the resulting behaviour follows the rationale list below).

My rationale is:

  • we should block if the client requested it and the method's index is lower, even if the state method returns 0
  • yes, it's arguably buggy for a state method to ever return a 0 index with a nil err, but it happens and probably will again the next time we add state methods or use existing ones in new ways where it starts to matter
  • it's never correct for a non-error response to have a 0 index, since it will cause clients to busy-loop without blocking
  • it's therefore reasonable to ensure systematically that no (non-error) RPC response is ever sent with a 0 index, and that even if the state method responds with a zero index, we still block if requested to until some state is written to deliver back
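
A minimal, runnable sketch of both halves of this, using simplified stand-ins for Consul's QueryOptions and QueryMeta; the real Server.blockingQuery helper wires this into watch-set and timeout plumbing, so treat the names below as illustrative rather than the PR's literal diff:

package main

import "fmt"

// Simplified stand-ins for the real structs.QueryOptions / structs.QueryMeta.
type QueryOptions struct {
	MinQueryIndex uint64 // index the client asked to block past; 0 means no blocking
}

type QueryMeta struct {
	Index uint64 // index the state method reported alongside the result
}

// shouldBlock captures the core of the fix: the decision to block keys off
// what the client requested, not off the index the state method returned.
func shouldBlock(opts QueryOptions, meta QueryMeta) bool {
	// Old check: meta.Index > 0 && meta.Index <= opts.MinQueryIndex
	// A state method with nothing to load returns Index == 0, so the old
	// check never blocked and the client busy-looped at 100% CPU.
	return opts.MinQueryIndex > 0 && meta.Index <= opts.MinQueryIndex
}

// finalizeIndex is the systematic guarantee from the last bullet: never send
// a non-error response with a 0 index, since clients treat 0 as "do not block".
func finalizeIndex(meta *QueryMeta) {
	if meta.Index < 1 {
		meta.Index = 1
	}
}

func main() {
	// The #4421 case: the client upgraded a 0 index to 1 and sent it back,
	// but state.CARoots still reports Index == 0 with Connect disabled.
	opts := QueryOptions{MinQueryIndex: 1}
	meta := QueryMeta{Index: 0}
	fmt.Println(shouldBlock(opts, meta)) // true: we now block instead of spinning
	finalizeIndex(&meta)
	fmt.Println(meta.Index) // 1: the response never carries a 0 index
}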

Tests

This is actually pretty easy to test and I think the tests added are sufficient. I was expecting to break a bunch of other tests in the repo that expected zero indexes just because that's what they used to get, but all tests pass locally, including the whole watch package and all blocking-related stuff!

UX Error message

The original poster of #4421 didn't find the message about 'CA is not bootstrapped' clear enough. By adding an explicit enabled check to the CARoots RPC, the message is much improved:

$ curl -s http://127.0.0.1:8500/v1/agent/connect/ca/leaf/foo
Connect must be enabled in order to use this endpoint

@mitchellh @kyhavlov the Roots RPC was the only one that didn't have this check already. I wonder if that was intentional for some reason; a few tests used that fact to force empty results, but with no rationale for why that behaviour was desirable. I can't think of a good reason, though, to return an empty set of roots as if it were a valid request when Connect is explicitly disabled. Returning empty if the CA is not bootstrapped yet is important for races during bootstrap, especially in dev mode, but if Connect is disabled it seems more helpful to be explicit about that.

@banks banks requested review from mitchellh, preetapan and mkeeler July 24, 2018 14:52
@@ -108,7 +108,8 @@ func (c *ConnectCALeaf) Fetch(opts cache.FetchOptions, req cache.Request) (cache
 		return result, errors.New("invalid RootCA response type")
 	}
 	if roots.TrustDomain == "" {
-		return result, errors.New("cluster has no CA bootstrapped")
+		return result, errors.New("cluster has no CA bootstrapped: ensure Connect" +
+			" is enabled in server config")
banks (Member Author):

This is my attempt to make it more obvious to users what the likely cause is. We can't be more explicit, since it's possible to hit this just due to timing before the CA bootstrap happens, especially in dev setups where CA bootstrap happens on every server restart and proxies might already be hitting the APIs before it does.

-func (c *Cache) entryKey(r *RequestInfo) string {
-	return fmt.Sprintf("%s/%s/%s", r.Datacenter, r.Token, r.Key)
+func (c *Cache) entryKey(t string, r *RequestInfo) string {
+	return fmt.Sprintf("%s/%s/%s/%s", t, r.Datacenter, r.Token, r.Key)
banks (Member Author):

This is the fix for another hypothetical bug I found along the way. It was well discussed with Mitchell on Slack, and the test case below highlights why it might be needed one day, but it can be ignored for the purposes of reviewing the blockingQuery fix. A hedged illustration of the collision the new type segment prevents follows.
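
A self-contained illustration of that hypothetical collision (the type names and request values here are invented for the example; the real keys are built inside the agent cache):

package main

import "fmt"

// RequestInfo mirrors the fields used by the cache's entryKey.
type RequestInfo struct {
	Datacenter, Token, Key string
}

// oldKey ignores the cache type, so two different cache types fetching with
// identical request info share one entry and could serve each other's results.
func oldKey(r RequestInfo) string {
	return fmt.Sprintf("%s/%s/%s", r.Datacenter, r.Token, r.Key)
}

// newKey prefixes the cache type t as the first, coarsest segment.
func newKey(t string, r RequestInfo) string {
	return fmt.Sprintf("%s/%s/%s/%s", t, r.Datacenter, r.Token, r.Key)
}

func main() {
	r := RequestInfo{Datacenter: "dc1", Token: "token123", Key: "web"}
	fmt.Println(oldKey(r))           // dc1/token123/web (same key regardless of type)
	fmt.Println(newKey("type-a", r)) // type-a/dc1/token123/web
	fmt.Println(newKey("type-b", r)) // type-b/dc1/token123/web (no collision now)
}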

mitchellh (Contributor):

Total nitpick but should the type be the prefix here? We don't do any cache key searching or lookups or anything fancy but least specificity => most usually has benefits. Super hypothetical but feels better.

banks (Member Author) replied Jul 24, 2018:

Hmm, the type is the prefix here. It seemed to me that type is the most coarse-grained partitioning we'd want (cache results from different operation types are entirely separate data sets, whereas splits by DC, token, or key are finer grained).

So I'd agree with your high-level point, but either I'm not understanding why you think the cache type is less specific than the other elements, or there is confusion caused by the diff rendering highlighting the last /%s when really the new t value is being added as the first segment.

mitchellh (Contributor):

Oh right, the git diff just threw me off. Hah! Ignore me.

@@ -750,7 +798,8 @@ func TestCacheGet_partitionToken(t *testing.T) {
 // testPartitionType implements Type for testing that simply returns a value
 // comprised of the request DC and ACL token, used for testing cache
 // partitioning.
-type testPartitionType struct{}
+type testPartitionType struct {
+}
banks (Member Author):

Huh, I thought go fmt would clean that up, but it didn't...

@@ -108,7 +108,7 @@ func (c *ConnectCALeaf) Fetch(opts cache.FetchOptions, req cache.Request) (cache
 		return result, errors.New("invalid RootCA response type")
 	}
 	if roots.TrustDomain == "" {
-		return result, errors.New("cluster has no CA bootstrapped")
+		return result, errors.New("cluster has no CA bootstrapped yet")
banks (Member Author):

This used to cover both the case where Connect was enabled but the request raced with initial CA bootstrap, and the case where Connect is not enabled at all. Now that I've added an explicit, more descriptive error for not-enabled, the only way (tm) this can be hit is during a race, so I think adding "yet" helps to hint that the request was just a bit early and not that anything is misconfigured.

banks (Member Author) commented Jul 24, 2018:

The last change broke a few tests that explicitly expected the empty-CA-when-Connect-disabled behaviour. I'm fixing them, since I can't think of a useful reason for that behaviour, as mentioned.

mitchellh (Contributor) left a comment:

Looks great! Really minor changes requested.

The core logic in blockingQuery looks safe to me. Checking the client-sent min index is so obviously the right thing to do that it's a 🤦‍♂️ moment.


// Exit early if Connect hasn't been enabled.
if !s.srv.config.ConnectEnabled {
	return ErrConnectNotEnabled
}
mitchellh (Contributor):

This seems important enough that it should have a unit test, also easy to test.

banks (Member Author):

It does - they are all the ones that are broken right now as they asserted the old behaviour! Will be fixed soon.
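
For context, a hedged sketch of the shape such a test can take; the harness helpers and imports follow Consul's existing server test conventions, but treat the exact names and signatures here as assumptions rather than the PR's actual test code:

package consul

import (
	"os"
	"strings"
	"testing"

	"github.com/hashicorp/consul/agent/structs"
	msgpackrpc "github.com/hashicorp/net-rpc-msgpackrpc"
)

func TestConnectCARoots_disabled(t *testing.T) {
	t.Parallel()

	// Start a test server with Connect explicitly disabled; testServerWithConfig
	// and rpcClient are assumed from the existing test harness.
	dir, srv := testServerWithConfig(t, func(c *Config) {
		c.ConnectEnabled = false
	})
	defer os.RemoveAll(dir)
	defer srv.Shutdown()

	codec := rpcClient(t, srv)
	defer codec.Close()

	// The RPC should now fail loudly instead of returning an empty result
	// with a 0 index.
	args := structs.DCSpecificRequest{Datacenter: "dc1"}
	var reply structs.IndexedCARoots
	err := msgpackrpc.CallWithCodec(codec, "ConnectCA.Roots", &args, &reply)
	if err == nil || !strings.Contains(err.Error(), "Connect must be enabled") {
		t.Fatalf("expected ErrConnectNotEnabled, got %v", err)
	}
}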

banks (Member Author):

An interesting side effect of the incoming commit converting all the empty-response tests into expecting errors: there is no easy way to test that empty response any more.

It's still legitimate during a race to bootstrap the cluster, but we previously relied on disabling Connect (and it not erroring) as a side effect to provoke that case.

I think it's OK, and I'd rather not spend a lot of effort constructing a mechanism to have Connect enabled but arbitrarily delay the actual bootstrapping just so we can get an empty response, especially since we know from before this PR that it works. It feels odd that that expected mode isn't covered by a test any more, though...

Contributor:

I ran into this same thing and couldn't come up with a good test without making it really fragile. I think that's okay (avoiding the test, that is, not making it fragile).

preetapan (Contributor) left a comment:

LGTM, thanks for writing down all the details in the issue description.

banks (Member Author) commented Jul 25, 2018:

@mitchellh The tests are committed and pass locally - Travis is having a lot of trouble with this and seems to be failing on different transient issues despite me hitting retry a few times. They all pass locally though.

@pearkes pearkes added this to the 1.2.2 milestone Jul 25, 2018
@banks banks force-pushed the connect-disabled-cpu-burn branch from 09d6f33 to 31c0855 Compare July 25, 2018 19:17
@banks banks merged commit 8cbeb29 into master Jul 25, 2018
@banks banks deleted the connect-disabled-cpu-burn branch July 25, 2018 19:26